path: root/kernel
Commit message (author, date; files changed, lines -/+)

* printk: syslog: close window between wait and read (John Ogness, 2021-07-26; 1 file, -19/+36)

    Syslog's SYSLOG_ACTION_READ is supposed to block until the next syslog record can be
    read, and then it should read that record. However, because @syslog_lock is not held
    between waking up and reading the record, another reader could read the record first,
    thus causing SYSLOG_ACTION_READ to return with a value of 0, never having read
    _anything_.

    By holding @syslog_lock between waking up and reading, it can be guaranteed that
    SYSLOG_ACTION_READ blocks until it successfully reads a syslog record (or a real
    error occurs).

    Signed-off-by: John Ogness <john.ogness@linutronix.de>
    Reviewed-by: Petr Mladek <pmladek@suse.com>
    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Link: https://lore.kernel.org/r/20210715193359.25946-7-john.ogness@linutronix.de

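As a rough illustration of the pattern this commit restores, here is a minimal userspace sketch (pthreads, invented names, not the kernel code): the reader still holds the lock it was woken under while it consumes the record, so no other reader can slip in between wakeup and read.

      #include <pthread.h>
      #include <stdio.h>

      /* A single "record" slot stands in for the syslog ring buffer. */
      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
      static int record = -1;                     /* -1 means "no record available" */

      static int read_record(void)                /* blocks until a record is read */
      {
          pthread_mutex_lock(&lock);
          while (record < 0)
              pthread_cond_wait(&cond, &lock);    /* lock is re-acquired on wakeup */
          int r = record;                         /* read while still holding the lock */
          record = -1;
          pthread_mutex_unlock(&lock);
          return r;
      }

      static void *producer(void *arg)
      {
          (void)arg;
          pthread_mutex_lock(&lock);
          record = 42;
          pthread_cond_broadcast(&cond);          /* wake all waiting readers */
          pthread_mutex_unlock(&lock);
          return NULL;
      }

      int main(void)
      {
          pthread_t t;
          pthread_create(&t, NULL, producer, NULL);
          printf("got record %d\n", read_record());
          pthread_join(t, NULL);
          return 0;
      }
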
* printk: convert @syslog_lock to mutex (John Ogness, 2021-07-26; 1 file, -29/+20)

    @syslog_lock was a raw_spin_lock to simplify the transition of removing @logbuf_lock
    and the safe buffers. With that transition complete, and since all uses of
    @syslog_lock are within sleepable contexts, @syslog_lock can become a mutex.

    Note that until now register_console() would disable interrupts using irqsave, which
    implies that it may be called with interrupts disabled. And indeed, there is one
    possible call chain on parisc where this happens:

        handle_interruption(code=1)      /* High-priority machine check (HPMC) */
          pdc_console_restart()
            pdc_console_init_force()
              register_console()

    However, register_console() calls console_lock(), which might sleep. So it has never
    been allowed to call register_console() from an atomic context and the above call
    chain is a bug.

    Note that the removal of read_syslog_seq_irq() is slightly changing the behavior of
    SYSLOG_ACTION_READ by testing against a possibly outdated @seq value. However, the
    value of @seq could have changed after the test, so it is not a new window. A
    follow-up commit closes this window.

    Signed-off-by: John Ogness <john.ogness@linutronix.de>
    Reviewed-by: Petr Mladek <pmladek@suse.com>
    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Link: https://lore.kernel.org/r/20210715193359.25946-6-john.ogness@linutronix.de

* printk: remove NMI tracking (John Ogness, 2021-07-26; 3 files, -46/+1)

    All NMI contexts are handled the same as the safe context: store the message and
    defer printing. There is no need to have special NMI context tracking for this.
    Using in_nmi() is enough.

    There are several parts of the kernel that are manually calling into the printk NMI
    context tracking in order to cause general printk deferred printing:

        arch/arm/kernel/smp.c
        arch/powerpc/kexec/crash.c
        kernel/trace/trace.c

    For arm/kernel/smp.c and powerpc/kexec/crash.c, provide a new function pair
    printk_deferred_enter/exit that explicitly achieves the same objective.

    For ftrace, remove the printk context manipulation completely. It was added in
    commit 03fc7f9c99c1 ("printk/nmi: Prevent deadlock when accessing the main log
    buffer in NMI"). The purpose was to enforce storing messages directly into the ring
    buffer even in NMI context. It really should have only modified the behavior in NMI
    context. There is no need for a special behavior any longer. All messages are always
    stored directly now. The console deferring is handled transparently in vprintk().

    Signed-off-by: John Ogness <john.ogness@linutronix.de>
    [pmladek@suse.com: Remove special handling in ftrace.c completely.]
    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Link: https://lore.kernel.org/r/20210715193359.25946-5-john.ogness@linutronix.de

* printk: remove safe buffers (John Ogness, 2021-07-26; 5 files, -428/+48)

    With @logbuf_lock removed, the high level printk functions for storing messages are
    lockless. Messages can be stored from any context, so there is no need for the NMI
    and safe buffers anymore. Remove the NMI and safe buffers.

    Although the safe buffers are removed, the NMI and safe context tracking is still in
    place. In these contexts, store the message immediately but still use irq_work to
    defer the console printing.

    Since printk recursion tracking is in place, safe context tracking for most of printk
    is not needed. Remove it. Only safe context tracking relating to the console and
    console_owner locks is left in place. This is because the console and console_owner
    locks are needed for the actual printing.

    Signed-off-by: John Ogness <john.ogness@linutronix.de>
    Reviewed-by: Petr Mladek <pmladek@suse.com>
    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Link: https://lore.kernel.org/r/20210715193359.25946-4-john.ogness@linutronix.de

* printk: track/limit recursion (John Ogness, 2021-07-26; 1 file, -3/+83)

    Currently the printk safe buffers provide a form of recursion protection by
    redirecting to the safe buffers whenever printk() is recursively called.

    In preparation for removal of the safe buffers, provide an alternate explicit
    recursion protection. Recursion is limited to 3 levels per-CPU and per-context.

    Signed-off-by: John Ogness <john.ogness@linutronix.de>
    Reviewed-by: Petr Mladek <pmladek@suse.com>
    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Link: https://lore.kernel.org/r/20210715193359.25946-3-john.ogness@linutronix.de

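To make the mechanism concrete, here is a hedged userspace sketch of a 3-level recursion limit. The names are invented and a single counter is used; the kernel version described above additionally keeps the counter per CPU and per interrupt context.

      #include <stdio.h>

      #define RECURSION_LIMIT 3

      static int recursion_count;

      static int printk_enter(void)
      {
          if (recursion_count >= RECURSION_LIMIT)
              return 0;                 /* give up instead of recursing forever */
          recursion_count++;
          return 1;
      }

      static void printk_exit(void)
      {
          recursion_count--;
      }

      static void vprintk_sketch(const char *msg, int misbehave)
      {
          if (!printk_enter()) {
              fprintf(stderr, "recursion limit hit, message dropped\n");
              return;
          }
          printf("%s\n", msg);
          if (misbehave)                /* e.g. something in the print path printks again */
              vprintk_sketch(msg, misbehave);
          printk_exit();
      }

      int main(void)
      {
          vprintk_sketch("hello", 1);   /* prints 3 times, then the drop notice */
          return 0;
      }
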
* Merge branch 'akpm' (patches from Andrew) (Linus Torvalds, 2021-06-29; 8 files, -26/+25)

    Merge misc updates from Andrew Morton:
      "191 patches.

       Subsystems affected by this patch series: kthread, ia64, scripts, ntfs, squashfs,
       ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab, slub, kmemleak, dax, debug,
       pagecache, gup, swap, memcg, pagemap, mprotect, bootmem, dma, tracing, vmalloc,
       kasan, initialization, pagealloc, and memory-failure)"

    * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits)
      mm,hwpoison: make get_hwpoison_page() call get_any_page()
      mm,hwpoison: send SIGBUS with error virutal address
      mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
      mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
      mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
      mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
      docs: remove description of DISCONTIGMEM
      arch, mm: remove stale mentions of DISCONIGMEM
      mm: remove CONFIG_DISCONTIGMEM
      m68k: remove support for DISCONTIGMEM
      arc: remove support for DISCONTIGMEM
      arc: update comment about HIGHMEM implementation
      alpha: remove DISCONTIGMEM and NUMA
      mm/page_alloc: move free_the_page
      mm/page_alloc: fix counting of managed_pages
      mm/page_alloc: improve memmap_pages dbg msg
      mm: drop SECTION_SHIFT in code comments
      mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
      mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
      mm/page_alloc: scale the number of pages that are batch freed
      ...

| * mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM (Mike Rapoport, 2021-06-29; 1 file, -1/+1)

    After removal of the DISCONTIGMEM memory model the FLAT_NODE_MEM_MAP configuration
    option is equivalent to FLATMEM. Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM
    instead.

    Link: https://lkml.kernel.org/r/20210608091316.3622-10-rppt@kernel.org
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Acked-by: Arnd Bergmann <arnd@arndb.de>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matt Turner <mattst88@gmail.com>
    Cc: Richard Henderson <rth@twiddle.net>
    Cc: Vineet Gupta <vgupta@synopsys.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

| * mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA (Mike Rapoport, 2021-06-29; 1 file, -1/+1)

    After removal of DISCONTIGMEM the NEED_MULTIPLE_NODES and NUMA configuration options
    are equivalent. Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.

    Done with

        $ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
              $(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
        $ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
              $(git grep -wl NEED_MULTIPLE_NODES)

    with manual tweaks afterwards.

    [rppt@linux.ibm.com: fix arm boot crash]
      Link: https://lkml.kernel.org/r/YMj9vHhHOiCVN4BF@linux.ibm.com
    Link: https://lkml.kernel.org/r/20210608091316.3622-9-rppt@kernel.org
    Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
    Acked-by: Arnd Bergmann <arnd@arndb.de>
    Acked-by: David Hildenbrand <david@redhat.com>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Matt Turner <mattst88@gmail.com>
    Cc: Richard Henderson <rth@twiddle.net>
    Cc: Vineet Gupta <vgupta@synopsys.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

| * mm/page_alloc: introduce vm.percpu_pagelist_high_fraction (Mel Gorman, 2021-06-29; 1 file, -0/+8)

    This introduces a new sysctl vm.percpu_pagelist_high_fraction. It is similar to the
    old vm.percpu_pagelist_fraction. The old sysctl increased both pcp->batch and
    pcp->high with the higher pcp->high potentially reducing zone->lock contention.
    However, the higher pcp->batch value also potentially increased allocation latency
    while the PCP was refilled. This sysctl only adjusts pcp->high so that zone->lock
    contention is potentially reduced but allocation latency during a PCP refill remains
    the same.

        # grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  649
              batch: 63

        # sysctl vm.percpu_pagelist_high_fraction=8
        # grep -E "high:|batch" /proc/zoneinfo | tail -2
              high:  35071
              batch: 63

        # sysctl vm.percpu_pagelist_high_fraction=64
              high:  4383
              batch: 63

        # sysctl vm.percpu_pagelist_high_fraction=0
              high:  649
              batch: 63

    [mgorman@techsingularity.net: fix documentation]
      Link: https://lkml.kernel.org/r/20210528151010.GQ30378@techsingularity.net
    Link: https://lkml.kernel.org/r/20210525080119.5455-7-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

| * mm/page_alloc: delete vm.percpu_pagelist_fraction (Mel Gorman, 2021-06-29; 1 file, -8/+0)

    Patch series "Calculate pcp->high based on zone sizes and active CPUs", v2.

    The per-cpu page allocator (PCP) is meant to reduce contention on the zone lock but
    the sizing of batch and high is archaic and takes neither the zone size nor the
    number of CPUs local to a zone into account. With larger zones and more CPUs per
    node, the contention is getting worse. Furthermore, the fact that
    vm.percpu_pagelist_fraction adjusts both batch and high values means that the sysctl
    can reduce zone lock contention but also increase allocation latencies.

    This series disassociates pcp->high from pcp->batch and then scales pcp->high based
    on the size of the local zone with limited impact to reclaim and accounting for
    active CPUs but leaves pcp->batch static. It also adapts the number of pages that
    can be on the pcp list based on recent freeing patterns.

    The motivation is partially to adjust to larger memory sizes but is also driven by
    the fact that large batches of page freeing via release_pages() often show zone
    contention as a major part of the problem. Another is a bug report based on an older
    kernel where a multi-terabyte process can take several minutes to exit. A workaround
    was to use vm.percpu_pagelist_fraction to increase the pcp->high value but testing
    indicated that a production workload could not use the same values because of an
    increase in allocation latencies. Unfortunately, I cannot reproduce this test case
    myself as the multi-terabyte machines are in active use but it should alleviate the
    problem.

    The series aims to address both and partially acts as a pre-requisite. pcp only
    works with order-0 which is useless for SLUB (when using high orders) and THP
    (unconditionally). To store high-order pages on PCP, the pcp->high values need to be
    increased first.

    This patch (of 6):

    The vm.percpu_pagelist_fraction is used to increase the batch and high limits for
    the per-cpu page allocator (PCP). The intent behind the sysctl is to reduce zone
    lock acquisition when allocating/freeing pages but it has a problem. While it can
    decrease contention, it can also increase latency on the allocation side due to
    unreasonably large batch sizes. This leads to games where an administrator adjusts
    percpu_pagelist_fraction on the fly to work around contention and allocation latency
    problems.

    This series aims to alleviate the problems with zone lock contention while avoiding
    the allocation-side latency problems. For the purposes of review, it's easier to
    remove this sysctl now and reintroduce a similar sysctl later in the series that
    deals only with pcp->high.

    Link: https://lkml.kernel.org/r/20210525080119.5455-1-mgorman@techsingularity.net
    Link: https://lkml.kernel.org/r/20210525080119.5455-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

| * kernel/events/uprobes: use vma_lookup() in find_active_uprobe() (Liam Howlett, 2021-06-29; 1 file, -2/+2)

    Use vma_lookup() to find the VMA at a specific address. As vma_lookup() will return
    NULL if the address is not within any VMA, the start address no longer needs to be
    validated.

    Link: https://lkml.kernel.org/r/20210521174745.2219620-17-Liam.Howlett@Oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
    Acked-by: David Hildenbrand <david@redhat.com>
    Acked-by: Davidlohr Bueso <dbueso@suse.de>
    Cc: Geert Uytterhoeven <geert@linux-m68k.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

| * perf: MAP_EXECUTABLE does not indicate VM_MAYEXEC (David Hildenbrand, 2021-06-29; 1 file, -2/+0)

    Patch series "perf/binfmt/mm: remove in-tree usage of MAP_EXECUTABLE".

    Stumbling over the history of MAP_EXECUTABLE, I noticed that we still have some
    in-tree users that we can get rid of.

    This patch (of 3):

    Before commit e9714acf8c43 ("mm: kill vma flag VM_EXECUTABLE and
    mm->num_exe_file_vmas"), VM_EXECUTABLE indicated MAP_EXECUTABLE. MAP_EXECUTABLE is
    nowadays essentially ignored by the kernel and does not relate to VM_MAYEXEC.

    Link: https://lkml.kernel.org/r/20210421093453.6904-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20210421093453.6904-2-david@redhat.com
    Fixes: f972eb63b100 ("perf: Pass protection and flags bits through mmap2 interface")
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Cc: Greg Ungerer <gerg@linux-m68k.org>
    Cc: Mike Rapoport <rppt@kernel.org>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: Kevin Brodsky <Kevin.Brodsky@arm.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Feng Tang <feng.tang@intel.com>
    Cc: Don Zickus <dzickus@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

| * loop: charge i/o to mem and blk cg (Dan Schatzberg, 2021-06-29; 1 file, -0/+1)

    The current code only associates with the existing blkcg when aio is used to access
    the backing file. This patch covers all types of i/o to the backing file and also
    associates the memcg so if the backing file is on tmpfs, memory is charged
    appropriately.

    This patch also exports cgroup_get_e_css and int_active_memcg so it can be used by
    the loop module.

    Link: https://lkml.kernel.org/r/20210610173944.1203706-4-schatzberg.dan@gmail.com
    Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Jens Axboe <axboe@kernel.dk>
    Cc: Chris Down <chris@chrisdown.name>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Ming Lei <ming.lei@redhat.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

| * mm: gup: pack has_pinned in MMF_HAS_PINNED (Andrea Arcangeli, 2021-06-29; 1 file, -1/+0)

    has_pinned 32bit can be packed in the MMF_HAS_PINNED bit as a noop cleanup.

    Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast would
    reintroduce a loss of SMP scalability to pin-fast, so there's no future potential
    usefulness to keep an atomic in the mm for this.

    set_bit(MMF_HAS_PINNED) will be theoretically a bit slower than WRITE_ONCE
    (atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like atomic_set
    after this commit) has to be still issued only once per "mm", so the difference
    between the two will be lost in the noise.

    will-it-scale "mmap2" shows no change in performance with enterprise config as
    expected.

    will-it-scale "pin_fast" retains the > 4000% SMP scalability performance improvement
    against upstream as expected.

    This is a noop as far as overall performance and SMP scalability are concerned.

    [peterx@redhat.com: pack has_pinned in MMF_HAS_PINNED]
      Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s
    [akpm@linux-foundation.org: coding style fixes]
    [peterx@redhat.com: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments]
      Link: https://lkml.kernel.org/r/20210507150553.208763-4-peterx@redhat.com
    Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Reviewed-by: John Hubbard <jhubbard@nvidia.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Kirill Shutemov <kirill@shutemov.name>
    Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

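A loose userspace analogue of packing a boolean into an existing flags word rather than keeping a separate atomic. The struct name and bit index below are invented for illustration; only the helper name mm_set_has_pinned_flag is taken from the commit message, and this is not the kernel code.

      #include <stdatomic.h>
      #include <stdio.h>

      #define MMF_HAS_PINNED_BIT 27          /* hypothetical bit index */

      struct mm_like {
          atomic_ulong flags;                /* existing per-"mm" flag word */
      };

      /* Set the flag once per mm; later setters are harmless no-ops. */
      static void mm_set_has_pinned_flag(struct mm_like *mm)
      {
          atomic_fetch_or(&mm->flags, 1UL << MMF_HAS_PINNED_BIT);
      }

      static int mm_has_pinned(struct mm_like *mm)
      {
          return !!(atomic_load(&mm->flags) & (1UL << MMF_HAS_PINNED_BIT));
      }

      int main(void)
      {
          struct mm_like mm = { .flags = 0 };
          mm_set_has_pinned_flag(&mm);
          printf("has_pinned = %d\n", mm_has_pinned(&mm));
          return 0;
      }
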
| * kernel: watchdog: modify the explanation related to watchdog thread (Wang Qing, 2021-06-29; 1 file, -8/+4)

    The watchdog thread has been replaced by cpu_stop_work, so update the related
    explanation accordingly.

    Link: https://lkml.kernel.org/r/1619687073-24686-2-git-send-email-wangqing@vivo.com
    Signed-off-by: Wang Qing <wangqing@vivo.com>
    Reviewed-by: Petr Mladek <pmladek@suse.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
    Cc: Joe Perches <joe@perches.com>
    Cc: Stephen Kitt <steve@sk2.org>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Randy Dunlap <rdunlap@infradead.org>
    Cc: "Guilherme G. Piccoli" <gpiccoli@canonical.com>
    Cc: Qais Yousef <qais.yousef@arm.com>
    Cc: Santosh Sivaraj <santosh@fossix.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

| * kthread_worker: fix return value when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync() (Petr Mladek, 2021-06-29; 1 file, -7/+12)

    kthread_mod_delayed_work() might race with kthread_cancel_delayed_work_sync() or
    another kthread_mod_delayed_work() call. The function lets the other operation win
    when it sees the work->canceling counter set. And it returns @false.

    But it should return @true as it is done by the related workqueue API, see
    mod_delayed_work_on().

    The reason is that the return value might be used for reference counting. It has to
    distinguish the case when the number of queued works has changed or stayed the same.

    The change is safe. kthread_mod_delayed_work() return value is not checked anywhere
    at the moment.

    Link: https://lore.kernel.org/r/20210521163526.GA17916@redhat.com
    Link: https://lkml.kernel.org/r/20210610133051.15337-4-pmladek@suse.com
    Signed-off-by: Petr Mladek <pmladek@suse.com>
    Reported-by: Oleg Nesterov <oleg@redhat.com>
    Cc: Nathan Chancellor <nathan@kernel.org>
    Cc: Nick Desaulniers <ndesaulniers@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Minchan Kim <minchan@google.com>
    Cc: <jenhaochen@google.com>
    Cc: Martin Liu <liumartin@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

* | Merge tag 'pm-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm (Linus Torvalds, 2021-06-29; 4 files, -13/+13)

    Pull power management updates from Rafael Wysocki:
      "These add hybrid processors support to the intel_pstate driver and make it work
       with more processor models when HWP is disabled, make the intel_idle driver use
       special C6 idle state parameters when package C-states are disabled, add cooling
       support to the tegra30 devfreq driver, rework the TEO (timer events oriented)
       cpuidle governor, extend the OPP (operating performance points) framework to use
       the required-opps DT property in more cases, fix some issues and clean up a
       number of assorted pieces of code.

       Specifics:

        - Make intel_pstate support hybrid processors using abstract performance units
          in the HWP interface (Rafael Wysocki).
        - Add Icelake servers and Cometlake support in no-HWP mode to intel_pstate
          (Giovanni Gherdovich).
        - Make cpufreq_online() error path be consistent with the CPU device removal
          path in cpufreq (Rafael Wysocki).
        - Clean up 3 cpufreq drivers and the statistics code (Hailong Liu, Randy Dunlap,
          Shaokun Zhang).
        - Make intel_idle use special idle state parameters for C6 when package C-states
          are disabled (Chen Yu).
        - Rework the TEO (timer events oriented) cpuidle governor to address some
          theoretical shortcomings in it (Rafael Wysocki).
        - Drop unneeded semicolon from the TEO governor (Wan Jiabing).
        - Modify the runtime PM framework to accept unassigned suspend and resume
          callback pointers (Ulf Hansson).
        - Improve pm_runtime_get_sync() documentation (Krzysztof Kozlowski).
        - Improve device performance states support in the generic power domains (genpd)
          framework (Ulf Hansson).
        - Fix some documentation issues in genpd (Yang Yingliang).
        - Make the operating performance points (OPP) framework use the required-opps DT
          property in use cases that are not related to genpd (Hsin-Yi Wang).
        - Make lazy_link_required_opp_table() use list_del_init instead of
          list_del/INIT_LIST_HEAD (Yang Yingliang).
        - Simplify wake IRQs handling in the core system-wide sleep support code and
          clean up some coding style inconsistencies in it (Tian Tao, Zhen Lei).
        - Add cooling support to the tegra30 devfreq driver and improve its DT bindings
          (Dmitry Osipenko).
        - Fix some assorted issues in the devfreq core and drivers (Chanwoo Choi, Dong
          Aisheng, YueHaibing)"

    * tag 'pm-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (39 commits)
      PM / devfreq: passive: Fix get_target_freq when not using required-opp
      cpufreq: Make cpufreq_online() call driver->offline() on errors
      opp: Allow required-opps to be used for non genpd use cases
      cpuidle: teo: remove unneeded semicolon in teo_select()
      dt-bindings: devfreq: tegra30-actmon: Add cooling-cells
      dt-bindings: devfreq: tegra30-actmon: Convert to schema
      PM / devfreq: userspace: Use DEVICE_ATTR_RW macro
      PM: runtime: Clarify documentation when callbacks are unassigned
      PM: runtime: Allow unassigned ->runtime_suspend|resume callbacks
      PM: runtime: Improve path in rpm_idle() when no callback
      PM: hibernate: remove leading spaces before tabs
      PM: sleep: remove trailing spaces and tabs
      PM: domains: Drop/restore performance state votes for devices at runtime PM
      PM: domains: Return early if perf state is already set for the device
      PM: domains: Split code in dev_pm_genpd_set_performance_state()
      cpuidle: teo: Use kerneldoc documentation in admin-guide
      cpuidle: teo: Rework most recent idle duration values treatment
      cpuidle: teo: Change the main idle state selection logic
      cpuidle: teo: Cosmetic modification of teo_select()
      cpuidle: teo: Cosmetic modifications of teo_update()
      ...

| | \
| | \
| *-. \ Merge branches 'pm-core' and 'pm-sleep' (Rafael J. Wysocki, 2021-06-29; 4 files, -13/+13)

    * pm-core:
      PM: runtime: Clarify documentation when callbacks are unassigned
      PM: runtime: Allow unassigned ->runtime_suspend|resume callbacks
      PM: runtime: Improve path in rpm_idle() when no callback
      PM: runtime: document common mistake with pm_runtime_get_sync()

    * pm-sleep:
      PM: hibernate: remove leading spaces before tabs
      PM: sleep: remove trailing spaces and tabs
      PM: hibernate: fix spelling mistakes
      PM: wakeirq: Set IRQF_NO_AUTOEN when requesting the IRQ

| | | * | PM: hibernate: remove leading spaces before tabs (Zhen Lei, 2021-06-11; 1 file, -1/+1)

    1) Run the following command to find and remove the leading spaces before tabs:

         $ find kernel/power/ -type f | xargs sed -r -i 's/^[ ]+\t/\t/'

    2) Manually check and correct if necessary.

    Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

| | | * | PM: sleep: remove trailing spaces and tabs (Zhen Lei, 2021-06-11; 2 files, -7/+7)

    Run the following command to find and remove the trailing spaces and tabs:

         $ find kernel/power/ -type f | xargs sed -r -i 's/[ \t]+$//'

    Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

| | | * | PM: hibernate: fix spelling mistakes (Zhen Lei, 2021-05-24; 2 files, -5/+5)

    Fix some spelling mistakes in comments:

      corresonds   ==> corresponds
      alocated     ==> allocated
      unitialized  ==> uninitialized
      Deompression ==> Decompression

    Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

* | | | | Merge tag 'timers-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (Linus Torvalds, 2021-06-29; 12 files, -110/+755)

    Pull timer updates from Thomas Gleixner:
      "Time and clocksource/clockevent related updates:

       Core changes:

        - Infrastructure to support per CPU "broadcast" devices for per CPU clockevent
          devices which stop in deep idle states. This allows us to utilize the more
          efficient architected timer on certain ARM SoCs for normal operation instead
          of permanently using the slow to access SoC specific clockevent device.
        - Print the name of the broadcast/wakeup device in /proc/timer_list.
        - Make the clocksource watchdog more robust against delays between reading the
          current active clocksource and the watchdog clocksource. Such delays can be
          caused by NMIs, SMIs and vCPU preemption. Handle this by reading the watchdog
          clocksource twice, i.e. before and after reading the current active
          clocksource. In case the two watchdog reads show an excessive time delta, the
          read sequence is repeated up to 3 times.
        - Improve the debug output and add a test module for the watchdog mechanism.
        - Reimplementation of the venerable time64_to_tm() function with a faster and
          significantly smaller version. Straight from the source, i.e. the author of
          the related research paper contributed this!

       Driver changes:

        - No new drivers, not even new device tree bindings!
        - Fixes, improvements and cleanups all over the place"

    * tag 'timers-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
      time/kunit: Add missing MODULE_LICENSE()
      time: Improve performance of time64_to_tm()
      clockevents: Use list_move() instead of list_del()/list_add()
      clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
      clocksource: Provide kernel module to test clocksource watchdog
      clocksource: Reduce clocksource-skew threshold
      clocksource: Limit number of CPUs checked for clock synchronization
      clocksource: Check per-CPU clock synchronization when marked unstable
      clocksource: Retry clock read if long delays detected
      clockevents: Add missing parameter documentation
      clocksource/drivers/timer-ti-dm: Drop unnecessary restore
      clocksource/arm_arch_timer: Improve Allwinner A64 timer workaround
      clocksource/drivers/arm_global_timer: Remove duplicated argument in arm_global_timer
      clocksource/drivers/arm_global_timer: Make symbol 'gt_clk_rate_change_nb' static
      arm: zynq: don't disable CONFIG_ARM_GLOBAL_TIMER due to CONFIG_CPU_FREQ anymore
      clocksource/drivers/arm_global_timer: Implement rate compensation whenever source clock changes
      clocksource/drivers/ingenic: Rename unreasonable array names
      clocksource/drivers/timer-ti-dm: Save and restore timer TIOCP_CFG
      clocksource/drivers/mediatek: Ack and disable interrupts on suspend
      clocksource/drivers/samsung_pwm: Constify source IO memory
      ...

| * | | | | time/kunit: Add missing MODULE_LICENSE() (Thomas Gleixner, 2021-06-28; 1 file, -0/+1)

    [ mingo: MODULE_LICENSE() takes a string. ]

    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

| * | | | | time: Improve performance of time64_to_tm() (Cassio Neri, 2021-06-24; 4 files, -58/+178)

    The current implementation of time64_to_tm() contains unnecessary loops, branches
    and look-up tables. The new one uses an arithmetic-based algorithm that appeared in
    [1] and is approximately 3x faster (YMMV).

    The drawback is that the new code isn't intuitive and contains many 'magic numbers'
    (not unusual for this type of algorithm). However, [1] justifies all those numbers
    and, given this function's history, the code is unlikely to need much maintenance,
    if any at all.

    Add a KUnit test for it which checks every day in a 160,000-year interval centered
    at 1970-01-01 against the expected result.

    [1] Neri, Schneider, "Euclidean Affine Functions and Applications to Calendar
        Algorithms". https://arxiv.org/abs/2102.06959

    Signed-off-by: Cassio Neri <cassio.neri@gmail.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20210622213616.313046-1-cassio.neri@gmail.com

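For readers curious what a purely arithmetic date conversion looks like, here is a compilable sketch. It follows Howard Hinnant's well-known civil_from_days derivation rather than the exact constants of the Neri/Schneider paper the commit cites, but it is the same family of branch-free integer arithmetic.

      #include <stdio.h>
      #include <stdint.h>

      /* Days since 1970-01-01 to a proleptic Gregorian date, integer math only. */
      static void civil_from_days(int64_t z, int64_t *y, unsigned *m, unsigned *d)
      {
          z += 719468;
          const int64_t era  = (z >= 0 ? z : z - 146096) / 146097;
          const unsigned doe = (unsigned)(z - era * 146097);                    /* [0, 146096] */
          const unsigned yoe = (doe - doe/1460 + doe/36524 - doe/146096) / 365; /* [0, 399]    */
          const int64_t  yr  = (int64_t)yoe + era * 400;
          const unsigned doy = doe - (365*yoe + yoe/4 - yoe/100);               /* [0, 365]    */
          const unsigned mp  = (5*doy + 2) / 153;                               /* [0, 11]     */
          *d = doy - (153*mp + 2)/5 + 1;                                        /* [1, 31]     */
          *m = mp < 10 ? mp + 3 : mp - 9;                                       /* [1, 12]     */
          *y = yr + (*m <= 2);
      }

      int main(void)
      {
          int64_t y; unsigned m, d;
          civil_from_days(19000, &y, &m, &d);               /* 19000 days after the epoch */
          printf("%lld-%02u-%02u\n", (long long)y, m, d);   /* prints 2022-01-08 */
          return 0;
      }
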
| * | | | | clockevents: Use list_move() instead of list_del()/list_add() (Baokun Li, 2021-06-22; 1 file, -4/+2)

    Simplify the code.

    Reported-by: Hulk Robot <hulkci@huawei.com>
    Signed-off-by: Baokun Li <libaokun1@huawei.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20210609070242.1322450-1-libaokun1@huawei.com

| * | | | | clocksource: Print deviation in nanoseconds when a clocksource becomes unstable (Feng Tang, 2021-06-22; 1 file, -4/+4)

    Currently when an unstable clocksource is detected, the raw counters of that
    clocksource and watchdog will be printed, which can only be understood after some
    math calculation. So print the delta in nanoseconds as well to make it easier for
    humans to check the results.

    [ paulmck: Fix typo. ]

    Signed-off-by: Feng Tang <feng.tang@intel.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20210527190124.440372-6-paulmck@kernel.org

| * | | | | clocksource: Provide kernel module to test clocksource watchdog (Paul E. McKenney, 2021-06-22; 3 files, -2/+207)

    When the clocksource watchdog marks a clock as unstable, this might be due to that
    clock being unstable or it might be due to delays that happen to occur between the
    reads of the two clocks. It would be good to have a way of testing the clocksource
    watchdog's ability to distinguish between these two causes of clock skew and
    instability.

    Therefore, provide a new clocksource-wdtest module selected by a new
    TEST_CLOCKSOURCE_WATCHDOG Kconfig option. This module has a single module parameter
    named "holdoff" that provides the number of seconds of delay before testing should
    start, which defaults to zero when built as a module and to 10 seconds when built
    directly into the kernel. Very large systems that boot slowly may need to increase
    the value of this module parameter.

    This module uses hand-crafted clocksource structures to do its testing, thus
    avoiding messing up timing for the rest of the kernel and for user applications.
    This module first verifies that the ->uncertainty_margin fields of the clocksource
    structures are set sanely. It then tests the delay-detection capability of the
    clocksource watchdog, increasing the number of consecutive delays injected, first
    provoking console messages complaining about the delays and finally forcing a
    clock-skew event. Unexpected test results cause at least one WARN_ON_ONCE() console
    splat. If there are no splats, the test has passed. Finally, it fuzzes the value
    returned from a clocksource to test the clocksource watchdog's ability to detect
    time skew.

    This module checks the state of its clocksource after each test, and uses
    WARN_ON_ONCE() to emit a console splat if there are any failures. This should
    enable all types of test frameworks to detect any such failures.

    This facility is intended for diagnostic use only, and should be avoided on
    production systems.

    Reported-by: Chris Mason <clm@fb.com>
    Suggested-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Tested-by: Feng Tang <feng.tang@intel.com>
    Link: https://lore.kernel.org/r/20210527190124.440372-5-paulmck@kernel.org

| * | | | | clocksource: Reduce clocksource-skew threshold (Paul E. McKenney, 2021-06-22; 2 files, -17/+46)

    Currently, WATCHDOG_THRESHOLD is set to detect a 62.5-millisecond skew in a
    500-millisecond WATCHDOG_INTERVAL. This requires that clocks be skewed by more than
    12.5% in order to be marked unstable. Except that a clock that is skewed by that
    much is probably destroying unsuspecting software right and left. And given that
    there are now checks for false-positive skews due to delays between reading the two
    clocks, it should be possible to greatly decrease WATCHDOG_THRESHOLD, at least for
    fine-grained clocks such as TSC.

    Therefore, add a new uncertainty_margin field to the clocksource structure that
    contains the maximum uncertainty in nanoseconds for the corresponding clock. This
    field may be initialized manually, as it is for clocksource_tsc_early and
    clocksource_jiffies, which is copied to refined_jiffies. If the field is not
    initialized manually, it will be computed at clock-registry time as the period of
    the clock in question based on the scale and freq parameters to
    __clocksource_update_freq_scale() function. If either of those two parameters is
    zero, the tens-of-milliseconds WATCHDOG_THRESHOLD is used as a cowardly alternative
    to dividing by zero. No matter how the uncertainty_margin field is calculated, it is
    bounded below by twice WATCHDOG_MAX_SKEW, that is, by 100 microseconds.

    Note that manually initialized uncertainty_margin fields are not adjusted, but there
    is a WARN_ON_ONCE() that triggers if any such field is less than twice
    WATCHDOG_MAX_SKEW. This WARN_ON_ONCE() is intended to discourage production use of
    the one-nanosecond uncertainty_margin values that are used to test the clock-skew
    code itself.

    The actual clock-skew check uses the sum of the uncertainty_margin fields of the two
    clocksource structures being compared. Integer overflow is avoided because the
    largest computed value of the uncertainty_margin fields is one billion (10^9), and
    double that value fits into an unsigned int. However, if someone manually specifies
    (say) UINT_MAX, they will get what they deserve.

    Note that the refined_jiffies uncertainty_margin field is initialized to TICK_NSEC,
    which means that skew checks involving this clocksource will be sufficiently
    forgiving. In a similar vein, the clocksource_tsc_early uncertainty_margin field is
    initialized to 32*NSEC_PER_MSEC, which replicates the current behavior and allows
    custom setting if needed in order to address the rare skews detected for this
    clocksource in current mainline.

    Suggested-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Feng Tang <feng.tang@intel.com>
    Link: https://lore.kernel.org/r/20210527190124.440372-4-paulmck@kernel.org

| * | | | | clocksource: Limit number of CPUs checked for clock synchronization (Paul E. McKenney, 2021-06-22; 1 file, -2/+72)

    Currently, if skew is detected on a clock marked CLOCK_SOURCE_VERIFY_PERCPU, that
    clock is checked on all CPUs. This is thorough, but might not be what you want on a
    system with a few tens of CPUs, let alone a few hundred of them.

    Therefore, by default check only up to eight randomly chosen CPUs. Also provide a
    new clocksource.verify_n_cpus kernel boot parameter. A value of -1 says to check all
    of the CPUs, and a non-negative value says to randomly select that number of CPUs,
    without concern about selecting the same CPU multiple times. However, make use of a
    cpumask so that a given CPU will be checked at most once.

    Suggested-by: Thomas Gleixner <tglx@linutronix.de> # For verify_n_cpus=1.
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Feng Tang <feng.tang@intel.com>
    Link: https://lore.kernel.org/r/20210527190124.440372-3-paulmck@kernel.org

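A toy userspace sketch of the selection idea (invented names, capped at 64 CPUs): random picks are recorded in a bitmask, so a CPU that happens to be drawn twice is still only verified once.

      #include <stdio.h>
      #include <stdlib.h>
      #include <stdint.h>
      #include <time.h>

      /* Pick up to n random CPUs out of nr_cpus to verify; n < 0 means "all". */
      static uint64_t pick_cpus_to_verify(int nr_cpus, int n)
      {
          uint64_t mask = 0;

          if (n < 0)
              return (nr_cpus >= 64) ? ~0ULL : (1ULL << nr_cpus) - 1;

          for (int i = 0; i < n; i++)
              mask |= 1ULL << (rand() % nr_cpus);   /* duplicates collapse in the mask */
          return mask;
      }

      int main(void)
      {
          srand((unsigned)time(NULL));
          uint64_t mask = pick_cpus_to_verify(32, 8);

          printf("verifying CPUs:");
          for (int cpu = 0; cpu < 32; cpu++)
              if (mask & (1ULL << cpu))
                  printf(" %d", cpu);
          printf("\n");
          return 0;
      }
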
| * | | | | clocksource: Check per-CPU clock synchronization when marked unstable (Paul E. McKenney, 2021-06-22; 1 file, -0/+60)

    Some sorts of per-CPU clock sources have a history of going out of synchronization
    with each other. However, this problem has purportedly been solved in the past ten
    years. Except that it is all too possible that the problem has instead simply been
    made less likely, which might mean that some of the occasional "Marking clocksource
    'tsc' as unstable" messages might be due to desynchronization. How would anyone
    know?

    Therefore apply CPU-to-CPU synchronization checking to newly unstable clocksources
    that are marked with the new CLOCK_SOURCE_VERIFY_PERCPU flag. Lists of
    desynchronized CPUs are printed, with the caveat that if it is the reporting CPU
    that is itself desynchronized, it will appear that all the other clocks are wrong.
    Just like in real life.

    Reported-by: Chris Mason <clm@fb.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Feng Tang <feng.tang@intel.com>
    Link: https://lore.kernel.org/r/20210527190124.440372-2-paulmck@kernel.org

| * | | | | clocksource: Retry clock read if long delays detected (Paul E. McKenney, 2021-06-22; 1 file, -6/+47)

    When the clocksource watchdog marks a clock as unstable, this might be due to that
    clock being unstable or it might be due to delays that happen to occur between the
    reads of the two clocks. Yes, interrupts are disabled across those two reads, but
    there is no shortage of things that can delay interrupts-disabled regions of code,
    ranging from SMI handlers to vCPU preemption. It would be good to have some
    indication as to why the clock was marked unstable.

    Therefore, re-read the watchdog clock on either side of the read from the clock
    under test. If the watchdog clock shows an excessive time delta between its pair of
    reads, the reads are retried.

    The maximum number of retries is specified by a new kernel boot parameter
    clocksource.max_cswd_read_retries, which defaults to three, that is, up to four
    reads, one initial and up to three retries. If more than one retry was required, a
    message is printed on the console (the occasional single retry is expected behavior,
    especially in guest OSes). If the maximum number of retries is exceeded, the clock
    under test will be marked unstable. However, the probability of this happening due
    to various sorts of delays is quite small. In addition, the reason (clock-read
    delays) for the unstable marking will be apparent.

    Reported-by: Chris Mason <clm@fb.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Feng Tang <feng.tang@intel.com>
    Link: https://lore.kernel.org/r/20210527190124.440372-1-paulmck@kernel.org

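The retry idea can be sketched in userspace as follows (invented names and an assumed threshold; the kernel helper and its limits differ): the watchdog clock is read on both sides of the clock under test, and the sample is retried when the two watchdog reads are too far apart.

      #include <stdio.h>
      #include <stdint.h>
      #include <time.h>

      #define MAX_READ_RETRIES 3
      #define MAX_SKEW_NS      125000          /* assumed threshold for this sketch */

      static int64_t read_ns(clockid_t id)
      {
          struct timespec ts;
          clock_gettime(id, &ts);
          return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
      }

      /* Returns 1 on a clean sample, 0 if every attempt looked delayed. */
      static int read_clock_with_watchdog(int64_t *cs_ns, int64_t *wd_ns)
      {
          for (int attempt = 0; attempt <= MAX_READ_RETRIES; attempt++) {
              int64_t wd_before = read_ns(CLOCK_MONOTONIC);   /* "watchdog" clock */
              *cs_ns            = read_ns(CLOCK_REALTIME);    /* clock under test */
              int64_t wd_after  = read_ns(CLOCK_MONOTONIC);

              if (wd_after - wd_before < MAX_SKEW_NS) {
                  *wd_ns = wd_before;
                  return 1;                    /* sample not distorted by a delay */
              }
          }
          return 0;                            /* persistently delayed: treat as suspect */
      }

      int main(void)
      {
          int64_t cs, wd;
          printf(read_clock_with_watchdog(&cs, &wd) ? "clean read\n" : "delayed reads\n");
          return 0;
      }
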
| * | | | | clockevents: Add missing parameter documentation (Baokun Li, 2021-06-22; 1 file, -0/+1)

    Add the missing documentation for the @cpu parameter of tick_cleanup_dead_cpu().

    Signed-off-by: Baokun Li <libaokun1@huawei.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20210608024305.2750999-1-libaokun1@huawei.com

| * | | | | timer_list: Print name of per-cpu wakeup device (Will Deacon, 2021-05-31; 3 files, -1/+17)

    With the introduction of per-cpu wakeup devices that can be used in preference to
    the broadcast timer, print the name of such devices when they are available.

    Signed-off-by: Will Deacon <will@kernel.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20210524221818.15850-6-will@kernel.org

| * | | | | tick/broadcast: Program wakeup timer when entering idle if required (Will Deacon, 2021-05-31; 1 file, -1/+43)

    When configuring the broadcast timer on entry to and exit from deep idle states,
    prefer a per-CPU wakeup timer if one exists.

    On entry to idle, stop the tick device and transfer the next event into the oneshot
    wakeup device, which will serve as the wakeup from idle. To avoid the overhead of
    additional hardware accesses on exit from idle, leave the timer armed and treat the
    inevitable interrupt as a (possibly spurious) tick event.

    Signed-off-by: Will Deacon <will@kernel.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20210524221818.15850-5-will@kernel.org

| * | | | | tick/broadcast: Prefer per-cpu oneshot wakeup timers to broadcast (Will Deacon, 2021-05-31; 3 files, -4/+61)

    Some SoCs have two per-cpu timer implementations where the timer with the higher
    rating stops in deep idle (i.e. suffers from CLOCK_EVT_FEAT_C3STOP) but is otherwise
    preferable to the timer with the lower rating. In such a design, selecting the
    higher rated devices relies on a global broadcast timer and IPIs to wake up from
    deep idle states.

    To avoid the reliance on a global broadcast timer and also to reduce the overhead
    associated with the IPI wakeups, extend tick_install_broadcast_device() to manage
    per-cpu wakeup timers separately from the broadcast device.

    For now, these timers remain unused.

    Signed-off-by: Will Deacon <will@kernel.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20210524221818.15850-4-will@kernel.org

| * | | | | tick/broadcast: Split __tick_broadcast_oneshot_control() into a helper (Will Deacon, 2021-05-31; 1 file, -13/+20)

    In preparation for adding support for per-cpu wakeup timers, split
    __tick_broadcast_oneshot_control() into a helper function which deals only with the
    broadcast timer management across idle transitions.

    Signed-off-by: Will Deacon <will@kernel.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20210524221818.15850-3-will@kernel.org

| * | | | | tick/broadcast: Drop unneeded CONFIG_GENERIC_CLOCKEVENTS_BROADCAST guard (Will Deacon, 2021-05-31; 1 file, -2/+0)

    tick-broadcast.o is only built if CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y, so remove
    the redundant #ifdef guards around the definition of tick_receive_broadcast().

    Signed-off-by: Will Deacon <will@kernel.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20210524221818.15850-2-will@kernel.org

| * | | | | clockevents: Use DEVICE_ATTR_[RO|WO] macros (YueHaibing, 2021-05-31; 1 file, -8/+8)

    Use the DEVICE_ATTR_[RO|WO] helpers instead of plain DEVICE_ATTR, which makes the
    code a bit shorter and easier to read.

    Signed-off-by: YueHaibing <yuehaibing@huawei.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Link: https://lore.kernel.org/r/20210523065825.19684-1-yuehaibing@huawei.com

* | | | | Merge tag 'irq-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (Linus Torvalds, 2021-06-29; 7 files, -71/+148)

    Pull irq updates from Thomas Gleixner:
      "Updates for the interrupt subsystem:

       Core changes:

        - Cleanup and simplification of common code to invoke the low level interrupt
          flow handlers when this invocation requires irqdomain resolution. Add the
          necessary core infrastructure.
        - Provide a proper interface for modular PMU drivers to set the interrupt
          affinity.
        - Add a request flag which allows to exclude interrupts from spurious interrupt
          detection. Useful especially for IPI handlers which always return IRQ_HANDLED,
          which turns the spurious interrupt detection into a pointless waste of CPU
          cycles.

       Driver changes:

        - Bulk convert interrupt chip drivers to the new irqdomain low level flow
          handler invocation mechanism.
        - Add device tree bindings for the Renesas R-Car M3-W+ SoC.
        - Enable modular build of the Qualcomm PDC driver.
        - The usual small fixes and improvements"

    * tag 'irq-core-2021-06-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
      dt-bindings: interrupt-controller: arm,gic-v3: Describe GICv3 optional properties
      irqchip: gic-pm: Remove redundant error log of clock bulk
      irqchip/sun4i: Remove unnecessary oom message
      irqchip/irq-imx-gpcv2: Remove unnecessary oom message
      irqchip/imgpdc: Remove unnecessary oom message
      irqchip/gic-v3-its: Remove unnecessary oom message
      irqchip/gic-v2m: Remove unnecessary oom message
      irqchip/exynos-combiner: Remove unnecessary oom message
      irqchip: Bulk conversion to generic_handle_domain_irq()
      genirq: Move non-irqdomain handle_domain_irq() handling into ARM's handle_IRQ()
      genirq: Add generic_handle_domain_irq() helper
      irqchip/nvic: Convert from handle_IRQ() to handle_domain_irq()
      irqdesc: Fix __handle_domain_irq() comment
      genirq: Use irq_resolve_mapping() to implement __handle_domain_irq() and co
      irqdomain: Introduce irq_resolve_mapping()
      irqdomain: Protect the linear revmap with RCU
      irqdomain: Cache irq_data instead of a virq number in the revmap
      irqdomain: Use struct_size() helper when allocating irqdomain
      irqdomain: Make normal and nomap irqdomains exclusive
      powerpc: Move the use of irq_domain_add_nomap() behind a config option
      ...

| * \ \ \ \ Merge tag 'irqchip-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core (Thomas Gleixner, 2021-06-28; 28 files, -241/+472)

    Pull irqchip updates from Marc Zyngier:

      - Revamped the irqdomain internals to consistently cache irqdata
      - Expose a new API to simplify IRQ handling involving an irqdomain by not using
        the IRQ number
      - Convert all the irqchip drivers to this new API
      - Allow the Qualcomm PDC driver to be compiled as a module
      - Fix HiSi MBIGEN compile warning when CONFIG_ACPI isn't selected
      - Remove a bunch of spurious printks on error paths
      - The obligatory couple of DT updates

| | * | | | | genirq: Move non-irqdomain handle_domain_irq() handling into ARM's handle_IRQ() (Marc Zyngier, 2021-06-10; 1 file, -22/+8)

    Despite the name, handle_domain_irq() deals with non-irqdomain handling for the sake
    of a handful of legacy ARM platforms. Move such handling into ARM's handle_IRQ(),
    allowing for better code generation for everyone else. This allows us to get rid of
    some complexity, and to rearrange the guards on the various helpers in a more
    logical way.

    Signed-off-by: Marc Zyngier <maz@kernel.org>

| | * | | | | genirq: Add generic_handle_domain_irq() helper (Marc Zyngier, 2021-06-10; 1 file, -1/+18)

    Provide generic_handle_domain_irq() as a pendant to handle_domain_irq() for non-root
    interrupt controllers.

    Signed-off-by: Marc Zyngier <maz@kernel.org>

| | * | | | | genirq: Use irq_resolve_mapping() to implement __handle_domain_irq() and co (Marc Zyngier, 2021-06-10; 1 file, -25/+35)

    In order to start reaping the benefits of irq_resolve_mapping(), start using it in
    __handle_domain_irq() and handle_domain_nmi().

    This involves splitting generic_handle_irq() to be able to directly provide the
    irq_desc.

    Signed-off-by: Marc Zyngier <maz@kernel.org>

| | * | | | | irqdomain: Introduce irq_resolve_mapping() (Marc Zyngier, 2021-06-10; 1 file, -8/+20)

    Rework irq_find_mapping() to return both an irq_desc pointer and, optionally, the
    virq number, and rename the result to __irq_resolve_mapping(). A new helper called
    irq_resolve_mapping() is provided for code that doesn't need the virtual irq number.

    irq_find_mapping() is also rewritten in terms of __irq_resolve_mapping().

    Signed-off-by: Marc Zyngier <maz@kernel.org>

| | * | | | | irqdomain: Protect the linear revmap with RCU (Marc Zyngier, 2021-06-10; 1 file, -26/+23)

    It is pretty odd that the radix tree uses RCU while the linear portion doesn't,
    leading to potential surprises for the users, depending on how the irqdomain has
    been created.

    Fix this by moving the update of the linear revmap under the mutex, and the lookup
    under the RCU read-side lock. The mutex name is updated to reflect that it doesn't
    only cover the radix-tree anymore.

    Signed-off-by: Marc Zyngier <maz@kernel.org>

| | * | | | | irqdomain: Cache irq_data instead of a virq number in the revmap (Marc Zyngier, 2021-06-10; 1 file, -5/+11)

    Caching a virq number in the revmap is pretty inefficient, as it means we will need
    to convert it back to either an irq_data or irq_desc to do anything with it. It is
    also a bit odd, as the radix tree does cache irq_data pointers.

    Change the revmap type to be an irq_data pointer instead of an unsigned int, and
    preserve the current API for now.

    Signed-off-by: Marc Zyngier <maz@kernel.org>

| | * | | | | irqdomain: Use struct_size() helper when allocating irqdomain (Marc Zyngier, 2021-06-10; 1 file, -1/+1)

    Instead of open-coding the size computation of struct irqdomain, use the
    struct_size() helper. This is going to be handy as we change the type of the revmap
    array.

    Signed-off-by: Marc Zyngier <maz@kernel.org>

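For illustration, here is a userspace stand-in for what such a helper computes when a structure ends in a flexible array member (the struct and member names are invented; the kernel's struct_size() additionally saturates on arithmetic overflow, which this sketch does not).

      #include <stdio.h>
      #include <stdlib.h>
      #include <stddef.h>

      struct item { int data; };

      struct revmap {
          unsigned int size;
          struct item *entries[];          /* flexible array member */
      };

      /* header size + count trailing elements, computed in one place */
      #define STRUCT_SIZE(type, member, count) \
          (offsetof(type, member) + sizeof(((type *)0)->member[0]) * (count))

      int main(void)
      {
          unsigned int n = 16;
          struct revmap *rm = calloc(1, STRUCT_SIZE(struct revmap, entries, n));

          if (!rm)
              return 1;
          rm->size = n;
          printf("allocated %zu bytes for %u entries\n",
                 STRUCT_SIZE(struct revmap, entries, n), rm->size);
          free(rm);
          return 0;
      }
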
| | * | | | | irqdomain: Make normal and nomap irqdomains exclusive (Marc Zyngier, 2021-06-10; 1 file, -10/+35)

    Direct mappings are completely exclusive of normal mappings, meaning that we can
    refactor the code slightly so that we can get rid of the revmap_direct_max_irq field
    and use the revmap_size field instead, reducing the size of the irqdomain structure.

    Signed-off-by: Marc Zyngier <maz@kernel.org>

| | * | | | | powerpc: Move the use of irq_domain_add_nomap() behind a config option (Marc Zyngier, 2021-06-10; 2 files, -0/+7)

    Only a handful of old PPC systems are still using the old 'nomap' variant of the
    irqdomain library. Move the associated definitions behind a configuration option,
    which will allow us to make some more radical changes.

    Signed-off-by: Marc Zyngier <maz@kernel.org>

| | * | | | | irqdomain: Reimplement irq_linear_revmap() with irq_find_mapping() (Marc Zyngier, 2021-06-10; 1 file, -3/+3)

    irq_linear_revmap() is supposed to be a fast path for domain lookups, but it only
    exposes low-level details of the irqdomain implementation, details which are better
    kept private.

    The *overhead* between the two is only a function call and a couple of tests, so it
    is likely that no one can show any meaningful difference compared to the cost of
    taking an interrupt.

    Reimplement irq_linear_revmap() with irq_find_mapping() in order to preserve source
    code compatibility, and rename the internal field for good measure.

    Signed-off-by: Marc Zyngier <maz@kernel.org>