summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds2017-05-011-13/+9
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Ingo Molnar: "The main x86 MM changes in this cycle were: - continued native kernel PCID support preparation patches to the TLB flushing code (Andy Lutomirski) - various fixes related to 32-bit compat syscall returning address over 4Gb in applications, launched from 64-bit binaries - motivated by C/R frameworks such as Virtuozzo. (Dmitry Safonov) - continued Intel 5-level paging enablement: in particular the conversion of x86 GUP to the generic GUP code. (Kirill A. Shutemov) - x86/mpx ABI corner case fixes/enhancements (Joerg Roedel) - ... plus misc updates, fixes and cleanups" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits) mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash x86/mm: Fix flush_tlb_page() on Xen x86/mm: Make flush_tlb_mm_range() more predictable x86/mm: Remove flush_tlb() and flush_tlb_current_task() x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly() x86/mm/64: Fix crash in remove_pagetable() Revert "x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation" x86/boot/e820: Remove a redundant self assignment x86/mm: Fix dump pagetables for 4 levels of page tables x86/mpx, selftests: Only check bounds-vs-shadow when we keep shadow x86/mpx: Correctly report do_mpx_bt_fault() failures to user-space Revert "x86/mm/numa: Remove numa_nodemask_from_meminfo()" x86/espfix: Add support for 5-level paging x86/kasan: Extend KASAN to support 5-level paging x86/mm: Add basic defines/helpers for CONFIG_X86_5LEVEL=y x86/paravirt: Add 5-level support to the paravirt code x86/mm: Define virtual memory map for 5-level paging x86/asm: Remove __VIRTUAL_MASK_SHIFT==47 assert x86/boot: Detect 5-level paging support x86/mm/numa: Remove numa_nodemask_from_meminfo() ...
| * mm, zone_device: Replace {get, put}_zone_device_page() with a single ↵Dan Williams2017-05-011-13/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | reference to fix pmem crash The x86 conversion to the generic GUP code included a small change which causes crashes and data corruption in the pmem code - not good. The root cause is that the /dev/pmem driver code implicitly relies on the x86 get_user_pages() implementation doing a get_page() on the page refcount, because get_page() does a get_zone_device_page() which properly refcounts pmem's separate page struct arrays that are not present in the regular page struct structures. (The pmem driver does this because it can cover huge memory areas.) But the x86 conversion to the generic GUP code changed the get_page() to page_cache_get_speculative() which is faster but doesn't do the get_zone_device_page() call the pmem code relies on. One way to solve the regression would be to change the generic GUP code to use get_page(), but that would slow things down a bit and punish other generic-GUP using architectures for an x86-ism they did not care about. (Arguably the pmem driver was probably not working reliably for them: but nvdimm is an Intel feature, so non-x86 exposure is probably still limited.) So restructure the pmem code's interface with the MM instead: get rid of the get/put_zone_device_page() distinction, integrate put_zone_device_page() into __put_page() and and restructure the pmem completion-wait and teardown machinery: Kirill points out that the calls to {get,put}_dev_pagemap() can be removed from the mm fast path if we take a single get_dev_pagemap() reference to signify that the page is alive and use the final put of the page to drop that reference. This does require some care to make sure that any waits for the percpu_ref to drop to zero occur *after* devm_memremap_page_release(), since it now maintains its own elevated reference. This speeds up things while also making the pmem refcounting more robust going forward. Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | Merge branch 'x86-boot-for-linus' of ↵Linus Torvalds2017-05-011-52/+0
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 boot updates from Ingo Molnar: "The biggest changes in this cycle were: - reworking of the e820 code: separate in-kernel and boot-ABI data structures and apply a whole range of cleanups to the kernel side. No change in functionality. - enable KASLR by default: it's used by all major distros and it's out of the experimental stage as well. - ... misc fixes and cleanups" * 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits) x86/KASLR: Fix kexec kernel boot crash when KASLR randomization fails x86/reboot: Turn off KVM when halting a CPU x86/boot: Fix BSS corruption/overwrite bug in early x86 kernel startup x86: Enable KASLR by default boot/param: Move next_arg() function to lib/cmdline.c for later reuse x86/boot: Fix Sparse warning by including required header file x86/boot/64: Rename start_cpu() x86/xen: Update e820 table handling to the new core x86 E820 code x86/boot: Fix pr_debug() API braindamage xen, x86/headers: Add <linux/device.h> dependency to <asm/xen/page.h> x86/boot/e820: Simplify e820__update_table() x86/boot/e820: Separate the E820 ABI structures from the in-kernel structures x86/boot/e820: Fix and clean up e820_type switch() statements x86/boot/e820: Rename the remaining E820 APIs to the e820__*() prefix x86/boot/e820: Remove unnecessary #include's x86/boot/e820: Rename e820_mark_nosave_regions() to e820__register_nosave_regions() x86/boot/e820: Rename e820_reserve_resources*() to e820__reserve_resources*() x86/boot/e820: Use bool in query APIs x86/boot/e820: Document e820__reserve_setup_data() x86/boot/e820: Clean up __e820__update_table() et al ...
| * | boot/param: Move next_arg() function to lib/cmdline.c for later reuseBaoquan He2017-04-181-52/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | next_arg() will be used to parse boot parameters in the x86/boot/compressed code, so move it to lib/cmdline.c for better code reuse. No change in functionality. Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Gustavo Padovan <gustavo.padovan@collabora.co.uk> Cc: Jens Axboe <axboe@fb.com> Cc: Jessica Yu <jeyu@redhat.com> Cc: Johannes Berg <johannes.berg@intel.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Larry Finger <Larry.Finger@lwfinger.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dan.j.williams@intel.com Cc: dave.jiang@intel.com Cc: dyoung@redhat.com Cc: keescook@chromium.org Cc: zijun_hu <zijun_hu@htc.com> Link: http://lkml.kernel.org/r/1492436099-4017-2-git-send-email-bhe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | Merge branch 'perf-core-for-linus' of ↵Linus Torvalds2017-05-018-28/+208
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Ingo Molnar: "The main changes in this cycle were: Kernel side changes: - Kprobes and uprobes changes: - Make their trampolines read-only while they are used - Make UPROBES_EVENTS default-y which is the distro practice - Apply misc fixes and robustization to probe point insertion. - add support for AMD IOMMU events - extend hw events on Intel Goldmont CPUs - ... plus misc fixes and updates. Tooling side changes: - support s390 jump instructions in perf annotate (Christian Borntraeger) - vendor hardware events updates (Andi Kleen) - add argument support for SDT events in powerpc (Ravi Bangoria) - beautify the statx syscall arguments in 'perf trace' (Arnaldo Carvalho de Melo) - handle inline functions in callchains (Jin Yao) - enable sorting by srcline as key (Milian Wolff) - add 'brstackinsn' field in 'perf script' to reuse the x86 instruction decoder used in the Intel PT code to study hot paths to samples (Andi Kleen) - add PERF_RECORD_NAMESPACES so that the kernel can record information required to associate samples to namespaces, helping in container problem characterization. (Hari Bathini) - allow sorting by symbol_size in 'perf report' and 'perf top' (Charles Baylis) - in perf stat, make system wide (-a) the default option if no target was specified and one of following conditions is met: - no workload specified (current behaviour) - a workload is specified but all requested events are system wide ones, like uncore ones. (Jiri Olsa) - ... plus lots of other updates, enhancements, cleanups and fixes" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (235 commits) perf tools: Fix the code to strip command name tools arch x86: Sync cpufeatures.h tools arch: Sync arch/x86/lib/memcpy_64.S with the kernel tools: Update asm-generic/mman-common.h copy from the kernel perf tools: Use just forward declarations for struct thread where possible perf tools: Add the right header to obtain PERF_ALIGN() perf tools: Remove poll.h and wait.h from util.h perf tools: Remove string.h, unistd.h and sys/stat.h from util.h perf tools: Remove stale prototypes from builtin.h perf tools: Remove string.h from util.h perf tools: Remove sys/ioctl.h from util.h perf tools: Remove a few more needless includes from util.h perf tools: Include sys/param.h where needed perf callchain: Move callchain specific routines from util.[ch] perf tools: Add compress.h for the *_decompress_to_file() headers perf mem: Fix display of data source snoop indication perf debug: Move dump_stack() and sighandler_dump_stack() to debug.h perf kvm: Make function only used by 'perf kvm' static perf tools: Move timestamp routines from util.h to time-utils.h perf tools: Move units conversion/formatting routines to separate object ...
| * \ \ Merge tag 'v4.11-rc6' into perf/core, to pick up fixesIngo Molnar2017-04-118-70/+103
| |\ \ \ | | | | | | | | | | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * \ \ \ Merge branch 'linus' into perf/core, to pick up fixesIngo Molnar2017-03-3011-387/+584
| |\ \ \ \ | | | | | | | | | | | | | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * \ \ \ \ Merge branch 'perf/urgent' into perf/core, to pick up fixesIngo Molnar2017-03-281-16/+48
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * \ \ \ \ \ Merge tag 'perf-core-for-mingo-4.12-20170316' of ↵Ingo Molnar2017-03-163-16/+28
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo: New features: - Add 'brstackinsn' field in 'perf script' to reuse the x86 instruction decoder used in the Intel PT code to study hot paths to samples (Andi Kleen) Kernel changes: - Default UPROBES_EVENTS to Y (Alexei Starovoitov) - Fix check for kretprobe offset within function entry (Naveen N. Rao) Infrastructure changes: - Introduce util func is_sdt_event() (Ravi Bangoria) - Make perf_event__synthesize_mmap_events() scale on older kernels where reading /proc/pid/maps is way slower than reading /proc/pid/task/pid/maps (Stephane Eranian) Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | | | | | uprobes: Default UPROBES_EVENTS to YArnaldo Carvalho de Melo2017-03-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As it is already turned on by most distros, so just flip the default to Y. Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Acked-by: David Ahern <dsahern@gmail.com> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Wang Nan <wangnan0@huawei.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: David Miller <davem@davemloft.net> Cc: Hemant Kumar <hemant@linux.vnet.ibm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20170316005817.GA6805@ast-mbp.thefacebook.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * | | | | | trace/kprobes: Fix check for kretprobe offset within function entryNaveen N. Rao2017-03-152-15/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | perf specifies an offset from _text and since this offset is fed directly into the arch-specific helper, kprobes tracer rejects installation of kretprobes through perf. Fix this by looking up the actual offset from a function for the specified sym+offset. Refactor and reuse existing routines to limit code duplication -- we repurpose kprobe_addr() for determining final kprobe address and we split out the function entry offset determination into a separate generic helper. Before patch: naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe -v do_open%return probe-definition(0): do_open%return symbol:do_open file:(null) line:0 offset:0 return:1 lazy:(null) 0 arguments Looking at the vmlinux_path (8 entries long) Using /boot/vmlinux for symbols Open Debuginfo file: /boot/vmlinux Try to find probe point from debuginfo. Matched function: do_open [2d0c7ff] Probe point found: do_open+0 Matched function: do_open [35d76dc] found inline addr: 0xc0000000004ba9c4 Failed to find "do_open%return", because do_open is an inlined function and has no return point. An error occurred in debuginfo analysis (-22). Trying to use symbols. Opening /sys/kernel/debug/tracing//README write=0 Opening /sys/kernel/debug/tracing//kprobe_events write=1 Writing event: r:probe/do_open _text+4469776 Failed to write event: Invalid argument Error: Failed to add events. Reason: Invalid argument (Code: -22) naveen@ubuntu:~/linux/tools/perf$ dmesg | tail <snip> [ 33.568656] Given offset is not valid for return probe. After patch: naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe -v do_open%return probe-definition(0): do_open%return symbol:do_open file:(null) line:0 offset:0 return:1 lazy:(null) 0 arguments Looking at the vmlinux_path (8 entries long) Using /boot/vmlinux for symbols Open Debuginfo file: /boot/vmlinux Try to find probe point from debuginfo. Matched function: do_open [2d0c7d6] Probe point found: do_open+0 Matched function: do_open [35d76b3] found inline addr: 0xc0000000004ba9e4 Failed to find "do_open%return", because do_open is an inlined function and has no return point. An error occurred in debuginfo analysis (-22). Trying to use symbols. Opening /sys/kernel/debug/tracing//README write=0 Opening /sys/kernel/debug/tracing//kprobe_events write=1 Writing event: r:probe/do_open _text+4469808 Writing event: r:probe/do_open_1 _text+4956344 Added new events: probe:do_open (on do_open%return) probe:do_open_1 (on do_open%return) You can now use it in all perf tools, such as: perf record -e probe:do_open_1 -aR sleep 1 naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/kprobes/list c000000000041370 k kretprobe_trampoline+0x0 [OPTIMIZED] c0000000004ba0b8 r do_open+0x8 [DISABLED] c000000000443430 r do_open+0x0 [DISABLED] Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/d8cd1ef420ec22e3643ac332fdabcffc77319a42.1488961018.git.naveen.n.rao@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| * | | | | | | perf/core: Add a flag for partial AUX recordsAlexander Shishkin2017-03-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Intel PT driver needs to be able to communicate partial AUX transactions, that is, transactions with gaps in data for reasons other than no room left in the buffer (i.e. truncated transactions). Therefore, this condition does not imply a wakeup for the consumer. To this end, add a new "partial" AUX flag. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: vince@deater.net Link: http://lkml.kernel.org/r/20170220133352.17995-4-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | perf/core: Keep AUX flags in the output handleWill Deacon2017-03-161-11/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for adding more flags to perf AUX records, introduce a separate API for setting the flags for a session, rather than appending more bool arguments to perf_aux_output_end. This allows to set each flag at the time a corresponding condition is detected, instead of tracking it in each driver's private state. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: vince@deater.net Link: http://lkml.kernel.org/r/20170220133352.17995-3-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | Merge branch 'linus' into perf/core, to pick up fixesIngo Molnar2017-03-1621-86/+201
| |\ \ \ \ \ \ \ | | |/ / / / / / | |/| | | | | | | | | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | kprobes: Convert kprobe_exceptions_notify to use NOKPROBE_SYMBOLNaveen N. Rao2017-03-141-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit fc62d0207ae0 ("kprobes: Introduce weak variant of kprobe_exceptions_notify()") used the __kprobes annotation to exclude kprobe_exceptions_notify from being probed. Since NOKPROBE_SYMBOL() is a better way to do this enabling the symbol to be discovered as being blacklisted, change over to using NOKPROBE_SYMBOL(). Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/3f25bf400da5c222cd9b10eec6ded2d6b58209f8.1488991670.git.naveen.n.rao@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| * | | | | | | perf: Add PERF_RECORD_NAMESPACES to include namespaces related infoHari Bathini2017-03-133-0/+144
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the advert of container technologies like docker, that depend on namespaces for isolation, there is a need for tracing support for namespaces. This patch introduces new PERF_RECORD_NAMESPACES event for recording namespaces related info. By recording info for every namespace, it is left to userspace to take a call on the definition of a container and trace containers by updating perf tool accordingly. Each namespace has a combination of device and inode numbers. Though every namespace has the same device number currently, that may change in future to avoid the need for a namespace of namespaces. Considering such possibility, record both device and inode numbers separately for each namespace. Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Aravinda Prasad <aravinda@linux.vnet.ibm.com> Cc: Brendan Gregg <brendan.d.gregg@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/148891929686.25309.2827618988917007768.stgit@hbathini.in.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| * | | | | | | Merge tag 'perf-core-for-mingo-4.11-20170306' of ↵Ingo Molnar2017-03-076-18/+29
| |\ \ \ \ \ \ \ | | |_|_|_|_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo: New features: - Allow sorting by symbol_size in 'perf report' and 'perf top' (Charles Baylis) E.g.: # perf report -s symbol_size,symbol Samples: 9K of event 'cycles:k', Event count (approx.): 2870461623 Overhead Symbol size Symbol 14.55% 326 [k] flush_tlb_mm_range 7.20% 1045 [k] filemap_map_pages 5.82% 124 [k] vma_interval_tree_insert 5.18% 2430 [k] unmap_page_range 2.57% 571 [k] vma_interval_tree_remove 1.94% 494 [k] page_add_file_rmap 1.82% 740 [k] page_remove_rmap 1.66% 1017 [k] release_pages 1.57% 1636 [k] update_blocked_averages 1.57% 76 [k] unlock_page - Add support for -p/--pid, -a/--all-cpus and -C/--cpu in 'perf ftrace' (Namhyung Kim) Change in behaviour: - Make system wide (-a) the default option if no target was specified and one of following conditions is met: - No workload specified (current behaviour) - A workload is specified but all requested events are system wide ones, like uncore ones. (Jiri Olsa) Fixes: - Add missing initialization to the instruction decoder used in the intel PT/BTS code, which was causing lots of failures in 'perf test', looking for a value when there was none (Adrian Hunter) Infrastructure changes: - Add arch code needed to adopt the kernel's refcount_t to aid in catching bugs when using atomic_t as a reference counter, basically cmpxchg related functions (Arnaldo Carvalho de Melo) - Convert the code using atomic_t as reference counts to refcount_t (Elena Rashetova) - Add feature test for sched_getcpu() to more easily check for its presence in the many libc implementations and accross different versions of such C libraries (Arnaldo Carvalho de Melo) - Issue a HW watchdog disable hint in 'perf stat' for when some of the requested events can't get counted because a PMU counter is taken by that watchdog (Borislav Petkov). - Add mapping for Intel's KnightsMill PMU events (Karol Wachowski) Documentation changes: - Clarify the term 'convergence' in: perf bench numa numa-mem -h --show_convergence (Jiri Olsa) Kernel code changes: - Ensure probe location is at function entry in kretprobes (Naveen N. Rao) - Allow return probes with offsets and absolute addresses (Naveen N. Rao) Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | | | | | trace/kprobes: Add back warning about offset in return probesSteven Rostedt (VMware)2017-03-031-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Let's not remove the warning about offsets and return probes when the offset is invalid. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/20170227115204.00f92846@gandalf.local.home Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * | | | | | trace/kprobes: Allow return probes with offsets and absolute addressesNaveen N. Rao2017-03-032-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the kernel includes many non-global functions with same names, we will need to use offsets from other symbols (typically _text/_stext) or absolute addresses to place return probes on specific functions. Also, the core register_kretprobe() API never forbid use of offsets or absolute addresses with kretprobes. Allow its use with the trace infrastructure. To distinguish kernels that support this, update ftrace README to explicitly call this out. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/183e7ce2921a08c9c755ee9a5da3134febc6695b.1487770934.git.naveen.n.rao@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * | | | | | kretprobes: Ensure probe location is at function entryNaveen N. Rao2017-03-031-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kretprobes can be registered by specifying an absolute address or by specifying offset to a symbol. However, we need to ensure this falls at function entry so as to be able to determine the return address. Validate the same during kretprobe registration. By default, there should not be any offset from a function entry, as determined through a kallsyms_lookup(). Introduce arch_function_offset_within_entry() as a way for architectures to override this. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/f1583bc4839a3862cfc2acefcc56f9c8837fa2ba.1487770934.git.naveen.n.rao@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
* | | | | | | | Merge branch 'locking-core-for-linus' of ↵Linus Torvalds2017-05-0113-485/+852
|\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "The main changes in this cycle were: - a big round of FUTEX_UNLOCK_PI improvements, fixes, cleanups and general restructuring - lockdep updates such as new checks for lock_downgrade() - introduce the new atomic_try_cmpxchg() locking API and use it to optimize refcount code generation - ... plus misc fixes, updates and cleanups" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) MAINTAINERS: Add FUTEX SUBSYSTEM futex: Clarify mark_wake_futex memory barrier usage futex: Fix small (and harmless looking) inconsistencies futex: Avoid freeing an active timer rtmutex: Plug preempt count leak in rt_mutex_futex_unlock() rtmutex: Fix more prio comparisons rtmutex: Fix PI chain order integrity sched,tracing: Update trace_sched_pi_setprio() sched/rtmutex: Refactor rt_mutex_setprio() rtmutex: Clean up sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period update sched/rtmutex/deadline: Fix a PI crash for deadline tasks rtmutex: Deboost before waking up the top waiter locking/ww-mutex: Limit stress test to 2 seconds locking/atomic: Fix atomic_try_cmpxchg() semantics lockdep: Fix per-cpu static objects futex: Drop hb->lock before enqueueing on the rtmutex futex: Futex_unlock_pi() determinism futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock() futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock() ...
| * | | | | | | | futex: Clarify mark_wake_futex memory barrier usageDarren Hart (VMware)2017-04-151-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clarify the scenario described in mark_wake_futex requiring the smp_store_release(). Update the comment to explicitly refer to the plist_del now under __unqueue_futex() (previously plist_del was in the same function as the comment). Signed-off-by: Darren Hart (VMware) <dvhart@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20170414223138.GA4222@fury Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | futex: Fix small (and harmless looking) inconsistenciesPeter Zijlstra2017-04-141-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During (post-commit) review Darren spotted a few minor things. One (harmless AFAICT) type inconsistency and a comment that wasn't as clear as hoped. Reported-by: Darren Hart (VMWare) <dvhart@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | futex: Avoid freeing an active timerThomas Gleixner2017-04-141-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alexander reported a hrtimer debug_object splat: ODEBUG: free active (active state 0) object type: hrtimer hint: hrtimer_wakeup (kernel/time/hrtimer.c:1423) debug_object_free (lib/debugobjects.c:603) destroy_hrtimer_on_stack (kernel/time/hrtimer.c:427) futex_lock_pi (kernel/futex.c:2740) do_futex (kernel/futex.c:3399) SyS_futex (kernel/futex.c:3447 kernel/futex.c:3415) do_syscall_64 (arch/x86/entry/common.c:284) entry_SYSCALL64_slow_path (arch/x86/entry/entry_64.S:249) Which was caused by commit: cfafcd117da0 ("futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()") ... losing the hrtimer_cancel() in the shuffle. Where previously the hrtimer_cancel() was done by rt_mutex_slowlock() we now need to do it manually. Reported-by: Alexander Levin <alexander.levin@verizon.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Fixes: cfafcd117da0 ("futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()") Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1704101802370.2906@nanos Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | Merge branch 'linus' into locking/core, to pick up fixesIngo Molnar2017-04-1417-441/+639
| |\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | rtmutex: Plug preempt count leak in rt_mutex_futex_unlock()Mike Galbraith2017-04-051-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mark_wakeup_next_waiter() already disables preemption, doing so again leaves us with an unpaired preempt_disable(). Fixes: 2a1c60299406 ("rtmutex: Deboost before waking up the top waiter") Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/1491379707.6538.2.camel@gmx.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | rtmutex: Fix more prio comparisonsPeter Zijlstra2017-04-041-3/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There was a pure ->prio comparison left in try_to_wake_rt_mutex(), convert it to use rt_mutex_waiter_less(), noting that greater-or-equal is not-less (both in kernel priority view). This necessitated the introduction of cmp_task() which creates a pointer to an unnamed stack variable of struct rt_mutex_waiter type to compare against tasks. With this, we can now also create and employ rt_mutex_waiter_equal(). Reviewed-and-tested-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170323150216.455584638@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | rtmutex: Fix PI chain order integrityPeter Zijlstra2017-04-042-2/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rt_mutex_waiter::prio is a copy of task_struct::prio which is updated during the PI chain walk, such that the PI chain order isn't messed up by (asynchronous) task state updates. Currently rt_mutex_waiter_less() uses task state for deadline tasks; this is broken, since the task state can, as said above, change asynchronously, causing the RB tree order to change without actual tree update -> FAIL. Fix this by also copying the deadline into the rt_mutex_waiter state and updating it along with its prio field. Ideally we would also force PI chain updates whenever DL tasks update their deadline parameter, but for first approximation this is less broken than it was. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170323150216.403992539@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | sched,tracing: Update trace_sched_pi_setprio()Peter Zijlstra2017-04-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pass the PI donor task, instead of a numerical priority. Numerical priorities are not sufficient to describe state ever since SCHED_DEADLINE. Annotate all sched tracepoints that are currently broken; fixing them will bork userspace. *hate*. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170323150216.353599881@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | sched/rtmutex: Refactor rt_mutex_setprio()Peter Zijlstra2017-04-042-95/+83
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the introduction of SCHED_DEADLINE the whole notion that priority is a single number is gone, therefore the @prio argument to rt_mutex_setprio() doesn't make sense anymore. So rework the code to pass a pi_task instead. Note this also fixes a problem with pi_top_task caching; previously we would not set the pointer (call rt_mutex_update_top_task) if the priority didn't change, this could lead to a stale pointer. As for the XXX, I think its fine to use pi_task->prio, because if it differs from waiter->prio, a PI chain update is immenent. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170323150216.303827095@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | rtmutex: Clean upPeter Zijlstra2017-04-043-19/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previous patches changed the meaning of the return value of rt_mutex_slowunlock(); update comments and code to reflect this. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170323150216.255058238@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | sched/deadline/rtmutex: Dont miss the dl_runtime/dl_period updateXunlei Pang2017-04-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently dl tasks will actually return at the very beginning of rt_mutex_adjust_prio_chain() in !detect_deadlock cases: if (waiter->prio == task->prio) { if (!detect_deadlock) goto out_unlock_pi; // out here else requeue = false; } As the deadline value of blocked deadline tasks(waiters) without changing their sched_class(thus prio doesn't change) never changes, this seems reasonable, but it actually misses the chance of updating rt_mutex_waiter's "dl_runtime(period)_copy" if a waiter updates its deadline parameters(dl_runtime, dl_period) or boosted waiter changes to !deadline class. Thus, force deadline task not out by adding the !dl_prio() condition. Signed-off-by: Xunlei Pang <xlpang@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/1460633827-345-7-git-send-email-xlpang@redhat.com Link: http://lkml.kernel.org/r/20170323150216.206577901@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | sched/rtmutex/deadline: Fix a PI crash for deadline tasksXunlei Pang2017-04-043-8/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A crash happened while I was playing with deadline PI rtmutex. BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30 PGD 232a75067 PUD 230947067 PMD 0 Oops: 0000 [#1] SMP CPU: 1 PID: 10994 Comm: a.out Not tainted Call Trace: [<ffffffff810b658c>] enqueue_task+0x2c/0x80 [<ffffffff810ba763>] activate_task+0x23/0x30 [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260 [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20 [<ffffffff8164e783>] __schedule+0xd3/0x900 [<ffffffff8164efd9>] schedule+0x29/0x70 [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0 [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190 [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60 [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390 [<ffffffff810ed8b0>] do_futex+0x190/0x5b0 [<ffffffff810edd50>] SyS_futex+0x80/0x180 This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi() are only protected by pi_lock when operating pi waiters, while rt_mutex_get_top_task(), will access them with rq lock held but not holding pi_lock. In order to tackle it, we introduce new "pi_top_task" pointer cached in task_struct, and add new rt_mutex_update_top_task() to update its value, it can be called by rt_mutex_setprio() which held both owner's pi_lock and rq lock. Thus "pi_top_task" can be safely accessed by enqueue_task_dl() under rq lock. Originally-From: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Xunlei Pang <xlpang@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170323150216.157682758@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | rtmutex: Deboost before waking up the top waiterXunlei Pang2017-04-043-32/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We should deboost before waking the high-priority task, such that we don't run two tasks with the same "state" (priority, deadline, sched_class, etc). In order to make sure the boosting task doesn't start running between unlock and deboost (due to 'spurious' wakeup), we move the deboost under the wait_lock, that way its serialized against the wait loop in __rt_mutex_slowlock(). Doing the deboost early can however lead to priority-inversion if current would get preempted after the deboost but before waking our high-prio task, hence we disable preemption before doing deboost, and enabling it after the wake up is over. This gets us the right semantic order, but most importantly however; this change ensures pointer stability for the next patch, where we have rt_mutex_setprio() cache a pointer to the top-most waiter task. If we, as before this change, do the wakeup first and then deboost, this pointer might point into thin air. [peterz: Changelog + patch munging] Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Xunlei Pang <xlpang@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170323150216.110065320@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | Merge branch 'sched/core' into locking/coreThomas Gleixner2017-04-0412-304/+648
| |\ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Required for the rtmutex/sched_deadline patches which depend on both branches
| * | | | | | | | | | locking/ww-mutex: Limit stress test to 2 secondsChris Wilson2017-03-301-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use a timeout rather than a fixed number of loops to avoid running for very long periods, such as under the kbuilder VMs. Reported-by: kernel test robot <xiaolong.ye@intel.com> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170310105733.6444-1-chris@chris-wilson.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | | | | | | | lockdep: Fix per-cpu static objectsPeter Zijlstra2017-03-262-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 383776fa7527 ("locking/lockdep: Handle statically initialized PER_CPU locks properly") we try to collapse per-cpu locks into a single class by giving them all the same key. For this key we choose the canonical address of the per-cpu object, which would be the offset into the per-cpu area. This has two problems: - there is a case where we run !0 lock->key through static_obj() and expect this to pass; it doesn't for canonical pointers. - 0 is a valid canonical address. Cure both issues by redefining the canonical address as the address of the per-cpu variable on the boot CPU. Since I didn't want to rely on CPU0 being the boot-cpu, or even existing at all, track the boot CPU in a variable. Fixes: 383776fa7527 ("locking/lockdep: Handle statically initialized PER_CPU locks properly") Reported-by: kernel test robot <fengguang.wu@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Borislav Petkov <bp@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-mm@kvack.org Cc: wfg@linux.intel.com Cc: kernel test robot <fengguang.wu@intel.com> Cc: LKP <lkp@01.org> Link: http://lkml.kernel.org/r/20170320114108.kbvcsuepem45j5cr@hirez.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | | futex: Drop hb->lock before enqueueing on the rtmutexPeter Zijlstra2017-03-233-30/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When PREEMPT_RT_FULL does the spinlock -> rt_mutex substitution the PI chain code will (falsely) report a deadlock and BUG. The problem is that it hold hb->lock (now an rt_mutex) while doing task_blocks_on_rt_mutex on the futex's pi_state::rtmutex. This, when interleaved just right with futex_unlock_pi() leads it to believe to see an AB-BA deadlock. Task1 (holds rt_mutex, Task2 (does FUTEX_LOCK_PI) does FUTEX_UNLOCK_PI) lock hb->lock lock rt_mutex (as per start_proxy) lock hb->lock Which is a trivial AB-BA. It is not an actual deadlock, because it won't be holding hb->lock by the time it actually blocks on the rt_mutex, but the chainwalk code doesn't know that and it would be a nightmare to handle this gracefully. To avoid this problem, do the same as in futex_unlock_pi() and drop hb->lock after acquiring wait_lock. This still fully serializes against futex_unlock_pi(), since adding to the wait_list does the very same lock dance, and removing it holds both locks. Aside of solving the RT problem this makes the lock and unlock mechanism symetric and reduces the hb->lock held time. Reported-and-tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104152.161341537@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | | futex: Futex_unlock_pi() determinismPeter Zijlstra2017-03-231-13/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The problem with returning -EAGAIN when the waiter state mismatches is that it becomes very hard to proof a bounded execution time on the operation. And seeing that this is a RT operation, this is somewhat important. While in practise; given the previous patch; it will be very unlikely to ever really take more than one or two rounds, proving so becomes rather hard. However, now that modifying wait_list is done while holding both hb->lock and wait_lock, the scenario can be avoided entirely by acquiring wait_lock while still holding hb-lock. Doing a hand-over, without leaving a hole. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104152.112378812@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | | futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()Peter Zijlstra2017-03-233-42/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() all wait_list modifications are done under both hb->lock and wait_lock. This closes the obvious interleave pattern between futex_lock_pi() and futex_unlock_pi(), but not entirely so. See below: Before: futex_lock_pi() futex_unlock_pi() unlock hb->lock lock hb->lock unlock hb->lock lock rt_mutex->wait_lock unlock rt_mutex_wait_lock -EAGAIN lock rt_mutex->wait_lock list_add unlock rt_mutex->wait_lock schedule() lock rt_mutex->wait_lock list_del unlock rt_mutex->wait_lock <idem> -EAGAIN lock hb->lock After: futex_lock_pi() futex_unlock_pi() lock hb->lock lock rt_mutex->wait_lock list_add unlock rt_mutex->wait_lock unlock hb->lock schedule() lock hb->lock unlock hb->lock lock hb->lock lock rt_mutex->wait_lock list_del unlock rt_mutex->wait_lock lock rt_mutex->wait_lock unlock rt_mutex_wait_lock -EAGAIN unlock hb->lock It does however solve the earlier starvation/live-lock scenario which got introduced with the -EAGAIN since unlike the before scenario; where the -EAGAIN happens while futex_unlock_pi() doesn't hold any locks; in the after scenario it happens while futex_unlock_pi() actually holds a lock, and then it is serialized on that lock. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104152.062785528@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | | futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()Peter Zijlstra2017-03-233-12/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the ultimate goal of keeping rt_mutex wait_list and futex_q waiters consistent it's necessary to split 'rt_mutex_futex_lock()' into finer parts, such that only the actual blocking can be done without hb->lock held. Split split_mutex_finish_proxy_lock() into two parts, one that does the blocking and one that does remove_waiter() when the lock acquire failed. When the rtmutex was acquired successfully the waiter can be removed in the acquisiton path safely, since there is no concurrency on the lock owner. This means that, except for futex_lock_pi(), all wait_list modifications are done with both hb->lock and wait_lock held. [bigeasy@linutronix.de: fix for futex_requeue_pi_signal_restart] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104152.001659630@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | | futex,rt_mutex: Introduce rt_mutex_init_waiter()Peter Zijlstra2017-03-233-7/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since there's already two copies of this code, introduce a helper now before adding a third one. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.950039479@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | | futex: Pull rt_mutex_futex_unlock() out from under hb->lockPeter Zijlstra2017-03-231-54/+100
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a number of 'interesting' problems, all caused by holding hb->lock while doing the rt_mutex_unlock() equivalient. Notably: - a PI inversion on hb->lock; and, - a SCHED_DEADLINE crash because of pointer instability. The previous changes: - changed the locking rules to cover {uval,pi_state} with wait_lock. - allow to do rt_mutex_futex_unlock() without dropping wait_lock; which in turn allows to rely on wait_lock atomicity completely. - simplified the waiter conundrum. It's now sufficient to hold rtmutex::wait_lock and a reference on the pi_state to protect the state consistency, so hb->lock can be dropped before calling rt_mutex_futex_unlock(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.900002056@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | | futex: Rework inconsistent rt_mutex/futex_q statePeter Zijlstra2017-03-231-36/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a weird state in the futex_unlock_pi() path when it interleaves with a concurrent futex_lock_pi() at the point where it drops hb->lock. In this case, it can happen that the rt_mutex wait_list and the futex_q disagree on pending waiters, in particular rt_mutex will find no pending waiters where futex_q thinks there are. In this case the rt_mutex unlock code cannot assign an owner. The futex side fixup code has to cleanup the inconsistencies with quite a bunch of interesting corner cases. Simplify all this by changing wake_futex_pi() to return -EAGAIN when this situation occurs. This then gives the futex_lock_pi() code the opportunity to continue and the retried futex_unlock_pi() will now observe a coherent state. The only problem is that this breaks RT timeliness guarantees. That is, consider the following scenario: T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1) CPU0 T1 lock_pi() queue_me() <- Waiter is visible preemption T2 unlock_pi() loops with -EAGAIN forever Which is undesirable for PI primitives. Future patches will rectify this. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.850383690@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | | futex: Cleanup refcountingPeter Zijlstra2017-03-231-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a put_pit_state() as counterpart for get_pi_state() so the refcounting becomes consistent. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.801778516@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | | futex: Change locking rulesPeter Zijlstra2017-03-231-33/+132
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently futex-pi relies on hb->lock to serialize everything. But hb->lock creates another set of problems, especially priority inversions on RT where hb->lock becomes a rt_mutex itself. The rt_mutex::wait_lock is the most obvious protection for keeping the futex user space value and the kernel internal pi_state in sync. Rework and document the locking so rt_mutex::wait_lock is held accross all operations which modify the user space value and the pi state. This allows to invoke rt_mutex_unlock() (including deboost) without holding hb->lock as a next step. Nothing yet relies on the new locking rules. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.751993333@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | | futex,rt_mutex: Provide futex specific rt_mutex APIPeter Zijlstra2017-03-233-32/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Part of what makes futex_unlock_pi() intricate is that rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop rt_mutex::wait_lock. This means it cannot rely on the atomicy of wait_lock, which would be preferred in order to not rely on hb->lock so much. The reason rt_mutex_slowunlock() needs to drop wait_lock is because it can race with the rt_mutex fastpath, however futexes have their own fast path. Since futexes already have a bunch of separate rt_mutex accessors, complete that set and implement a rt_mutex variant without fastpath for them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.702962446@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | | futex: Remove rt_mutex_deadlock_account_*()Peter Zijlstra2017-03-234-43/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These are unused and clutter up the code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.652692478@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | | futex: Use smp_store_release() in mark_wake_futex()Peter Zijlstra2017-03-231-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the futex_q can dissapear the instruction after assigning NULL, this really should be a RELEASE barrier. That stops loads from hitting dead memory too. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.604296452@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | | | | | futex: Cleanup variable names for futex_top_waiter()Peter Zijlstra2017-03-231-15/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | futex_top_waiter() returns the top-waiter on the pi_mutex. Assinging this to a variable 'match' totally obscures the code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.554710645@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>