summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* fscrypt: allow synchronous bio decryptionEric Biggers2018-05-027-21/+45
| | | | | | | | | | | | | | | | | | Currently, fscrypt provides fscrypt_decrypt_bio_pages() which decrypts a bio's pages asynchronously, then unlocks them afterwards. But, this assumes that decryption is the last "postprocessing step" for the bio, so it's incompatible with additional postprocessing steps such as authenticity verification after decryption. Therefore, rename the existing fscrypt_decrypt_bio_pages() to fscrypt_enqueue_decrypt_bio(). Then, add fscrypt_decrypt_bio() which decrypts the pages in the bio synchronously without unlocking the pages, nor setting them Uptodate; and add fscrypt_enqueue_decrypt_work(), which enqueues work on the fscrypt_read_workqueue. The new functions will be used by filesystems that support both fscrypt and fs-verity. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
* Merge branch 'for-linus' of ↵Linus Torvalds2018-05-012-0/+7
|\ | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rkuo/linux-hexagon-kernel Pull hexagon fixes from Richard Kuo: "Some small fixes for module compilation" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rkuo/linux-hexagon-kernel: hexagon: export csum_partial_copy_nocheck hexagon: add memset_io() helper
| * hexagon: export csum_partial_copy_nocheckArnd Bergmann2018-05-011-0/+1
| | | | | | | | | | | | | | | | This is needed to link ipv6 as a loadable module, which in turn happens in allmodconfig. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Richard Kuo <rkuo@codeaurora.org>
| * hexagon: add memset_io() helperArnd Bergmann2018-05-011-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | We already have memcpy_toio(), but not memset_io(), so let's add the obvious version to allow building an allmodconfig kernel without errors like drivers/gpu/drm/ttm/ttm_bo_util.c: In function 'ttm_bo_move_memcpy': drivers/gpu/drm/ttm/ttm_bo_util.c:390:3: error: implicit declaration of function 'memset_io' [-Werror=implicit-function-declaration] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Richard Kuo <rkuo@codeaurora.org>
* | Merge tag 'xfs-4.17-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2018-05-014-6/+42
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs fixes from Darrick Wong: "Here are a few more bug fixes for xfs for 4.17-rc4. Most of them are fixes for bad behavior. This series has been run through a full xfstests run during LSF and through a quick xfstests run against this morning's master, with no major failures reported. Summary: - Enhance inode fork verifiers to prevent loading of corrupted metadata. - Fix a crash when we try to convert extents format inodes to btree format, we run out of space, but forget to revert the in-core state changes. - Fix file size checks when doing INSERT_RANGE that could cause files to end up negative size if there previously was an extent mapped at s_maxbytes. - Fix a bug when doing a remove-then-add ATTR_REPLACE xattr update where we forget to clear ATTR_REPLACE after the remove, which causes the attr to be lost and the fs to shut down due to (what it thinks is) inconsistent in-core state" * tag 'xfs-4.17-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: don't fail when converting shortform attr to long form during ATTR_REPLACE xfs: prevent creating negative-sized file via INSERT_RANGE xfs: set format back to extents if xfs_bmap_extents_to_btree xfs: enhance dinode verifier
| * | xfs: don't fail when converting shortform attr to long form during ATTR_REPLACEDarrick J. Wong2018-04-171-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kanda Motohiro reported that expanding a tiny xattr into a large xattr fails on XFS because we remove the tiny xattr from a shortform fork and then try to re-add it after converting the fork to extents format having not removed the ATTR_REPLACE flag. This fails because the attr is no longer present, causing a fs shutdown. This is derived from the patch in his bug report, but we really shouldn't ignore a nonzero retval from the remove call. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199119 Reported-by: kanda.motohiro@gmail.com Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | xfs: prevent creating negative-sized file via INSERT_RANGEDarrick J. Wong2018-04-171-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During the "insert range" fallocate operation, i_size grows by the specified 'len' bytes. XFS verifies that i_size + len < s_maxbytes, as it should. But this comparison is done using the signed 'loff_t', and 'i_size + len' can wrap around to a negative value, causing the check to incorrectly pass, resulting in an inode with "negative" i_size. This is possible on 64-bit platforms, where XFS sets s_maxbytes = LLONG_MAX. ext4 and f2fs don't run into this because they set a smaller s_maxbytes. Fix it by using subtraction instead. Reproducer: xfs_io -f file -c "truncate $(((1<<63)-1))" -c "finsert 0 4096" Fixes: a904b1ca5751 ("xfs: Add support FALLOC_FL_INSERT_RANGE for fallocate") Cc: <stable@vger.kernel.org> # v4.1+ Originally-From: Eric Biggers <ebiggers@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: fix signed integer addition overflow too] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | xfs: set format back to extents if xfs_bmap_extents_to_btreeEric Sandeen2018-04-171-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If xfs_bmap_extents_to_btree fails in a mode where we call xfs_iroot_realloc(-1) to de-allocate the root, set the format back to extents. Otherwise we can assume we can dereference ifp->if_broot based on the XFS_DINODE_FMT_BTREE format, and crash. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199423 Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
| * | xfs: enhance dinode verifierEric Sandeen2018-04-171-0/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add several more validations to xfs_dinode_verify: - For LOCAL data fork formats, di_nextents must be 0. - For LOCAL attr fork formats, di_anextents must be 0. - For inodes with no attr fork offset, - format must be XFS_DINODE_FMT_EXTENTS if set at all - di_anextents must be 0. Thanks to dchinner for pointing out a couple related checks I had forgotten to add. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199377 Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* | | Merge tag 'errseq-v4.17' of ↵Linus Torvalds2018-04-301-14/+9
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull errseq infrastructure fix from Jeff Layton: "The PostgreSQL developers recently had a spirited discussion about the writeback error handling in Linux, and reached out to us about a behavoir change to the code that bit them when the errseq_t changes were merged. When we changed to using errseq_t for tracking writeback errors, we lost the ability for an application to see a writeback error that occurred before the open on which the fsync was issued. This was problematic for PostgreSQL which offloads fsync calls to a completely separate process from the DB writers. This patch restores that ability. If the errseq_t value in the inode does not have the SEEN flag set, then we just return 0 for the sample. That ensures that any recorded error is always delivered at least once. Note that we might still lose the error if the inode gets evicted from the cache before anything can reopen it, but that was the case before errseq_t was merged. At LSF/MM we had some discussion about keeping inodes with unreported writeback errors around in the cache for longer (possibly indefinitely), but that's really a separate problem" * tag 'errseq-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: errseq: Always report a writeback error once
| * | | errseq: Always report a writeback error onceMatthew Wilcox2018-04-271-14/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The errseq_t infrastructure assumes that errors which occurred before the file descriptor was opened are of no interest to the application. This turns out to be a regression for some applications, notably Postgres. Before errseq_t, a writeback error would be reported exactly once (as long as the inode remained in memory), so Postgres could open a file, call fsync() and find out whether there had been a writeback error on that file from another process. This patch changes the errseq infrastructure to report errors to all file descriptors which are opened after the error occurred, but before it was reported to any file descriptor. This restores the user-visible behaviour. Cc: stable@vger.kernel.org Fixes: 5660e13d2fd6 ("fs: new infrastructure for writeback error handling and reporting") Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Jeff Layton <jlayton@redhat.com>
* | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparcLinus Torvalds2018-04-303-3/+3
|\ \ \ \ | |_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Fixup license text for oradax driver, from Rob Gardner. - Release device object with put_device() instead of straight kfree(), from Arvind Yadav. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc: vio: use put_device() instead of kfree() sparc64: Fix mistake in oradax license text
| * | | sparc: vio: use put_device() instead of kfree()Arvind Yadav2018-04-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Never directly free @dev after calling device_register(), even if it returned an error. Always use put_device() to give up the reference initialized. Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | sparc64: Fix mistake in oradax license textRob Gardner2018-04-302-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The license text in both oradax files mistakenly specifies "version 3" of the GNU General Public License. This is corrected to specify "version 2". Signed-off-by: Rob Gardner <rob.gardner@oracle.com> Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | Linux v4.17-rc3v4.17-rc3Linus Torvalds2018-04-291-1/+1
| | | |
* | | | Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds2018-04-2912-19/+93
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "Another set of x86 related updates: - Fix the long broken x32 version of the IPC user space headers which was noticed by Arnd Bergman in course of his ongoing y2038 work. GLIBC seems to have non broken private copies of these headers so this went unnoticed. - Two microcode fixlets which address some more fallout from the recent modifications in that area: - Unconditionally save the microcode patch, which was only saved when CPU_HOTPLUG was enabled causing failures in the late loading mechanism - Make the later loader synchronization finally work under all circumstances. It was exiting early and causing timeout failures due to a missing synchronization point. - Do not use mwait_play_dead() on AMD systems to prevent excessive power consumption as the CPU cannot go into deep power states from there. - Address an annoying sparse warning due to lost type qualifiers of the vmemmap and vmalloc base address constants. - Prevent reserving crash kernel region on Xen PV as this leads to the wrong perception that crash kernels actually work there which is not the case. Xen PV has its own crash mechanism handled by the hypervisor. - Add missing TLB cpuid values to the table to make the printout on certain machines correct. - Enumerate the new CLDEMOTE instruction - Fix an incorrect SPDX identifier - Remove stale macros" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ipc: Fix x32 version of shmid64_ds and msqid64_ds x86/setup: Do not reserve a crash kernel region if booted on Xen PV x86/cpu/intel: Add missing TLB cpuid values x86/smpboot: Don't use mwait_play_dead() on AMD systems x86/mm: Make vmemmap and vmalloc base address constants unsigned long x86/vector: Remove the unused macro FPU_IRQ x86/vector: Remove the macro VECTOR_OFFSET_START x86/cpufeatures: Enumerate cldemote instruction x86/microcode: Do not exit early from __reload_late() x86/microcode/intel: Save microcode patch unconditionally x86/jailhouse: Fix incorrect SPDX identifier
| * | | | x86/ipc: Fix x32 version of shmid64_ds and msqid64_dsArnd Bergmann2018-04-272-0/+73
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A bugfix broke the x32 shmid64_ds and msqid64_ds data structure layout (as seen from user space) a few years ago: Originally, __BITS_PER_LONG was defined as 64 on x32, so we did not have padding after the 64-bit __kernel_time_t fields, After __BITS_PER_LONG got changed to 32, applications would observe extra padding. In other parts of the uapi headers we seem to have a mix of those expecting either 32 or 64 on x32 applications, so we can't easily revert the path that broke these two structures. Instead, this patch decouples x32 from the other architectures and moves it back into arch specific headers, partially reverting the even older commit 73a2d096fdf2 ("x86: remove all now-duplicate header files"). It's not clear whether this ever made any difference, since at least glibc carries its own (correct) copy of both of these header files, so possibly no application has ever observed the definitions here. Based on a suggestion from H.J. Lu, I tried out the tool from https://github.com/hjl-tools/linux-header to find other such bugs, which pointed out the same bug in statfs(), which also has a separate (correct) copy in glibc. Fixes: f4b4aae18288 ("x86/headers/uapi: Fix __BITS_PER_LONG value for x32 builds") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "H . J . Lu" <hjl.tools@gmail.com> Cc: Jeffrey Walton <noloader@gmail.com> Cc: stable@vger.kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20180424212013.3967461-1-arnd@arndb.de
| * | | | x86/setup: Do not reserve a crash kernel region if booted on Xen PVPetr Tesarik2018-04-271-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Xen PV domains cannot shut down and start a crash kernel. Instead, the crashing kernel makes a SCHEDOP_shutdown hypercall with the reason code SHUTDOWN_crash, cf. xen_crash_shutdown() machine op in arch/x86/xen/enlighten_pv.c. A crash kernel reservation is merely a waste of RAM in this case. It may also confuse users of kexec_load(2) and/or kexec_file_load(2). When flags include KEXEC_ON_CRASH or KEXEC_FILE_ON_CRASH, respectively, these syscalls return success, which is technically correct, but the crash kexec image will never be actually used. Signed-off-by: Petr Tesarik <ptesarik@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Dou Liyang <douly.fnst@cn.fujitsu.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: xen-devel@lists.xenproject.org Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@suse.de> Cc: Jean Delvare <jdelvare@suse.de> Link: https://lkml.kernel.org/r/20180425120835.23cef60c@ezekiel.suse.cz
| * | | | x86/cpu/intel: Add missing TLB cpuid valuesjacek.tomaka@poczta.fm2018-04-261-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make kernel print the correct number of TLB entries on Intel Xeon Phi 7210 (and others) Before: [ 0.320005] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0 After: [ 0.320005] Last level dTLB entries: 4KB 256, 2MB 128, 4MB 128, 1GB 16 The entries do exist in the official Intel SMD but the type column there is incorrect (states "Cache" where it should read "TLB"), but the entries for the values 0x6B, 0x6C and 0x6D are correctly described as 'Data TLB'. Signed-off-by: Jacek Tomaka <jacek.tomaka@poczta.fm> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20180423161425.24366-1-jacekt@dugeo.com
| * | | | x86/smpboot: Don't use mwait_play_dead() on AMD systemsYazen Ghannam2018-04-261-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recent AMD systems support using MWAIT for C1 state. However, MWAIT will not allow deeper cstates than C1 on current systems. play_dead() expects to use the deepest state available. The deepest state available on AMD systems is reached through SystemIO or HALT. If MWAIT is available, it is preferred over the other methods, so the CPU never reaches the deepest possible state. Don't try to use MWAIT to play_dead() on AMD systems. Instead, use CPUIDLE to enter the deepest state advertised by firmware. If CPUIDLE is not available then fallback to HALT. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Link: https://lkml.kernel.org/r/20180403140228.58540-1-Yazen.Ghannam@amd.com
| * | | | x86/mm: Make vmemmap and vmalloc base address constants unsigned longJiri Kosina2018-04-261-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commits 9b46a051e4 ("x86/mm: Initialize vmemmap_base at boot-time") and a7412546d8 ("x86/mm: Adjust vmalloc base and size at boot-time") lost the type information for __VMALLOC_BASE_L4, __VMALLOC_BASE_L5, __VMEMMAP_BASE_L4 and __VMEMMAP_BASE_L5 constants. Declare them explicitly unsigned long again. Fixes: 9b46a051e4 ("x86/mm: Initialize vmemmap_base at boot-time") Fixes: a7412546d8 ("x86/mm: Adjust vmalloc base and size at boot-time") Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1804121437350.28129@cbobk.fhfr.pm
| * | | | x86/vector: Remove the unused macro FPU_IRQDou Liyang2018-04-261-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The macro FPU_IRQ has never been used since v3.10, So remove it. Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: hpa@zytor.com Link: https://lkml.kernel.org/r/20180426060832.27312-1-douly.fnst@cn.fujitsu.com
| * | | | x86/vector: Remove the macro VECTOR_OFFSET_STARTDou Liyang2018-04-261-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now, Linux uses matrix allocator for vector assignment, the original assignment code which used VECTOR_OFFSET_START has been removed. So remove the stale macro as well. Fixes: commit 69cde0004a4b ("x86/vector: Use matrix allocator for vector assignment") Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: David Rientjes <rientjes@google.com> Cc: hpa@zytor.com Link: https://lkml.kernel.org/r/20180425020553.17210-1-douly.fnst@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | x86/cpufeatures: Enumerate cldemote instructionFenghua Yu2018-04-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cldemote is a new instruction in future x86 processors. It hints to hardware that a specified cache line should be moved ("demoted") from the cache(s) closest to the processor core to a level more distant from the processor core. This instruction is faster than snooping to make the cache line available for other cores. cldemote instruction is indicated by the presence of the CPUID feature flag CLDEMOTE (CPUID.(EAX=0x7, ECX=0):ECX[bit25]). More details on cldemote instruction can be found in the latest Intel Architecture Instruction Set Extensions and Future Features Programming Reference. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: "Ashok Raj" <ashok.raj@intel.com> Link: https://lkml.kernel.org/r/1524508162-192587-1-git-send-email-fenghua.yu@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | x86/microcode: Do not exit early from __reload_late()Borislav Petkov2018-04-241-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Vitezslav reported a case where the "Timeout during microcode update!" panic would hit. After a deeper look, it turned out that his .config had CONFIG_HOTPLUG_CPU disabled which practically made save_mc_for_early() a no-op. When that happened, the discovered microcode patch wasn't saved into the cache and the late loading path wouldn't find any. This, then, lead to early exit from __reload_late() and thus CPUs waiting until the timeout is reached, leading to the panic. In hindsight, that function should have been written so it does not return before the post-synchronization. Oh well, I know better now... Fixes: bb8c13d61a62 ("x86/microcode: Fix CPU synchronization routine") Reported-by: Vitezslav Samel <vitezslav@samel.cz> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Vitezslav Samel <vitezslav@samel.cz> Tested-by: Ashok Raj <ashok.raj@intel.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20180418081140.GA2439@pc11.op.pod.cz Link: https://lkml.kernel.org/r/20180421081930.15741-2-bp@alien8.de
| * | | | x86/microcode/intel: Save microcode patch unconditionallyBorislav Petkov2018-04-241-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | save_mc_for_early() was a no-op on !CONFIG_HOTPLUG_CPU but the generic_load_microcode() path saves the microcode patches it has found into the cache of patches which is used for late loading too. Regardless of whether CPU hotplug is used or not. Make the saving unconditional so that late loading can find the proper patch. Reported-by: Vitezslav Samel <vitezslav@samel.cz> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Vitezslav Samel <vitezslav@samel.cz> Tested-by: Ashok Raj <ashok.raj@intel.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20180418081140.GA2439@pc11.op.pod.cz Link: https://lkml.kernel.org/r/20180421081930.15741-1-bp@alien8.de
| * | | | x86/jailhouse: Fix incorrect SPDX identifierThomas Gleixner2018-04-232-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | GPL2.0 is not a valid SPDX identiier. Replace it with GPL-2.0. Fixes: 4a362601baa6 ("x86/jailhouse: Add infrastructure for running in non-root cell") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Jan Kiszka <jan.kiszka@siemens.com> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Link: https://lkml.kernel.org/r/20180422220832.815346488@linutronix.de
* | | | | Merge branch 'x86-pti-for-linus' of ↵Linus Torvalds2018-04-297-34/+99
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 pti fixes from Thomas Gleixner: "A set of updates for the x86/pti related code: - Preserve r8-r11 in int $0x80. r8-r11 need to be preserved, but the int$80 entry code removed that quite some time ago. Make it correct again. - A set of fixes for the Global Bit work which went into 4.17 and caused a bunch of interesting regressions: - Triggering a BUG in the page attribute code due to a missing check for early boot stage - Warnings in the page attribute code about holes in the kernel text mapping which are caused by the freeing of the init code. Handle such holes gracefully. - Reduce the amount of kernel memory which is set global to the actual text and do not incidentally overlap with data. - Disable the global bit when RANDSTRUCT is enabled as it partially defeats the hardening. - Make the page protection setup correct for vma->page_prot population again. The adjustment of the protections fell through the crack during the Global bit rework and triggers warnings on machines which do not support certain features, e.g. NX" * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/entry/64/compat: Preserve r8-r11 in int $0x80 x86/pti: Filter at vma->vm_page_prot population x86/pti: Disallow global kernel text with RANDSTRUCT x86/pti: Reduce amount of kernel text allowed to be Global x86/pti: Fix boot warning from Global-bit setting x86/pti: Fix boot problems from Global-bit setting
| * | | | | x86/entry/64/compat: Preserve r8-r11 in int $0x80Andy Lutomirski2018-04-272-18/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 32-bit user code that uses int $80 doesn't care about r8-r11. There is, however, some 64-bit user code that intentionally uses int $0x80 to invoke 32-bit system calls. From what I've seen, basically all such code assumes that r8-r15 are all preserved, but the kernel clobbers r8-r11. Since I doubt that there's any code that depends on int $0x80 zeroing r8-r11, change the kernel to preserve them. I suspect that very little user code is broken by the old clobber, since r8-r11 are only rarely allocated by gcc, and they're clobbered by function calls, so they only way we'd see a problem is if the same function that invokes int $0x80 also spills something important to one of these registers. The current behavior seems to date back to the historical commit "[PATCH] x86-64 merge for 2.6.4". Before that, all regs were preserved. I can't find any explanation of why this change was made. Update the test_syscall_vdso_32 testcase as well to verify the new behavior, and it strengthens the test to make sure that the kernel doesn't accidentally permute r8..r15. Suggested-by: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Link: https://lkml.kernel.org/r/d4c4d9985fbe64f8c9e19291886453914b48caee.1523975710.git.luto@kernel.org
| * | | | | x86/pti: Filter at vma->vm_page_prot populationDave Hansen2018-04-253-1/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit ce9962bf7e22bb3891655c349faff618922d4a73 0day reported warnings at boot on 32-bit systems without NX support: attempted to set unsupported pgprot: 8000000000000025 bits: 8000000000000000 supported: 7fffffffffffffff WARNING: CPU: 0 PID: 1 at arch/x86/include/asm/pgtable.h:540 handle_mm_fault+0xfc1/0xfe0: check_pgprot at arch/x86/include/asm/pgtable.h:535 (inlined by) pfn_pte at arch/x86/include/asm/pgtable.h:549 (inlined by) do_anonymous_page at mm/memory.c:3169 (inlined by) handle_pte_fault at mm/memory.c:3961 (inlined by) __handle_mm_fault at mm/memory.c:4087 (inlined by) handle_mm_fault at mm/memory.c:4124 The problem is that due to the recent commit which removed auto-massaging of page protections, filtering page permissions at PTE creation time is not longer done, so vma->vm_page_prot is passed unfiltered to PTE creation. Filter the page protections before they are installed in vma->vm_page_prot. Fixes: fb43d6cb91 ("x86/mm: Do not auto-massage page protections") Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Nadav Amit <namit@vmware.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Link: https://lkml.kernel.org/r/20180420222028.99D72858@viggo.jf.intel.com
| * | | | | x86/pti: Disallow global kernel text with RANDSTRUCTDave Hansen2018-04-251-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 26d35ca6c3776784f8156e1d6f80cc60d9a2a915 RANDSTRUCT derives its hardening benefits from the attacker's lack of knowledge about the layout of kernel data structures. Keep the kernel image non-global in cases where RANDSTRUCT is in use to help keep the layout a secret. Fixes: 8c06c7740 (x86/pti: Leave kernel text global for !PCID) Reported-by: Kees Cook <keescook@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Link: https://lkml.kernel.org/r/20180420222026.D0B4AAC9@viggo.jf.intel.com
| * | | | | x86/pti: Reduce amount of kernel text allowed to be GlobalDave Hansen2018-04-251-3/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit abb67605203687c8b7943d760638d0301787f8d9 Kees reported to me that I made too much of the kernel image global. It was far more than just text: I think this is too much set global: _end is after data, bss, and brk, and all kinds of other stuff that could hold secrets. I think this should match what mark_rodata_ro() is doing. This does exactly that. We use __end_rodata_hpage_align as our marker both because it is huge-page-aligned and it does not contain any sections we expect to hold secrets. Kees's logic was that r/o data is in the kernel image anyway and, in the case of traditional distributions, can be freely downloaded from the web, so there's no reason to hide it. Fixes: 8c06c7740 (x86/pti: Leave kernel text global for !PCID) Reported-by: Kees Cook <keescook@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Link: https://lkml.kernel.org/r/20180420222023.1C8B2B20@viggo.jf.intel.com
| * | | | | x86/pti: Fix boot warning from Global-bit settingDave Hansen2018-04-251-10/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 231df823c4f04176f607afc4576c989895cff40e The pageattr.c code attempts to process "faults" when it goes looking for PTEs to change and finds non-present entries. It allows these faults in the linear map which is "expected to have holes", but WARN()s about them elsewhere, like when called on the kernel image. However, change_page_attr_clear() is now called on the kernel image in the process of trying to clear the Global bit. This trips the warning in __cpa_process_fault() if a non-present PTE is encountered in the kernel image. The "holes" in the kernel image result from free_init_pages()'s use of set_memory_np(). These holes are totally fine, and result from normal operation, just as they would be in the kernel linear map. Just silence the warning when holes in the kernel image are encountered. Fixes: 39114b7a7 (x86/pti: Never implicitly clear _PAGE_GLOBAL for kernel image) Reported-by: Mariusz Ceier <mceier@gmail.com> Reported-by: Aaro Koskinen <aaro.koskinen@nokia.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Aaro Koskinen <aaro.koskinen@nokia.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Nadav Amit <namit@vmware.com> Cc: Kees Cook <keescook@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Link: https://lkml.kernel.org/r/20180420222021.1C7D2B3F@viggo.jf.intel.com
| * | | | | x86/pti: Fix boot problems from Global-bit settingDave Hansen2018-04-251-2/+2
| |/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 16dce603adc9de4237b7bf2ff5c5290f34373e7b Part of the global bit _setting_ patches also includes clearing the Global bit when it should not be enabled. That is done with set_memory_nonglobal(), which uses change_page_attr_clear() in pageattr.c under the covers. The TLB flushing code inside pageattr.c has has checks like BUG_ON(irqs_disabled()), looking for interrupt disabling that might cause deadlocks. But, these also trip in early boot on certain preempt configurations. Just copy the existing BUG_ON() sequence from cpa_flush_range() to the other two sites and check for early boot. Fixes: 39114b7a7 (x86/pti: Never implicitly clear _PAGE_GLOBAL for kernel image) Reported-by: Mariusz Ceier <mceier@gmail.com> Reported-by: Aaro Koskinen <aaro.koskinen@nokia.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Aaro Koskinen <aaro.koskinen@nokia.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Nadav Amit <namit@vmware.com> Cc: Kees Cook <keescook@google.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: linux-mm@kvack.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Link: https://lkml.kernel.org/r/20180420222019.20C4A410@viggo.jf.intel.com
* | | | | Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds2018-04-2915-109/+119
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "Two fixes from the timer departement: - Fix a long standing issue in the NOHZ tick code which causes RB tree corruption, delayed timers and other malfunctions. The cause for this is code which modifies the expiry time of an enqueued hrtimer. - Revert the CLOCK_MONOTONIC/CLOCK_BOOTTIME unification due to regression reports. Seems userspace _is_ relying on the documented behaviour despite our hope that it wont" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME tick/sched: Do not mess with an enqueued hrtimer
| * | | | | Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIMEThomas Gleixner2018-04-2615-104/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Revert commits 92af4dcb4e1c ("tracing: Unify the "boot" and "mono" tracing clocks") 127bfa5f4342 ("hrtimer: Unify MONOTONIC and BOOTTIME clock behavior") 7250a4047aa6 ("posix-timers: Unify MONOTONIC and BOOTTIME clock behavior") d6c7270e913d ("timekeeping: Remove boot time specific code") f2d6fdbfd238 ("Input: Evdev - unify MONOTONIC and BOOTTIME clock behavior") d6ed449afdb3 ("timekeeping: Make the MONOTONIC clock behave like the BOOTTIME clock") 72199320d49d ("timekeeping: Add the new CLOCK_MONOTONIC_ACTIVE clock") As stated in the pull request for the unification of CLOCK_MONOTONIC and CLOCK_BOOTTIME, it was clear that we might have to revert the change. As reported by several folks systemd and other applications rely on the documented behaviour of CLOCK_MONOTONIC on Linux and break with the above changes. After resume daemons time out and other timeout related issues are observed. Rafael compiled this list: * systemd kills daemons on resume, after >WatchdogSec seconds of suspending (Genki Sky). [Verified that that's because systemd uses CLOCK_MONOTONIC and expects it to not include the suspend time.] * systemd-journald misbehaves after resume: systemd-journald[7266]: File /var/log/journal/016627c3c4784cd4812d4b7e96a34226/system.journal corrupted or uncleanly shut down, renaming and replacing. (Mike Galbraith). * NetworkManager reports "networking disabled" and networking is broken after resume 50% of the time (Pavel). [May be because of systemd.] * MATE desktop dims the display and starts the screensaver right after system resume (Pavel). * Full system hang during resume (me). [May be due to systemd or NM or both.] That happens on debian and open suse systems. It's sad, that these problems were neither catched in -next nor by those folks who expressed interest in this change. Reported-by: Rafael J. Wysocki <rjw@rjwysocki.net> Reported-by: Genki Sky <sky@genki.is>, Reported-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kevin Easton <kevin@guarana.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salyzyn <salyzyn@android.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org>
| * | | | | tick/sched: Do not mess with an enqueued hrtimerThomas Gleixner2018-04-261-5/+5
| |/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kaike reported that in tests rdma hrtimers occasionaly stopped working. He did great debugging, which provided enough context to decode the problem. CPU 3 CPU 2 idle start sched_timer expires = 712171000000 queue->next = sched_timer start rdmavt timer. expires = 712172915662 lock(baseof(CPU3)) tick_nohz_stop_tick() tick = 716767000000 timerqueue_add(tmr) hrtimer_set_expires(sched_timer, tick); sched_timer->expires = 716767000000 <---- FAIL if (tmr->expires < queue->next->expires) hrtimer_start(sched_timer) queue->next = tmr; lock(baseof(CPU3)) unlock(baseof(CPU3)) timerqueue_remove() timerqueue_add() ts->sched_timer is queued and queue->next is pointing to it, but then ts->sched_timer.expires is modified. This not only corrupts the ordering of the timerqueue RB tree, it also makes CPU2 see the new expiry time of timerqueue->next->expires when checking whether timerqueue->next needs to be updated. So CPU2 sees that the rdma timer is earlier than timerqueue->next and sets the rdma timer as new next. Depending on whether it had also seen the new time at RB tree enqueue, it might have queued the rdma timer at the wrong place and then after removing the sched_timer the RB tree is completely hosed. The problem was introduced with a commit which tried to solve inconsistency between the hrtimer in the tick_sched data and the underlying hardware clockevent. It split out hrtimer_set_expires() to store the new tick time in both the NOHZ and the NOHZ + HIGHRES case, but missed the fact that in the NOHZ + HIGHRES case the hrtimer might still be queued. Use hrtimer_start(timer, tick...) for the NOHZ + HIGHRES case which sets timer->expires after canceling the timer and move the hrtimer_set_expires() invocation into the NOHZ only code path which is not affected as it merily uses the hrtimer as next event storage so code pathes can be shared with the NOHZ + HIGHRES case. Fixes: d4af6d933ccf ("nohz: Fix spurious warning when hrtimer and clockevent get out of sync") Reported-by: "Wan Kaike" <kaike.wan@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Frederic Weisbecker <frederic@kernel.org> Cc: "Marciniszyn Mike" <mike.marciniszyn@intel.com> Cc: Anna-Maria Gleixner <anna-maria@linutronix.de> Cc: linux-rdma@vger.kernel.org Cc: "Dalessandro Dennis" <dennis.dalessandro@intel.com> Cc: "Fleck John" <john.fleck@intel.com> Cc: stable@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: "Weiny Ira" <ira.weiny@intel.com> Cc: "linux-rdma@vger.kernel.org" Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1804241637390.1679@nanos.tec.linutronix.de Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1804242119210.1597@nanos.tec.linutronix.de
* | | | | Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds2018-04-2913-78/+129
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "The perf update contains the following bits: x86: - Prevent setting freeze_on_smi on PerfMon V1 CPUs to avoid #GP perf stat: - Keep the '/' event modifier separator in fallback, for example when fallbacking from 'cpu/cpu-cycles/' to user level only, where it should become 'cpu/cpu-cycles/u' and not 'cpu/cpu-cycles/:u' (Jiri Olsa) - Fix PMU events parsing rule, improving error reporting for invalid events (Jiri Olsa) - Disable write_backward and other event attributes for !group events in a group, fixing, for instance this group: '{cycles,msr/aperf/}:S' that has leader sampling (:S) and where just the 'cycles', the leader event, should have the write_backward attribute set, in this case it all fails because the PMU where 'msr/aperf/' lives doesn't accepts write_backward style sampling (Jiri Olsa) - Only fall back group read for leader (Kan Liang) - Fix core PMU alias list for x86 platform (Kan Liang) - Print out hint for mixed PMU group error (Kan Liang) - Fix duplicate PMU name for interval print (Kan Liang) Core: - Set main kernel end address properly when reading kernel and module maps (Namhyung Kim) perf mem: - Fix incorrect entries and add missing man options (Sangwon Hong) s/390: - Remove s390 specific strcmp_cpuid_cmp function (Thomas Richter) - Adapt 'perf test' case record+probe_libc_inet_pton.sh for s390 - Fix s390 undefined record__auxtrace_init() return value in 'perf record' (Thomas Richter)" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Don't enable freeze-on-smi for PerfMon V1 perf stat: Fix duplicate PMU name for interval print perf evsel: Only fall back group read for leader perf stat: Print out hint for mixed PMU group error perf pmu: Fix core PMU alias list for X86 platform perf record: Fix s390 undefined record__auxtrace_init() return value perf mem: Document incorrect and missing options perf evsel: Disable write_backward for leader sampling group events perf pmu: Fix pmu events parsing rule perf stat: Keep the / modifier separator in fallback perf test: Adapt test case record+probe_libc_inet_pton.sh for s390 perf list: Remove s390 specific strcmp_cpuid_cmp function perf machine: Set main kernel end address properly
| * \ \ \ \ Merge tag 'perf-urgent-for-mingo-4.17-20180425' of ↵Ingo Molnar2018-04-2612-75/+123
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent Pull perf/urgent fixes from Arnaldo Carvalho de Melo: perf stat: - Keep the '/' event modifier separator in fallback, for example when fallbacking from 'cpu/cpu-cycles/' to user level only, where it should become 'cpu/cpu-cycles/u' and not 'cpu/cpu-cycles/:u' (Jiri Olsa) - Fix PMU events parsing rule, improving error reporting for invalid events (Jiri Olsa) - Disable write_backward and other event attributes for !group events in a group, fixing, for instance this group: '{cycles,msr/aperf/}:S' that has leader sampling (:S) and where just the 'cycles', the leader event, should have the write_backward attribute set, in this case it all fails because the PMU where 'msr/aperf/' lives doesn't accepts write_backward style sampling (Jiri Olsa) - Only fall back group read for leader (Kan Liang) - Fix core PMU alias list for x86 platform (Kan Liang) - Print out hint for mixed PMU group error (Kan Liang) - Fix duplicate PMU name for interval print (Kan Liang) Core: - Set main kernel end address properly when reading kernel and module maps (Namhyung Kim) perf mem: - Fix incorrect entries and add missing man options (Sangwon Hong) s/390: - Remove s390 specific strcmp_cpuid_cmp function (Thomas Richter) - Adapt 'perf test' case record+probe_libc_inet_pton.sh for s390 - Fix s390 undefined record__auxtrace_init() return value in 'perf record' (Thomas Richter) Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | | | | perf stat: Fix duplicate PMU name for interval printKan Liang2018-04-242-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PMU name is printed repeatedly for interval print, for example: perf stat --no-merge -e 'unc_m_clockticks' -a -I 1000 # time counts unit events 1.001053069 243,702,144 unc_m_clockticks [uncore_imc_4] 1.001053069 244,268,304 unc_m_clockticks [uncore_imc_2] 1.001053069 244,427,386 unc_m_clockticks [uncore_imc_0] 1.001053069 244,583,760 unc_m_clockticks [uncore_imc_5] 1.001053069 244,738,971 unc_m_clockticks [uncore_imc_3] 1.001053069 244,880,309 unc_m_clockticks [uncore_imc_1] 2.002024821 240,818,200 unc_m_clockticks [uncore_imc_4] [uncore_imc_4] 2.002024821 240,767,812 unc_m_clockticks [uncore_imc_2] [uncore_imc_2] 2.002024821 240,764,215 unc_m_clockticks [uncore_imc_0] [uncore_imc_0] 2.002024821 240,759,504 unc_m_clockticks [uncore_imc_5] [uncore_imc_5] 2.002024821 240,755,992 unc_m_clockticks [uncore_imc_3] [uncore_imc_3] 2.002024821 240,750,403 unc_m_clockticks [uncore_imc_1] [uncore_imc_1] For each print, the PMU name is unconditionally appended to the counter->name. Need to check the counter->name first. If the PMU name is already appended, do nothing. Committer notes: Add and use perf_evsel->uniquified_name bool instead of doing the more expensive strstr(event->name, pmu->name). Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Agustin Vega-Frias <agustinv@codeaurora.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Shaokun Zhang <zhangshaokun@hisilicon.com> Cc: Will Deacon <will.deacon@arm.com> Fixes: 8c5421c016a4 ("perf pmu: Display pmu name when printing unmerged events in stat") Link: http://lkml.kernel.org/r/1524594014-79243-5-git-send-email-kan.liang@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * | | | | perf evsel: Only fall back group read for leaderKan Liang2018-04-241-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Perf doesn't support mixed events from different PMUs (except software event) in a group. The perf stat should output <not counted>/<not supported> for all events, but it doesn't. For example, perf stat -e '{cycles,uncore_imc_5/umask=0xF,event=0x4/,instructions}' <not counted> cycles <not supported> uncore_imc_5/umask=0xF,event=0x4/ 1,024,300 instructions If perf fails to open an event, it doesn't error out directly. It will disable some features and retry, until the event is opened or all features are disabled. The disabled features will not be re-enabled. The group read is one of these features. For the example as above, the IMC event and the leader event "cycles" are from different PMUs. Opening the IMC event must fail. The group read feature must be disabled for IMC event and the followed event "instructions". The "instructions" event has the same PMU as the leader "cycles". It can be opened successfully. Since the group read feature has been disabled, the "instructions" event will be read as a single event, which definitely has a value. The group read fallback is still useful for the case which kernel doesn't support group read. It is good enough to be handled only by the leader. For the fallback request from members, it must be caused by an error. The fallback only breaks the semantics of group. Limit the group read fallback only for the leader. Committer testing: On a broadwell t450s notebook: Before: # perf stat -e '{cycles,unc_cbo_cache_lookup.read_i,instructions}' sleep 1 Performance counter stats for 'sleep 1': <not counted> cycles <not supported> unc_cbo_cache_lookup.read_i 818,206 instructions 1.003170887 seconds time elapsed Some events weren't counted. Try disabling the NMI watchdog: echo 0 > /proc/sys/kernel/nmi_watchdog perf stat ... echo 1 > /proc/sys/kernel/nmi_watchdog After: # perf stat -e '{cycles,unc_cbo_cache_lookup.read_i,instructions}' sleep 1 Performance counter stats for 'sleep 1': <not counted> cycles <not supported> unc_cbo_cache_lookup.read_i <not counted> instructions 1.001380511 seconds time elapsed Some events weren't counted. Try disabling the NMI watchdog: echo 0 > /proc/sys/kernel/nmi_watchdog perf stat ... echo 1 > /proc/sys/kernel/nmi_watchdog # Reported-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Agustin Vega-Frias <agustinv@codeaurora.org> Cc: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Shaokun Zhang <zhangshaokun@hisilicon.com> Cc: Will Deacon <will.deacon@arm.com> Fixes: 82bf311e15d2 ("perf stat: Use group read for event groups") Link: http://lkml.kernel.org/r/1524594014-79243-3-git-send-email-kan.liang@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * | | | | perf stat: Print out hint for mixed PMU group errorKan Liang2018-04-241-1/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Perf doesn't support mixed events from different PMUs (except software event) in a group. For this case, only "<not counted>" or "<not supported>" are printed out. There is no hint which guides users to fix the issue. Checking the PMU type of events to determine if they are from the same PMU. There may be false alarm for the checking. E.g. the core PMU has different PMU type. But it should not happen often. The false alarm can also be tolerated, because: - It only happens on error path. - It just provides a possible solution for the issue. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Cc: Agustin Vega-Frias <agustinv@codeaurora.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Shaokun Zhang <zhangshaokun@hisilicon.com> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/1524594014-79243-2-git-send-email-kan.liang@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * | | | | perf pmu: Fix core PMU alias list for X86 platformKan Liang2018-04-241-13/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When counting uncore event with alias, core event is mistakenly involved, for example: perf stat --no-merge -e "unc_m_cas_count.all" -C0 sleep 1 Performance counter stats for 'CPU(s) 0': 0 unc_m_cas_count.all [uncore_imc_4] 0 unc_m_cas_count.all [uncore_imc_2] 0 unc_m_cas_count.all [uncore_imc_0] 153,640 unc_m_cas_count.all [cpu] 0 unc_m_cas_count.all [uncore_imc_5] 25,026 unc_m_cas_count.all [uncore_imc_3] 0 unc_m_cas_count.all [uncore_imc_1] 1.001447890 seconds time elapsed The reason is that current implementation doesn't check PMU name of a event when adding its alias into the alias list for core PMU. The uncore event aliases are mistakenly added. This bug was introduced in: commit 14b22ae028de ("perf pmu: Add helper function is_pmu_core to detect PMU CORE devices") Checking the PMU name for all PMUs on X86 and other architectures except ARM. There is no behavior change for ARM. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Agustin Vega-Frias <agustinv@codeaurora.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Shaokun Zhang <zhangshaokun@hisilicon.com> Cc: Will Deacon <will.deacon@arm.com> Fixes: 14b22ae028de ("perf pmu: Add helper function is_pmu_core to detect PMU CORE devices") Link: http://lkml.kernel.org/r/1524594014-79243-1-git-send-email-kan.liang@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * | | | | perf record: Fix s390 undefined record__auxtrace_init() return valueThomas Richter2018-04-231-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Command 'perf record' calls: cmd_report() record__auxtrace_init() auxtrace_record__init() On s390 function auxtrace_record__init() returns random return value due to missing initialization. This sometime causes 'perf record' to exit immediately without error message and creating a perf.data file. Fix this by setting error the return code to zero before returning from platform specific functions which may not set the error code in call cases. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Link: http://lkml.kernel.org/r/20180423142940.21143-1-tmricht@linux.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * | | | | perf mem: Document incorrect and missing optionsSangwon Hong2018-04-231-12/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Several options were incorrectly described, some lacked describing required arguments while others were simply not documented, fix it. Signed-off-by: Sangwon Hong <qpakzk@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Taeung Song <treeze.taeung@gmail.com> Link: http://lkml.kernel.org/r/1524382146-19609-1-git-send-email-qpakzk@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * | | | | perf evsel: Disable write_backward for leader sampling group eventsJiri Olsa2018-04-232-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | .. and other related fields that do not need to be enabled for events that have sampling leader. It fixes the perf top usage Ingo reported broken: # perf top -e '{cycles,msr/aperf/}:S' The 'msr/aperf/' event is configured for write_back sampling, which is not allowed by the MSR PMU, so it fails to create the event. Adjusting related attr test. Reported-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180423090823.32309-6-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * | | | | perf pmu: Fix pmu events parsing ruleJiri Olsa2018-04-231-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently all the event parsing fails end up in the event_pmu rule, and display misleading help like: $ perf stat -e inst kill event syntax error: 'inst' \___ Cannot find PMU `inst'. Missing kernel support? ... The reason is that the event_pmu is too strong and match also single string. Changing it to force the '/' separators to be part of the rule, and getting the proper error now: $ perf stat -e inst kill event syntax error: 'inst' \___ parser error Run 'perf list' for a list of valid events ... Signed-off-by: Jiri Olsa <jolsa@kernel.org> Reported-by: Ingo Molnar <mingo@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180423090823.32309-5-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * | | | | perf stat: Keep the / modifier separator in fallbackJiri Olsa2018-04-231-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'perf stat' fallback for EACCES error sets the exclude_kernel perf_event_attr and tries perf_event_open() again with it. In addition, it also changes the name of the event to reflect that change by adding the 'u' modifier. But it does not take into account the '/' separator, so the event name can end up mangled, like: (note the '/:' characters) $ perf stat -e cpu/cpu-cycles/ kill ... 386,832 cpu/cpu-cycles/:u Adding the code to check on the '/' separator and set the following correct event name: $ perf stat -e cpu/cpu-cycles/ kill ... 388,548 cpu/cpu-cycles/u Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180423090823.32309-4-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * | | | | perf test: Adapt test case record+probe_libc_inet_pton.sh for s390Thomas Richter2018-04-231-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | perf test case 58 (record+probe_libc_inet_pton.sh) executed on s390x using kernel 4.16.0rc3 displays this result: # perf trace --no-syscalls -e probe_libc:inet_pton/call-graph=dwarf/ ping -6 -c 1 ::1 probe_libc:inet_pton: (3ffa0240448) __GI___inet_pton (/usr/lib64/libc-2.26.so) gaih_inet (inlined) __GI_getaddrinfo (inlined) main (/usr/bin/ping) __libc_start_main (/usr/lib64/libc-2.26.so) _start (/usr/bin/ping) After I installed kernel 4.16.0 the same tests uses commands: # perf record -e probe_libc:inet_pton/call-graph=dwarf/ -o /tmp/perf.data.abc ping -6 -c 1 ::1 # perf script -i /tmp/perf.data.abc and displays: ping 39048 [006] 84230.381198: probe_libc:inet_pton: (3ffa0240448) 140448 __GI___inet_pton (/usr/lib64/libc-2.26.so) fbde1 gaih_inet (inlined) fe2b9 __GI_getaddrinfo (inlined) 398d main (/usr/bin/ping) Nothing else changed including glibc elfutils and other libraries picked up by the build. The entries for __libc_start_main and _start are missing. I bisected missing __libc_start_main and _start to commit Fixes: 3d20c6246690 ("perf unwind: Unwind with libdw doesn't take symfs into account") When I undo this commit I get this call stack on s390: [root@s35lp76 perf]# ./perf script -i /tmp/perf.data.abc ping 39048 [006] 84230.381198: probe_libc:inet_pton: (3ffa0240448) 140448 __GI___inet_pton (/usr/lib64/libc-2.26.so) fbde1 gaih_inet (inlined) fe2b9 __GI_getaddrinfo (inlined) 398d main (/usr/bin/ping) 22fbd __libc_start_main (/usr/lib64/libc-2.26.so) 457b _start (/usr/bin/ping) Looks like dwarf functions dwfl_xxx create different call back stack trace when using file /usr/lib/debug/usr/bin/ping-20161105-7.fc27.s390x.debug instead of file /usr/bin/ping. Fix this test case on s390 and do not expect any call back stack entry after the main() function. Also be more robust and accept a leading __GI_ prefix in front of getaddrinfo. On x86 this test case shows the same call stack using both kernel versions 4.16.0rc3 and 4.16.0 and also stops at main: [root@f27 perf]# ./perf script -i /tmp/perf.data.tmr ping 4446 [000] 172.027088: probe_libc:inet_pton: (7fdfa08c93c0) 1393c0 __GI___inet_pton (/usr/lib64/libc-2.26.so) fe60d getaddrinfo (/usr/lib64/libc-2.26.so) 2f40 main (/usr/bin/ping) [root@f27 perf]# Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Martin Vuille <jpmv27@aim.com> Link: http://lkml.kernel.org/r/20180423082428.7930-1-tmricht@linux.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
| | * | | | | perf list: Remove s390 specific strcmp_cpuid_cmp functionThomas Richter2018-04-233-24/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make the type field in pmu-events/arch/s390/mapfile.cvs more generic to match the created cpuid string for s390. The pattern also checks for the counter first version number and counter second version number ([13]\.[1-5]) and the authorization field which follows. These numbers do not exist in the cpuid identification string when perf commands are executed on a z/VM environment (which does not support CPU counter measurement facility). CPUID string for LPAR: cpuid : IBM,3906,704,M03,3.5,002f CPUID string for z/VM: cpuid : IBM,2964,702,N96 This allows the removal of s390 specific cpuid compare code and uses the common compare function with its regular expression matching algorithm. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Link: http://lkml.kernel.org/r/20180423081745.3672-1-tmricht@linux.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>