summaryrefslogtreecommitdiffstats
path: root/include/uapi
Commit message (Collapse)AuthorAgeFilesLines
* bpf: fix uapi hole for 32 bit compat applicationsDaniel Borkmann2018-06-011-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In 64 bit, we have a 4 byte hole between ifindex and netns_dev in the case of struct bpf_map_info but also struct bpf_prog_info. In net-next commit b85fab0e67b ("bpf: Add gpl_compatible flag to struct bpf_prog_info") added a bitfield into it to expose some flags related to programs. Thus, add an unnamed __u32 bitfield for both so that alignment keeps the same in both 32 and 64 bit cases, and can be naturally extended from there as in b85fab0e67b. Before: # file test.o test.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped # pahole test.o struct bpf_map_info { __u32 type; /* 0 4 */ __u32 id; /* 4 4 */ __u32 key_size; /* 8 4 */ __u32 value_size; /* 12 4 */ __u32 max_entries; /* 16 4 */ __u32 map_flags; /* 20 4 */ char name[16]; /* 24 16 */ __u32 ifindex; /* 40 4 */ __u64 netns_dev; /* 44 8 */ __u64 netns_ino; /* 52 8 */ /* size: 64, cachelines: 1, members: 10 */ /* padding: 4 */ }; After (same as on 64 bit): # file test.o test.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped # pahole test.o struct bpf_map_info { __u32 type; /* 0 4 */ __u32 id; /* 4 4 */ __u32 key_size; /* 8 4 */ __u32 value_size; /* 12 4 */ __u32 max_entries; /* 16 4 */ __u32 map_flags; /* 20 4 */ char name[16]; /* 24 16 */ __u32 ifindex; /* 40 4 */ /* XXX 4 bytes hole, try to pack */ __u64 netns_dev; /* 48 8 */ __u64 netns_ino; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ /* size: 64, cachelines: 1, members: 10 */ /* sum members: 60, holes: 1, sum holes: 4 */ }; Reported-by: Dmitry V. Levin <ldv@altlinux.org> Reported-by: Eugene Syromiatnikov <esyr@redhat.com> Fixes: 52775b33bb507 ("bpf: offload: report device information about offloaded maps") Fixes: 675fc275a3a2d ("bpf: offload: report device information for offloaded programs") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2018-05-252-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: "Let's begin the holiday weekend with some networking fixes: 1) Whoops need to restrict cfg80211 wiphy names even more to 64 bytes. From Eric Biggers. 2) Fix flags being ignored when using kernel_connect() with SCTP, from Xin Long. 3) Use after free in DCCP, from Alexey Kodanev. 4) Need to check rhltable_init() return value in ipmr code, from Eric Dumazet. 5) XDP handling fixes in virtio_net from Jason Wang. 6) Missing RTA_TABLE in rtm_ipv4_policy[], from Roopa Prabhu. 7) Need to use IRQ disabling spinlocks in mlx4_qp_lookup(), from Jack Morgenstein. 8) Prevent out-of-bounds speculation using indexes in BPF, from Daniel Borkmann. 9) Fix regression added by AF_PACKET link layer cure, from Willem de Bruijn. 10) Correct ENIC dma mask, from Govindarajulu Varadarajan. 11) Missing config options for PMTU tests, from Stefano Brivio" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits) ibmvnic: Fix partial success login retries selftests/net: Add missing config options for PMTU tests mlx4_core: allocate ICM memory in page size chunks enic: set DMA mask to 47 bit ppp: remove the PPPIOCDETACH ioctl ipv4: remove warning in ip_recv_error net : sched: cls_api: deal with egdev path only if needed vhost: synchronize IOTLB message with dev cleanup packet: fix reserve calculation net/mlx5: IPSec, Fix a race between concurrent sandbox QP commands net/mlx5e: When RXFCS is set, add FCS data into checksum calculation bpf: properly enforce index mask to prevent out-of-bounds speculation net/mlx4: Fix irq-unsafe spinlock usage net: phy: broadcom: Fix bcm_write_exp() net: phy: broadcom: Fix auxiliary control register reads net: ipv4: add missing RTA_TABLE to rtm_ipv4_policy net/mlx4: fix spelling mistake: "Inrerface" -> "Interface" and rephrase message ibmvnic: Only do H_EOI for mobility events tuntap: correctly set SOCKWQ_ASYNC_NOSPACE virtio-net: fix leaking page for gso packet during mergeable XDP ...
| * ppp: remove the PPPIOCDETACH ioctlEric Biggers2018-05-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The PPPIOCDETACH ioctl effectively tries to "close" the given ppp file before f_count has reached 0, which is fundamentally a bad idea. It does check 'f_count < 2', which excludes concurrent operations on the file since they would only be possible with a shared fd table, in which case each fdget() would take a file reference. However, it fails to account for the fact that even with 'f_count == 1' the file can still be linked into epoll instances. As reported by syzbot, this can trivially be used to cause a use-after-free. Yet, the only known user of PPPIOCDETACH is pppd versions older than ppp-2.4.2, which was released almost 15 years ago (November 2003). Also, PPPIOCDETACH apparently stopped working reliably at around the same time, when the f_count check was added to the kernel, e.g. see https://lkml.org/lkml/2002/12/31/83. Also, the current 'f_count < 2' check makes PPPIOCDETACH only work in single-threaded applications; it always fails if called from a multithreaded application. All pppd versions released in the last 15 years just close() the file descriptor instead. Therefore, instead of hacking around this bug by exporting epoll internals to modules, and probably missing other related bugs, just remove the PPPIOCDETACH ioctl and see if anyone actually notices. Leave a stub in place that prints a one-time warning and returns EINVAL. Reported-by: syzbot+16363c99d4134717c05b@syzkaller.appspotmail.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Guillaume Nault <g.nault@alphalink.fr> Tested-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
| * cfg80211: further limit wiphy names to 64 bytesEric Biggers2018-05-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | wiphy names were recently limited to 128 bytes by commit a7cfebcb7594 ("cfg80211: limit wiphy names to 128 bytes"). As it turns out though, this isn't sufficient because dev_vprintk_emit() needs the syslog header string "SUBSYSTEM=ieee80211\0DEVICE=+ieee80211:$devname" to fit into 128 bytes. This triggered the "device/subsystem name too long" WARN when the device name was >= 90 bytes. As before, this was reproduced by syzbot by sending an HWSIM_CMD_NEW_RADIO command to the MAC80211_HWSIM generic netlink family. Fix it by further limiting wiphy names to 64 bytes. Reported-by: syzbot+e64565577af34b3768dc@syzkaller.appspotmail.com Fixes: a7cfebcb7594 ("cfg80211: limit wiphy names to 128 bytes") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
* | Merge branch 'speck-v20' of ↵Linus Torvalds2018-05-212-2/+15
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Merge speculative store buffer bypass fixes from Thomas Gleixner: - rework of the SPEC_CTRL MSR management to accomodate the new fancy SSBD (Speculative Store Bypass Disable) bit handling. - the CPU bug and sysfs infrastructure for the exciting new Speculative Store Bypass 'feature'. - support for disabling SSB via LS_CFG MSR on AMD CPUs including Hyperthread synchronization on ZEN. - PRCTL support for dynamic runtime control of SSB - SECCOMP integration to automatically disable SSB for sandboxed processes with a filter flag for opt-out. - KVM integration to allow guests fiddling with SSBD including the new software MSR VIRT_SPEC_CTRL to handle the LS_CFG based oddities on AMD. - BPF protection against SSB .. this is just the core and x86 side, other architecture support will come separately. * 'speck-v20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits) bpf: Prevent memory disambiguation attack x86/bugs: Rename SSBD_NO to SSB_NO KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBD x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFG x86/bugs: Rework spec_ctrl base and mask logic x86/bugs: Remove x86_spec_ctrl_set() x86/bugs: Expose x86_spec_ctrl_base directly x86/bugs: Unify x86_spec_ctrl_{set_guest,restore_host} x86/speculation: Rework speculative_store_bypass_update() x86/speculation: Add virtualized speculative store bypass disable support x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRL x86/speculation: Handle HT correctly on AMD x86/cpufeatures: Add FEATURE_ZEN x86/cpufeatures: Disentangle SSBD enumeration x86/cpufeatures: Disentangle MSR_SPEC_CTRL enumeration from IBRS x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP KVM: SVM: Move spec control call after restore of GS x86/cpu: Make alternative_msr_write work for 32-bit code x86/bugs: Fix the parameters alignment and missing void x86/bugs: Make cpu_show_common() static ...
| * seccomp: Add filter flag to opt-out of SSB mitigationKees Cook2018-05-051-2/+3
| | | | | | | | | | | | | | | | | | If a seccomp user is not interested in Speculative Store Bypass mitigation by default, it can set the new SECCOMP_FILTER_FLAG_SPEC_ALLOW flag when adding filters. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * prctl: Add force disable speculationThomas Gleixner2018-05-051-0/+1
| | | | | | | | | | | | | | | | | | | | For certain use cases it is desired to enforce mitigations so they cannot be undone afterwards. That's important for loader stubs which want to prevent a child from disabling the mitigation again. Will also be used for seccomp(). The extra state preserving of the prctl state for SSB is a preparatory step for EBPF dymanic speculation control. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * prctl: Add speculation control prctlsThomas Gleixner2018-05-031-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add two new prctls to control aspects of speculation related vulnerabilites and their mitigations to provide finer grained control over performance impacting mitigations. PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature which is selected with arg2 of prctl(2). The return value uses bit 0-2 with the following meaning: Bit Define Description 0 PR_SPEC_PRCTL Mitigation can be controlled per task by PR_SET_SPECULATION_CTRL 1 PR_SPEC_ENABLE The speculation feature is enabled, mitigation is disabled 2 PR_SPEC_DISABLE The speculation feature is disabled, mitigation is enabled If all bits are 0 the CPU is not affected by the speculation misfeature. If PR_SPEC_PRCTL is set, then the per task control of the mitigation is available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation misfeature will fail. PR_SET_SPECULATION_CTRL allows to control the speculation misfeature, which is selected by arg2 of prctl(2) per task. arg3 is used to hand in the control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE. The common return values are: EINVAL prctl is not implemented by the architecture or the unused prctl() arguments are not 0 ENODEV arg2 is selecting a not supported speculation misfeature PR_SET_SPECULATION_CTRL has these additional return values: ERANGE arg3 is incorrect, i.e. it's not either PR_SPEC_ENABLE or PR_SPEC_DISABLE ENXIO prctl control of the selected speculation misfeature is disabled The first supported controlable speculation misfeature is PR_SPEC_STORE_BYPASS. Add the define so this can be shared between architectures. Based on an initial patch from Tim Chen and mostly rewritten. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2018-05-131-0/+3
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for your net tree, they are: 1) Fix handling of simultaneous open TCP connection in conntrack, from Jozsef Kadlecsik. 2) Insufficient sanitify check of xtables extension names, from Florian Westphal. 3) Skip unnecessary synchronize_rcu() call when transaction log is already empty, from Florian Westphal. 4) Incorrect destination mac validation in ebt_stp, from Stephen Hemminger. 5) xtables module reference counter leak in nft_compat, from Florian Westphal. 6) Incorrect connection reference counting logic in IPVS one-packet scheduler, from Julian Anastasov. 7) Wrong stats for 32-bits CPU in IPVS, also from Julian. 8) Calm down sparse error in netfilter core, also from Florian. 9) Use nla_strlcpy to fix compilation warning in nfnetlink_acct and nfnetlink_cthelper, again from Florian. 10) Missing module alias in icmp and icmp6 xtables extensions, from Florian Westphal. 11) Base chain statistics in nf_tables may be unset/null, from Florian. 12) Fix handling of large matchinfo size in nft_compat, this includes one preparation for before this fix. From Florian. 13) Fix bogus EBUSY error when deleting chains due to incorrect reference counting from the preparation phase of the two-phase commit protocol. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | netfilter: Fix handling simultaneous open in TCP conntrackJozsef Kadlecsik2018-04-271-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dominique Martinet reported a TCP hang problem when simultaneous open was used. The problem is that the tcp_conntracks state table is not smart enough to handle the case. The state table could be fixed by introducing a new state, but that would require more lines of code compared to this patch, due to the required backward compatibility with ctnetlink. Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Reported-by: Dominique Martinet <asmadeus@codewreck.org> Tested-by: Dominique Martinet <asmadeus@codewreck.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2018-05-111-0/+2
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Verify lengths of keys provided by the user is AF_KEY, from Kevin Easton. 2) Add device ID for BCM89610 PHY. Thanks to Bhadram Varka. 3) Add Spectre guards to some ATM code, courtesy of Gustavo A. R. Silva. 4) Fix infinite loop in NSH protocol code. To Eric Dumazet we are most grateful for this fix. 5) Line up /proc/net/netlink headers properly. This fix from YU Bo, we do appreciate. 6) Use after free in TLS code. Once again we are blessed by the honorable Eric Dumazet with this fix. 7) Fix regression in TLS code causing stalls on partial TLS records. This fix is bestowed upon us by Andrew Tomt. 8) Deal with too small MTUs properly in LLC code, another great gift from Eric Dumazet. 9) Handle cached route flushing properly wrt. MTU locking in ipv4, to Hangbin Liu we give thanks for this. 10) Fix regression in SO_BINDTODEVIC handling wrt. UDP socket demux. Paolo Abeni, he gave us this. 11) Range check coalescing parameters in mlx4 driver, thank you Moshe Shemesh. 12) Some ipv6 ICMP error handling fixes in rxrpc, from our good brother David Howells. 13) Fix kexec on mlx5 by freeing IRQs in shutdown path. Daniel Juergens, you're the best! 14) Don't send bonding RLB updates to invalid MAC addresses. Debabrata Benerjee saved us! 15) Uh oh, we were leaking in udp_sendmsg and ping_v4_sendmsg. The ship is now water tight, thanks to Andrey Ignatov. 16) IPSEC memory leak in ixgbe from Colin Ian King, man we've got holes everywhere! 17) Fix error path in tcf_proto_create, Jiri Pirko what would we do without you! * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (92 commits) net sched actions: fix refcnt leak in skbmod net: sched: fix error path in tcf_proto_create() when modules are not configured net sched actions: fix invalid pointer dereferencing if skbedit flags missing ixgbe: fix memory leak on ipsec allocation ixgbevf: fix ixgbevf_xmit_frame()'s return type ixgbe: return error on unsupported SFP module when resetting ice: Set rq_last_status when cleaning rq ipv4: fix memory leaks in udp_sendmsg, ping_v4_sendmsg mlxsw: core: Fix an error handling path in 'mlxsw_core_bus_device_register()' bonding: send learning packets for vlans on slave bonding: do not allow rlb updates to invalid mac net/mlx5e: Err if asked to offload TC match on frag being first net/mlx5: E-Switch, Include VF RDMA stats in vport statistics net/mlx5: Free IRQs in shutdown path rxrpc: Trace UDP transmission failure rxrpc: Add a tracepoint to log ICMP/ICMP6 and error messages rxrpc: Fix the min security level for kernel calls rxrpc: Fix error reception on AF_INET6 sockets rxrpc: Fix missing start of call timeout qed: fix spelling mistake: "taskelt" -> "tasklet" ...
| * \ \ Merge tag 'mac80211-for-davem-2018-05-09' of ↵David S. Miller2018-05-101-0/+2
| |\ \ \ | | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== We only have a few fixes this time: * WMM element validation * SAE timeout * add-BA timeout * docbook parsing * a few memory leaks in error paths ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | cfg80211: limit wiphy names to 128 bytesJohannes Berg2018-04-191-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's currently no limit on wiphy names, other than netlink message size and memory limitations, but that causes issues when, for example, the wiphy name is used in a uevent, e.g. in rfkill where we use the same name for the rfkill instance, and then the buffer there is "only" 2k for the environment variables. This was reported by syzkaller, which used a 4k name. Limit the name to something reasonable, I randomly picked 128. Reported-by: syzbot+230d9e642a85d3fec29c@syzkaller.appspotmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
* | | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds2018-05-0419-19/+19
|\ \ \ \ | |/ / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull rdma fixes from Doug Ledford: "This is our first pull request of the rc cycle. It's not that it's been overly quiet, we were just waiting on a few things before sending this off. For instance, the 6 patch series from Intel for the hfi1 driver had actually been pulled in on Tuesday for a Wednesday pull request, only to have Jason notice something I missed, so we held off for some testing, and then on Thursday had to respin the series because the very first patch needed a minor fix (unnecessary cast is all). There is a sizable hns patch series in here, as well as a reasonably largish hfi1 patch series, then all of the lines of uapi updates are just the change to the new official Linux-OpenIB SPDX tag (a bunch of our files had what amounts to a BSD-2-Clause + MIT Warranty statement as their license as a result of the initial code submission years ago, and the SPDX folks decided it was unique enough to warrant a unique tag), then the typical mlx4 and mlx5 updates, and finally some cxgb4 and core/cache/cma updates to round out the bunch. None of it was overly large by itself, but in the 2 1/2 weeks we've been collecting patches, it has added up :-/. As best I can tell, it's been through 0day (I got a notice about my last for-next push, but not for my for-rc push, but Jason seems to think that failure messages are prioritized and success messages not so much). It's also been through linux-next. And yes, we did notice in the context portion of the CMA query gid fix patch that there is a dubious BUG_ON() in the code, and have plans to audit our BUG_ON usage and remove it anywhere we can. Summary: - Various build fixes (USER_ACCESS=m and ADDR_TRANS turned off) - SPDX license tag cleanups (new tag Linux-OpenIB) - RoCE GID fixes related to default GIDs - Various fixes to: cxgb4, uverbs, cma, iwpm, rxe, hns (big batch), mlx4, mlx5, and hfi1 (medium batch)" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (52 commits) RDMA/cma: Do not query GID during QP state transition to RTR IB/mlx4: Fix integer overflow when calculating optimal MTT size IB/hfi1: Fix memory leak in exception path in get_irq_affinity() IB/{hfi1, rdmavt}: Fix memory leak in hfi1_alloc_devdata() upon failure IB/hfi1: Fix NULL pointer dereference when invalid num_vls is used IB/hfi1: Fix loss of BECN with AHG IB/hfi1 Use correct type for num_user_context IB/hfi1: Fix handling of FECN marked multicast packet IB/core: Make ib_mad_client_id atomic iw_cxgb4: Atomically flush per QP HW CQEs IB/uverbs: Fix kernel crash during MR deregistration flow IB/uverbs: Prevent reregistration of DM_MR to regular MR RDMA/mlx4: Add missed RSS hash inner header flag RDMA/hns: Fix a couple misspellings RDMA/hns: Submit bad wr RDMA/hns: Update assignment method for owner field of send wqe RDMA/hns: Adjust the order of cleanup hem table RDMA/hns: Only assign dqpn if IB_QP_PATH_DEST_QPN bit is set RDMA/hns: Remove some unnecessary attr_mask judgement RDMA/hns: Only assign mtu if IB_QP_PATH_MTU bit is set ...
| * | | uapi: Fix SPDX tags for files referring to the 'OpenIB.org' licenseJason Gunthorpe2018-04-2319-19/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Based on discussion with Kate Stewart this license is not a BSD-2-Clause, but is now formally identified as Linux-OpenIB by SPDX. The key difference between the licenses is in the 'warranty' paragraph. if_infiniband.h refers to the 'OpenIB.org' license, but does not include the text, instead it links to an obsolete web site that contains a license that matches the BSD-2-Clause SPX. There is no 'three clause' version of the OpenIB.org license. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Doug Ledford <dledford@redhat.com>
* | | | Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds2018-04-291-1/+0
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "Two fixes from the timer departement: - Fix a long standing issue in the NOHZ tick code which causes RB tree corruption, delayed timers and other malfunctions. The cause for this is code which modifies the expiry time of an enqueued hrtimer. - Revert the CLOCK_MONOTONIC/CLOCK_BOOTTIME unification due to regression reports. Seems userspace _is_ relying on the documented behaviour despite our hope that it wont" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME tick/sched: Do not mess with an enqueued hrtimer
| * | | | Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIMEThomas Gleixner2018-04-261-1/+0
| | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Revert commits 92af4dcb4e1c ("tracing: Unify the "boot" and "mono" tracing clocks") 127bfa5f4342 ("hrtimer: Unify MONOTONIC and BOOTTIME clock behavior") 7250a4047aa6 ("posix-timers: Unify MONOTONIC and BOOTTIME clock behavior") d6c7270e913d ("timekeeping: Remove boot time specific code") f2d6fdbfd238 ("Input: Evdev - unify MONOTONIC and BOOTTIME clock behavior") d6ed449afdb3 ("timekeeping: Make the MONOTONIC clock behave like the BOOTTIME clock") 72199320d49d ("timekeeping: Add the new CLOCK_MONOTONIC_ACTIVE clock") As stated in the pull request for the unification of CLOCK_MONOTONIC and CLOCK_BOOTTIME, it was clear that we might have to revert the change. As reported by several folks systemd and other applications rely on the documented behaviour of CLOCK_MONOTONIC on Linux and break with the above changes. After resume daemons time out and other timeout related issues are observed. Rafael compiled this list: * systemd kills daemons on resume, after >WatchdogSec seconds of suspending (Genki Sky). [Verified that that's because systemd uses CLOCK_MONOTONIC and expects it to not include the suspend time.] * systemd-journald misbehaves after resume: systemd-journald[7266]: File /var/log/journal/016627c3c4784cd4812d4b7e96a34226/system.journal corrupted or uncleanly shut down, renaming and replacing. (Mike Galbraith). * NetworkManager reports "networking disabled" and networking is broken after resume 50% of the time (Pavel). [May be because of systemd.] * MATE desktop dims the display and starts the screensaver right after system resume (Pavel). * Full system hang during resume (me). [May be due to systemd or NM or both.] That happens on debian and open suse systems. It's sad, that these problems were neither catched in -next nor by those folks who expressed interest in this change. Reported-by: Rafael J. Wysocki <rjw@rjwysocki.net> Reported-by: Genki Sky <sky@genki.is>, Reported-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kevin Easton <kevin@guarana.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salyzyn <salyzyn@android.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org>
* | | | rMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2018-04-271-0/+7
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM fixes from Radim Krčmář: "ARM: - PSCI selection API, a leftover from 4.16 (for stable) - Kick vcpu on active interrupt affinity change - Plug a VMID allocation race on oversubscribed systems - Silence debug messages - Update Christoffer's email address (linaro -> arm) x86: - Expose userspace-relevant bits of a newly added feature - Fix TLB flushing on VMX with VPID, but without EPT" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: x86/headers/UAPI: Move DISABLE_EXITS KVM capability bits to the UAPI kvm: apic: Flush TLB after APIC mode/address change if VPIDs are in use arm/arm64: KVM: Add PSCI version selection API KVM: arm/arm64: vgic: Kick new VCPU on interrupt migration arm64: KVM: Demote SVE and LORegion warnings to debug only MAINTAINERS: Update e-mail address for Christoffer Dall KVM: arm/arm64: Close VMID generation race
| * | | | x86/headers/UAPI: Move DISABLE_EXITS KVM capability bits to the UAPIKarimAllah Ahmed2018-04-271-0/+7
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move DISABLE_EXITS KVM capability bits to the UAPI just like the rest of capabilities. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: x86@kernel.org Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
* | | | Merge tag 'staging-4.17-rc3' of ↵Linus Torvalds2018-04-271-18/+0
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging fixes from Greg KH: "Here are two staging driver fixups for 4.17-rc3. The first is the remaining stragglers of the irda code removal that you pointed out during the merge window. The second is a fix for the wilc1000 driver due to a patch that got merged in 4.17-rc1. Both of these have been in linux-next for a while with no reported issues" * tag 'staging-4.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: staging: wilc1000: fix NULL pointer exception in host_int_parse_assoc_resp_info() staging: irda: remove remaining remants of irda code removal
| * | | | staging: irda: remove remaining remants of irda code removalGreg Kroah-Hartman2018-04-161-18/+0
| | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There were some documentation locations that irda was mentioned, as well as an old MAINTAINERS entry and the networking sysctl entries. Clean these all out as this stuff really is finally gone. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | | | virtio_balloon: add array of stat namesMichael S. Tsirkin2018-04-241-0/+15
| |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jason Wang points out that it's very hard for users to build an array of stat names. The naive thing is to use VIRTIO_BALLOON_S_NR but that breaks if we add more stats - as done e.g. recently by commit 6c64fe7f2 ("virtio_balloon: export hugetlb page allocation counts"). Let's add an array of reasonably readable names. Fixes: 6c64fe7f2 ("virtio_balloon: export hugetlb page allocation counts") Cc: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Jonathan Helman <jonathan.helman@oracle.com>
* | | Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds2018-04-221-3/+15
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "A larger set of updates for perf. Kernel: - Handle the SBOX uncore monitoring correctly on Broadwell CPUs which do not have SBOX. - Store context switch out type in PERF_RECORD_SWITCH[_CPU_WIDE]. The percentage of preempting and non-preempting context switches help understanding the nature of workloads (CPU or IO bound) that are running on a machine. This adds the kernel facility and userspace changes needed to show this information in 'perf script' and 'perf report -D' (Alexey Budankov) - Remove a WARN_ON() in the trace/kprobes code which is pointless because the return error code is already telling the caller what's wrong. - Revert a fugly workaround for clang BPF targets. - Fix sample_max_stack maximum check and do not proceed when an error has been detect, return them to avoid misidentifying errors (Jiri Olsa) - Add SPDX idenitifiers and get rid of GPL boilderplate. Tools: - Synchronize kernel ABI headers, v4.17-rc1 (Ingo Molnar) - Support MAP_FIXED_NOREPLACE, noticed when updating the tools/include/ copies (Arnaldo Carvalho de Melo) - Add '\n' at the end of parse-options error messages (Ravi Bangoria) - Add s390 support for detailed/verbose PMU event description (Thomas Richter) - perf annotate fixes and improvements: * Allow showing offsets in more than just jump targets, use the new 'O' hotkey in the TUI, config ~/.perfconfig annotate.offset_level for it and for --stdio2 (Arnaldo Carvalho de Melo) * Use the resolved variable names from objdump disassembled lines to make them more compact, just like was already done for some instructions, like "mov", this eventually will be done more generally, but lets now add some more to the existing mechanism (Arnaldo Carvalho de Melo) - perf record fixes: * Change warning for missing topology sysfs entry to debug, as not all architectures have those files, s390 being one of those (Thomas Richter) * Remove old error messages about things that unlikely to be the root cause in modern systems (Andi Kleen) - perf sched fixes: * Fix -g/--call-graph documentation (Takuya Yamamoto) - perf stat: * Enable 1ms interval for printing event counters values in (Alexey Budankov) - perf test fixes: * Run dwarf unwind on arm32 (Kim Phillips) * Remove unused ptrace.h include from LLVM test, sidesteping older clang's lack of support for some asm constructs (Arnaldo Carvalho de Melo) * Fixup BPF test using epoll_pwait syscall function probe, to cope with the syscall routines renames performed in this development cycle (Arnaldo Carvalho de Melo) - perf version fixes: * Do not print info about HAVE_LIBAUDIT_SUPPORT in 'perf version --build-options' when HAVE_SYSCALL_TABLE_SUPPORT is true, as libaudit won't be used in that case, print info about syscall_table support instead (Jin Yao) - Build system fixes: * Use HAVE_..._SUPPORT used consistently (Jin Yao) * Restore READ_ONCE() C++ compatibility in tools/include (Mark Rutland) * Give hints about package names needed to build jvmti (Arnaldo Carvalho de Melo)" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits) perf/x86/intel/uncore: Fix SBOX support for Broadwell CPUs perf/x86/intel/uncore: Revert "Remove SBOX support for Broadwell server" coresight: Move to SPDX identifier perf test BPF: Fixup BPF test using epoll_pwait syscall function probe perf tests mmap: Show which tracepoint is failing perf tools: Add '\n' at the end of parse-options error messages perf record: Remove suggestion to enable APIC perf record: Remove misleading error suggestion perf hists browser: Clarify top/report browser help perf mem: Allow all record/report options perf trace: Support MAP_FIXED_NOREPLACE perf: Remove superfluous allocation error check perf: Fix sample_max_stack maximum check perf: Return proper values for user stack errors perf list: Add s390 support for detailed/verbose PMU event description perf script: Extend misc field decoding with switch out event type perf report: Extend raw dump (-D) out with switch out event type perf/core: Store context switch out type in PERF_RECORD_SWITCH[_CPU_WIDE] tools/headers: Synchronize kernel ABI headers, v4.17-rc1 trace_kprobe: Remove warning message "Could not insert probe at..." ...
| * | | perf/core: Store context switch out type in PERF_RECORD_SWITCH[_CPU_WIDE]Alexey Budankov2018-04-171-3/+15
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Store preempting context switch out event into Perf trace as a part of PERF_RECORD_SWITCH[_CPU_WIDE] record. Percentage of preempting and non-preempting context switches help understanding the nature of workloads (CPU or IO bound) that are running on a machine; The event is treated as preemption one when task->state value of the thread being switched out is TASK_RUNNING. Event type encoding is implemented using PERF_RECORD_MISC_SWITCH_OUT_PREEMPT bit; Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lkml.kernel.org/r/9ff84e83-a0ca-dd82-a6d0-cb951689be74@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
* | | Merge tag 'random_for_linus_stable' of ↵Linus Torvalds2018-04-211-0/+3
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random Pull /dev/random fixes from Ted Ts'o: "Fix some bugs in the /dev/random driver which causes getrandom(2) to unblock earlier than designed. Thanks to Jann Horn from Google's Project Zero for pointing this out to me" * tag 'random_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: random: add new ioctl RNDRESEEDCRNG random: crng_reseed() should lock the crng instance that it is modifying random: set up the NUMA crng instances after the CRNG is fully initialized random: use a different mixing algorithm for add_device_randomness() random: fix crng_ready() test
| * | random: add new ioctl RNDRESEEDCRNGTheodore Ts'o2018-04-141-0/+3
| | | | | | | | | | | | | | | | | | | | | Add a new ioctl which forces the the crng to be reseeded. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
* | | Merge branch 'parisc-4.17-2' of ↵Linus Torvalds2018-04-121-1/+2
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux Pull parisc updates from Helge Deller: - fix panic when halting system via "shutdown -h now" - drop own coding in favour of generic CONFIG_COMPAT_BINFMT_ELF implementation - add FPE_CONDTRAP constant: last outstanding parisc-specific cleanup for Eric Biedermans siginfo patches - move some functions to .init and some to .text.hot linker sections * 'parisc-4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc: Prevent panic at system halt parisc: Switch to generic COMPAT_BINFMT_ELF parisc: Move cache flush functions into .text.hot section parisc/signal: Add FPE_CONDTRAP for conditional trap handling
| * | parisc/signal: Add FPE_CONDTRAP for conditional trap handlingHelge Deller2018-04-111-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | Posix and common sense requires that SI_USER not be a signal specific si_code. Thus add a new FPE_CONDTRAP si_code for conditional traps. Signed-off-by: Helge Deller <deller@gmx.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au>
* | | Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds2018-04-111-1/+3
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull virtio update from Michael Tsirkin: "This adds reporting hugepage stats to virtio-balloon" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: virtio_balloon: export hugetlb page allocation counts
| * | | virtio_balloon: export hugetlb page allocation countsJonathan Helman2018-04-101-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Export the number of successful and failed hugetlb page allocations via the virtio balloon driver. These 2 counts come directly from the vm_events HTLB_BUDDY_PGALLOC and HTLB_BUDDY_PGALLOC_FAIL. Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Jason Wang <jasowang@redhat.com>
* | | | linux/const.h: refactor _BITUL and _BITULL a bitMasahiro Yamada2018-04-111-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Minor cleanups available by _UL and _ULL. Link: http://lkml.kernel.org/r/1519301715-31798-5-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Russell King <rmk+kernel@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | linux/const.h: move UL() macro to include/linux/const.hMasahiro Yamada2018-04-111-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ARM, ARM64 and UniCore32 duplicate the definition of UL(): #define UL(x) _AC(x, UL) This is not actually arch-specific, so it will be useful to move it to a common header. Currently, we only have the uapi variant for linux/const.h, so I am creating include/linux/const.h. I also added _UL(), _ULL() and ULL() because _AC() is mostly used in the form either _AC(..., UL) or _AC(..., ULL). I expect they will be replaced in follow-up cleanups. The underscore-prefixed ones should be used for exported headers. Link: http://lkml.kernel.org/r/1519301715-31798-4-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Guan Xuetao <gxt@mprc.pku.edu.cn> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | linux/const.h: prefix include guard of uapi/linux/const.h with _UAPIMasahiro Yamada2018-04-111-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "linux/const.h: cleanups of macros such as UL(), _BITUL(), BIT() etc", v3. ARM, ARM64, UniCore32 define UL() as a shorthand of _AC(..., UL). More architectures may introduce it in the future. UL() is arch-agnostic, and useful. So let's move it to include/linux/const.h Currently, <asm/memory.h> must be included to use UL(). It pulls in more bloats just for defining some bit macros. I posted V2 one year ago. The previous posts are: https://patchwork.kernel.org/patch/9498273/ https://patchwork.kernel.org/patch/9498275/ https://patchwork.kernel.org/patch/9498269/ https://patchwork.kernel.org/patch/9498271/ At that time, what blocked this series was a comment from David Howells: You need to be very careful doing this. Some userspace stuff depends on the guard macro names on the kernel header files. (https://patchwork.kernel.org/patch/9498275/) Looking at the code closer, I noticed this is not a problem. See the following line. https://github.com/torvalds/linux/blob/v4.16-rc2/scripts/headers_install.sh#L40 scripts/headers_install.sh rips off _UAPI prefix from guard macro names. I ran "make headers_install" and confirmed the result is what I expect. So, we can prefix the include guard of include/uapi/linux/const.h, and add a new include/linux/const.h. This patch (of 4): I am going to add include/linux/const.h for the kernel space. Add _UAPI to the include guard of include/uapi/linux/const.h to prepare for that. Please notice the guard name of the exported one will be kept as-is. So, this commit has no impact to the userspace even if some userspace stuff depends on the guard macro names. scripts/headers_install.sh processes exported headers by SED, and rips off "_UAPI" from guard macro names. #ifndef _UAPI_LINUX_CONST_H #define _UAPI_LINUX_CONST_H will be turned into #ifndef _LINUX_CONST_H #define _LINUX_CONST_H Link: http://lkml.kernel.org/r/1519301715-31798-2-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: David Howells <dhowells@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | fs, elf: drop MAP_FIXED usage from elf_mapMichal Hocko2018-04-111-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both load_elf_interp and load_elf_binary rely on elf_map to map segments on a controlled address and they use MAP_FIXED to enforce that. This is however dangerous thing prone to silent data corruption which can be even exploitable. Let's take CVE-2017-1000253 as an example. At the time (before commit eab09532d400: "binfmt_elf: use ELF_ET_DYN_BASE only for PIE") ELF_ET_DYN_BASE was at TASK_SIZE / 3 * 2 which is not that far away from the stack top on 32b (legacy) memory layout (only 1GB away). Therefore we could end up mapping over the existing stack with some luck. The issue has been fixed since then (a87938b2e246: "fs/binfmt_elf.c: fix bug in loading of PIE binaries"), ELF_ET_DYN_BASE moved moved much further from the stack (eab09532d400 and later by c715b72c1ba4: "mm: revert x86_64 and arm64 ELF_ET_DYN_BASE base changes") and excessive stack consumption early during execve fully stopped by da029c11e6b1 ("exec: Limit arg stack to at most 75% of _STK_LIM"). So we should be safe and any attack should be impractical. On the other hand this is just too subtle assumption so it can break quite easily and hard to spot. I believe that the MAP_FIXED usage in load_elf_binary (et. al) is still fundamentally dangerous. Moreover it shouldn't be even needed. We are at the early process stage and so there shouldn't be unrelated mappings (except for stack and loader) existing so mmap for a given address should succeed even without MAP_FIXED. Something is terribly wrong if this is not the case and we should rather fail than silently corrupt the underlying mapping. Address this issue by changing MAP_FIXED to the newly added MAP_FIXED_NOREPLACE. This will mean that mmap will fail if there is an existing mapping clashing with the requested one without clobbering it. [mhocko@suse.com: fix build] [akpm@linux-foundation.org: coding-style fixes] [avagin@openvz.org: don't use the same value for MAP_FIXED_NOREPLACE and MAP_SYNC] Link: http://lkml.kernel.org/r/20171218184916.24445-1-avagin@openvz.org Link: http://lkml.kernel.org/r/20171213092550.2774-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Acked-by: Kees Cook <keescook@chromium.org> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Joel Stanley <joel@jms.id.au> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mm: introduce MAP_FIXED_NOREPLACEMichal Hocko2018-04-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "mm: introduce MAP_FIXED_NOREPLACE", v2. This has started as a follow up discussion [3][4] resulting in the runtime failure caused by hardening patch [5] which removes MAP_FIXED from the elf loader because MAP_FIXED is inherently dangerous as it might silently clobber an existing underlying mapping (e.g. stack). The reason for the failure is that some architectures enforce an alignment for the given address hint without MAP_FIXED used (e.g. for shared or file backed mappings). One way around this would be excluding those archs which do alignment tricks from the hardening [6]. The patch is really trivial but it has been objected, rightfully so, that this screams for a more generic solution. We basically want a non-destructive MAP_FIXED. The first patch introduced MAP_FIXED_NOREPLACE which enforces the given address but unlike MAP_FIXED it fails with EEXIST if the given range conflicts with an existing one. The flag is introduced as a completely new one rather than a MAP_FIXED extension because of the backward compatibility. We really want a never-clobber semantic even on older kernels which do not recognize the flag. Unfortunately mmap sucks wrt flags evaluation because we do not EINVAL on unknown flags. On those kernels we would simply use the traditional hint based semantic so the caller can still get a different address (which sucks) but at least not silently corrupt an existing mapping. I do not see a good way around that. Except we won't export expose the new semantic to the userspace at all. It seems there are users who would like to have something like that. Jemalloc has been mentioned by Michael Ellerman [7] Florian Weimer has mentioned the following: : glibc ld.so currently maps DSOs without hints. This means that the kernel : will map right next to each other, and the offsets between them a completely : predictable. We would like to change that and supply a random address in a : window of the address space. If there is a conflict, we do not want the : kernel to pick a non-random address. Instead, we would try again with a : random address. John Hubbard has mentioned CUDA example : a) Searches /proc/<pid>/maps for a "suitable" region of available : VA space. "Suitable" generally means it has to have a base address : within a certain limited range (a particular device model might : have odd limitations, for example), it has to be large enough, and : alignment has to be large enough (again, various devices may have : constraints that lead us to do this). : : This is of course subject to races with other threads in the process. : : Let's say it finds a region starting at va. : : b) Next it does: : p = mmap(va, ...) : : *without* setting MAP_FIXED, of course (so va is just a hint), to : attempt to safely reserve that region. If p != va, then in most cases, : this is a failure (almost certainly due to another thread getting a : mapping from that region before we did), and so this layer now has to : call munmap(), before returning a "failure: retry" to upper layers. : : IMPROVEMENT: --> if instead, we could call this: : : p = mmap(va, ... MAP_FIXED_NOREPLACE ...) : : , then we could skip the munmap() call upon failure. This : is a small thing, but it is useful here. (Thanks to Piotr : Jaroszynski and Mark Hairgrove for helping me get that detail : exactly right, btw.) : : c) After that, CUDA suballocates from p, via: : : q = mmap(sub_region_start, ... MAP_FIXED ...) : : Interestingly enough, "freeing" is also done via MAP_FIXED, and : setting PROT_NONE to the subregion. Anyway, I just included (c) for : general interest. Atomic address range probing in the multithreaded programs in general sounds like an interesting thing to me. The second patch simply replaces MAP_FIXED use in elf loader by MAP_FIXED_NOREPLACE. I believe other places which rely on MAP_FIXED should follow. Actually real MAP_FIXED usages should be docummented properly and they should be more of an exception. [1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko@kernel.org [2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko@kernel.org [3] http://lkml.kernel.org/r/20171107162217.382cd754@canb.auug.org.au [4] http://lkml.kernel.org/r/1510048229.12079.7.camel@abdul.in.ibm.com [5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko@kernel.org [6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y@dhcp22.suse.cz [7] http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au This patch (of 2): MAP_FIXED is used quite often to enforce mapping at the particular range. The main problem of this flag is, however, that it is inherently dangerous because it unmaps existing mappings covered by the requested range. This can cause silent memory corruptions. Some of them even with serious security implications. While the current semantic might be really desiderable in many cases there are others which would want to enforce the given range but rather see a failure than a silent memory corruption on a clashing range. Please note that there is no guarantee that a given range is obeyed by the mmap even when it is free - e.g. arch specific code is allowed to apply an alignment. Introduce a new MAP_FIXED_NOREPLACE flag for mmap to achieve this behavior. It has the same semantic as MAP_FIXED wrt. the given address request with a single exception that it fails with EEXIST if the requested address is already covered by an existing mapping. We still do rely on get_unmaped_area to handle all the arch specific MAP_FIXED treatment and check for a conflicting vma after it returns. The flag is introduced as a completely new one rather than a MAP_FIXED extension because of the backward compatibility. We really want a never-clobber semantic even on older kernels which do not recognize the flag. Unfortunately mmap sucks wrt. flags evaluation because we do not EINVAL on unknown flags. On those kernels we would simply use the traditional hint based semantic so the caller can still get a different address (which sucks) but at least not silently corrupt an existing mapping. I do not see a good way around that. [mpe@ellerman.id.au: fix whitespace] [fail on clashing range with EEXIST as per Florian Weimer] [set MAP_FIXED before round_hint_to_min as per Khalid Aziz] Link: http://lkml.kernel.org/r/20171213092550.2774-2-mhocko@kernel.org Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Russell King - ARM Linux <linux@armlinux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Joel Stanley <joel@jms.id.au> Cc: Kees Cook <keescook@chromium.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Jason Evans <jasone@google.com> Cc: David Goldblatt <davidtgoldblatt@gmail.com> Cc: Edward Tomasz Napierała <trasz@FreeBSD.org> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | ipc/msg: introduce msgctl(MSG_STAT_ANY)Davidlohr Bueso2018-04-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a permission discrepancy when consulting msq ipc object metadata between /proc/sysvipc/msg (0444) and the MSG_STAT shmctl command. The later does permission checks for the object vs S_IRUGO. As such there can be cases where EACCESS is returned via syscall but the info is displayed anyways in the procfs files. While this might have security implications via info leaking (albeit no writing to the msq metadata), this behavior goes way back and showing all the objects regardless of the permissions was most likely an overlook - so we are stuck with it. Furthermore, modifying either the syscall or the procfs file can cause userspace programs to break (ie ipcs). Some applications require getting the procfs info (without root privileges) and can be rather slow in comparison with a syscall -- up to 500x in some reported cases for shm. This patch introduces a new MSG_STAT_ANY command such that the msq ipc object permissions are ignored, and only audited instead. In addition, I've left the lsm security hook checks in place, as if some policy can block the call, then the user has no other choice than just parsing the procfs file. Link: http://lkml.kernel.org/r/20180215162458.10059-4-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reported-by: Robert Kettler <robert.kettler@outlook.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | ipc/sem: introduce semctl(SEM_STAT_ANY)Davidlohr Bueso2018-04-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a permission discrepancy when consulting shm ipc object metadata between /proc/sysvipc/sem (0444) and the SEM_STAT semctl command. The later does permission checks for the object vs S_IRUGO. As such there can be cases where EACCESS is returned via syscall but the info is displayed anyways in the procfs files. While this might have security implications via info leaking (albeit no writing to the sma metadata), this behavior goes way back and showing all the objects regardless of the permissions was most likely an overlook - so we are stuck with it. Furthermore, modifying either the syscall or the procfs file can cause userspace programs to break (ie ipcs). Some applications require getting the procfs info (without root privileges) and can be rather slow in comparison with a syscall -- up to 500x in some reported cases for shm. This patch introduces a new SEM_STAT_ANY command such that the sem ipc object permissions are ignored, and only audited instead. In addition, I've left the lsm security hook checks in place, as if some policy can block the call, then the user has no other choice than just parsing the procfs file. Link: http://lkml.kernel.org/r/20180215162458.10059-3-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reported-by: Robert Kettler <robert.kettler@outlook.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | ipc/shm: introduce shmctl(SHM_STAT_ANY)Davidlohr Bueso2018-04-111-2/+3
| |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "sysvipc: introduce STAT_ANY commands", v2. The following patches adds the discussed (see [1]) new command for shm as well as for sems and msq as they are subject to the same discrepancies for ipc object permission checks between the syscall and via procfs. These new commands are justified in that (1) we are stuck with this semantics as changing syscall and procfs can break userland; and (2) some users can benefit from performance (for large amounts of shm segments, for example) from not having to parse the procfs interface. Once merged, I will submit the necesary manpage updates. But I'm thinking something like: : diff --git a/man2/shmctl.2 b/man2/shmctl.2 : index 7bb503999941..bb00bbe21a57 100644 : --- a/man2/shmctl.2 : +++ b/man2/shmctl.2 : @@ -41,6 +41,7 @@ : .\" 2005-04-25, mtk -- noted aberrant Linux behavior w.r.t. new : .\" attaches to a segment that has already been marked for deletion. : .\" 2005-08-02, mtk: Added IPC_INFO, SHM_INFO, SHM_STAT descriptions. : +.\" 2018-02-13, dbueso: Added SHM_STAT_ANY description. : .\" : .TH SHMCTL 2 2017-09-15 "Linux" "Linux Programmer's Manual" : .SH NAME : @@ -242,6 +243,18 @@ However, the : argument is not a segment identifier, but instead an index into : the kernel's internal array that maintains information about : all shared memory segments on the system. : +.TP : +.BR SHM_STAT_ANY " (Linux-specific)" : +Return a : +.I shmid_ds : +structure as for : +.BR SHM_STAT . : +However, the : +.I shm_perm.mode : +is not checked for read access for : +.IR shmid , : +resembing the behaviour of : +/proc/sysvipc/shm. : .PP : The caller can prevent or allow swapping of a shared : memory segment with the following \fIcmd\fP values: : @@ -287,7 +300,7 @@ operation returns the index of the highest used entry in the : kernel's internal array recording information about all : shared memory segments. : (This information can be used with repeated : -.B SHM_STAT : +.B SHM_STAT/SHM_STAT_ANY : operations to obtain information about all shared memory segments : on the system.) : A successful : @@ -328,7 +341,7 @@ isn't accessible. : \fIshmid\fP is not a valid identifier, or \fIcmd\fP : is not a valid command. : Or: for a : -.B SHM_STAT : +.B SHM_STAT/SHM_STAT_ANY : operation, the index value specified in : .I shmid : referred to an array slot that is currently unused. This patch (of 3): There is a permission discrepancy when consulting shm ipc object metadata between /proc/sysvipc/shm (0444) and the SHM_STAT shmctl command. The later does permission checks for the object vs S_IRUGO. As such there can be cases where EACCESS is returned via syscall but the info is displayed anyways in the procfs files. While this might have security implications via info leaking (albeit no writing to the shm metadata), this behavior goes way back and showing all the objects regardless of the permissions was most likely an overlook - so we are stuck with it. Furthermore, modifying either the syscall or the procfs file can cause userspace programs to break (ie ipcs). Some applications require getting the procfs info (without root privileges) and can be rather slow in comparison with a syscall -- up to 500x in some reported cases. This patch introduces a new SHM_STAT_ANY command such that the shm ipc object permissions are ignored, and only audited instead. In addition, I've left the lsm security hook checks in place, as if some policy can block the call, then the user has no other choice than just parsing the procfs file. [1] https://lkml.org/lkml/2017/12/19/220 Link: http://lkml.kernel.org/r/20180215162458.10059-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Robert Kettler <robert.kettler@outlook.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2018-04-091-2/+21
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm updates from Paolo Bonzini: "ARM: - VHE optimizations - EL2 address space randomization - speculative execution mitigations ("variant 3a", aka execution past invalid privilege register access) - bugfixes and cleanups PPC: - improvements for the radix page fault handler for HV KVM on POWER9 s390: - more kvm stat counters - virtio gpu plumbing - documentation - facilities improvements x86: - support for VMware magic I/O port and pseudo-PMCs - AMD pause loop exiting - support for AMD core performance extensions - support for synchronous register access - expose nVMX capabilities to userspace - support for Hyper-V signaling via eventfd - use Enlightened VMCS when running on Hyper-V - allow userspace to disable MWAIT/HLT/PAUSE vmexits - usual roundup of optimizations and nested virtualization bugfixes Generic: - API selftest infrastructure (though the only tests are for x86 as of now)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (174 commits) kvm: x86: fix a prototype warning kvm: selftests: add sync_regs_test kvm: selftests: add API testing infrastructure kvm: x86: fix a compile warning KVM: X86: Add Force Emulation Prefix for "emulate the next instruction" KVM: X86: Introduce handle_ud() KVM: vmx: unify adjacent #ifdefs x86: kvm: hide the unused 'cpu' variable KVM: VMX: remove bogus WARN_ON in handle_ept_misconfig Revert "KVM: X86: Fix SMRAM accessing even if VM is shutdown" kvm: Add emulation for movups/movupd KVM: VMX: raise internal error for exception during invalid protected mode state KVM: nVMX: Optimization: Dont set KVM_REQ_EVENT when VMExit with nested_run_pending KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending KVM: x86: Fix misleading comments on handling pending exceptions KVM: x86: Rename interrupt.pending to interrupt.injected KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt x86/kvm: use Enlightened VMCS when running on Hyper-V x86/hyper-v: detect nested features x86/hyper-v: define struct hv_enlightened_vmcs and clean field bits ...
| * | | KVM: X86: Provide a capability to disable MWAIT interceptsWanpeng Li2018-03-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allowing a guest to execute MWAIT without interception enables a guest to put a (physical) CPU into a power saving state, where it takes longer to return from than what may be desired by the host. Don't give a guest that power over a host by default. (Especially, since nothing prevents a guest from using MWAIT even when it is not advertised via CPUID.) Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | KVM: x86: add SYNC_REGS_SIZE_BYTES #define.Ken Hofsass2018-03-061-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace hardcoded padding size value for struct kvm_sync_regs with #define SYNC_REGS_SIZE_BYTES. Also update the value specified in api.txt from outdated hardcoded value to SYNC_REGS_SIZE_BYTES. Signed-off-by: Ken Hofsass <hofsass@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * | | kvm: x86: hyperv: guest->host event signaling via eventfdRoman Kagan2018-03-061-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In Hyper-V, the fast guest->host notification mechanism is the SIGNAL_EVENT hypercall, with a single parameter of the connection ID to signal. Currently this hypercall incurs a user exit and requires the userspace to decode the parameters and trigger the notification of the potentially different I/O context. To avoid the costly user exit, process this hypercall and signal the corresponding eventfd in KVM, similar to ioeventfd. The association between the connection id and the eventfd is established via the newly introduced KVM_HYPERV_EVENTFD ioctl, and maintained in an (srcu-protected) IDR. Signed-off-by: Roman Kagan <rkagan@virtuozzo.com> Reviewed-by: David Hildenbrand <david@redhat.com> [asm/hyperv.h changes approved by KY Srinivasan. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
* | | | Merge tag 'vfio-v4.17-rc1' of git://github.com/awilliam/linux-vfioLinus Torvalds2018-04-061-0/+27
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull VFIO updates from Alex Williamson: - Adopt iommu_unmap_fast() interface to type1 backend (Suravee Suthikulpanit) - mdev sample driver fixup (Shunyong Yang) - More efficient PFN mapping handling in type1 backend (Jason Cai) - VFIO device ioeventfd interface (Alex Williamson) - Tag new vfio-platform sub-maintainer (Alex Williamson) * tag 'vfio-v4.17-rc1' of git://github.com/awilliam/linux-vfio: MAINTAINERS: vfio/platform: Update sub-maintainer vfio/pci: Add ioeventfd support vfio/pci: Use endian neutral helpers vfio/pci: Pull BAR mapping setup from read-write path vfio/type1: Improve memory pinning process for raw PFN mapping vfio-mdev/samples: change RDI interrupt condition vfio/type1: Adopt fast IOTLB flush interface when unmap IOVAs
| * | | | vfio/pci: Add ioeventfd supportAlex Williamson2018-03-261-0/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ioeventfd here is actually irqfd handling of an ioeventfd such as supported in KVM. A user is able to pre-program a device write to occur when the eventfd triggers. This is yet another instance of eventfd-irqfd triggering between KVM and vfio. The impetus for this is high frequency writes to pages which are virtualized in QEMU. Enabling this near-direct write path for selected registers within the virtualized page can improve performance and reduce overhead. Specifically this is initially targeted at NVIDIA graphics cards where the driver issues a write to an MMIO register within a virtualized region in order to allow the MSI interrupt to re-trigger. Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* | | | | Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds2018-04-061-0/+97
|\ \ \ \ \ | | |_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull fw_cfg, vhost updates from Michael Tsirkin: "This cleans up the qemu fw cfg device driver. On top of this, vmcore is dumped there on crash to help debugging with kASLR enabled. Also included are some fixes in vhost" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vhost: add vsock compat ioctl vhost: fix vhost ioctl signature to build with clang fw_cfg: write vmcoreinfo details crash: export paddr_vmcoreinfo_note() fw_cfg: add DMA register fw_cfg: add a public uapi header fw_cfg: handle fw_cfg_read_blob() error fw_cfg: remove inline from fw_cfg_read_blob() fw_cfg: fix sparse warnings around FW_CFG_FILE_DIR read fw_cfg: fix sparse warning reading FW_CFG_ID fw_cfg: fix sparse warnings with fw_cfg_file fw_cfg: fix sparse warnings in fw_cfg_sel_endianness() ptr_ring: fix build
| * | | | fw_cfg: write vmcoreinfo detailsMarc-André Lureau2018-03-201-0/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the "etc/vmcoreinfo" fw_cfg file is present and we are not running the kdump kernel, write the addr/size of the vmcoreinfo ELF note. The DMA operation is expected to run synchronously with today qemu, but the specification states that it may become async, so we run "control" field check in a loop for eventual changes. Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
| * | | | fw_cfg: add a public uapi headerMarc-André Lureau2018-03-201-0/+66
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Create a common header file for well-known values and structures to be shared by the Linux kernel with qemu or other projects. It is based from qemu/docs/specs/fw_cfg.txt which references qemu/include/hw/nvram/fw_cfg_keys.h "for the most up-to-date and authoritative list" & vmcoreinfo.txt. Those files don't have an explicit license, but qemu/hw/nvram/fw_cfg.c is BSD-license, so Michael S. Tsirkin suggested to use the same license. The patch intentionally left out DMA & vmcoreinfo structures & defines, which are added in the commits making usage of it. Suggested-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* | | | Merge tag 'pci-v4.17-changes' of ↵Linus Torvalds2018-04-061-2/+5
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: - move pci_uevent_ers() out of pci.h (Michael Ellerman) - skip ASPM common clock warning if BIOS already configured it (Sinan Kaya) - fix ASPM Coverity warning about threshold_ns (Gustavo A. R. Silva) - remove last user of pci_get_bus_and_slot() and the function itself (Sinan Kaya) - add decoding for 16 GT/s link speed (Jay Fang) - add interfaces to get max link speed and width (Tal Gilboa) - add pcie_bandwidth_capable() to compute max supported link bandwidth (Tal Gilboa) - add pcie_bandwidth_available() to compute bandwidth available to device (Tal Gilboa) - add pcie_print_link_status() to log link speed and whether it's limited (Tal Gilboa) - use PCI core interfaces to report when device performance may be limited by its slot instead of doing it in each driver (Tal Gilboa) - fix possible cpqphp NULL pointer dereference (Shawn Lin) - rescan more of the hierarchy on ACPI hotplug to fix Thunderbolt/xHCI hotplug (Mika Westerberg) - add support for PCI I/O port space that's neither directly accessible via CPU in/out instructions nor directly mapped into CPU physical memory space. This is fairly intrusive and includes minor changes to interfaces used for I/O space on most platforms (Zhichang Yuan, John Garry) - add support for HiSilicon Hip06/Hip07 LPC I/O space (Zhichang Yuan, John Garry) - use PCI_EXP_DEVCTL2_COMP_TIMEOUT in rapidio/tsi721 (Bjorn Helgaas) - remove possible NULL pointer dereference in of_pci_bus_find_domain_nr() (Shawn Lin) - report quirk timings with dev_info (Bjorn Helgaas) - report quirks that take longer than 10ms (Bjorn Helgaas) - add and use Altera Vendor ID (Johannes Thumshirn) - tidy Makefiles and comments (Bjorn Helgaas) - don't set up INTx if MSI or MSI-X is enabled to align cris, frv, ia64, and mn10300 with x86 (Bjorn Helgaas) - move pcieport_if.h to drivers/pci/pcie/ to encapsulate it (Frederick Lawler) - merge pcieport_if.h into portdrv.h (Bjorn Helgaas) - move workaround for BIOS PME issue from portdrv to PCI core (Bjorn Helgaas) - completely disable portdrv with "pcie_ports=compat" (Bjorn Helgaas) - remove portdrv link order dependency (Bjorn Helgaas) - remove support for unused VC portdrv service (Bjorn Helgaas) - simplify portdrv feature permission checking (Bjorn Helgaas) - remove "pcie_hp=nomsi" parameter (use "pci=nomsi" instead) (Bjorn Helgaas) - remove unnecessary "pcie_ports=auto" parameter (Bjorn Helgaas) - use cached AER capability offset (Frederick Lawler) - don't enable DPC if BIOS hasn't granted AER control (Mika Westerberg) - rename pcie-dpc.c to dpc.c (Bjorn Helgaas) - use generic pci_mmap_resource_range() instead of powerpc and xtensa arch-specific versions (David Woodhouse) - support arbitrary PCI host bridge offsets on sparc (Yinghai Lu) - remove System and Video ROM reservations on sparc (Bjorn Helgaas) - probe for device reset support during enumeration instead of runtime (Bjorn Helgaas) - add ACS quirk for Ampere (née APM) root ports (Feng Kan) - add function 1 DMA alias quirk for Marvell 88SE9220 (Thomas Vincent-Cross) - protect device restore with device lock (Sinan Kaya) - handle failure of FLR gracefully (Sinan Kaya) - handle CRS (config retry status) after device resets (Sinan Kaya) - skip various config reads for SR-IOV VFs as an optimization (KarimAllah Ahmed) - consolidate VPD code in vpd.c (Bjorn Helgaas) - add Tegra dependency on PCI_MSI_IRQ_DOMAIN (Arnd Bergmann) - add DT support for R-Car r8a7743 (Biju Das) - fix a PCI_EJECT vs PCI_BUS_RELATIONS race condition in Hyper-V host bridge driver that causes a general protection fault (Dexuan Cui) - fix Hyper-V host bridge hang in MSI setup on 1-vCPU VMs with SR-IOV (Dexuan Cui) - fix Hyper-V host bridge hang when ejecting a VF before setting up MSI (Dexuan Cui) - make several structures static (Fengguang Wu) - increase number of MSI IRQs supported by Synopsys DesignWare bridges from 32 to 256 (Gustavo Pimentel) - implemented multiplexed IRQ domain API and remove obsolete MSI IRQ API from DesignWare drivers (Gustavo Pimentel) - add Tegra power management support (Manikanta Maddireddy) - add Tegra loadable module support (Manikanta Maddireddy) - handle 64-bit BARs correctly in endpoint support (Niklas Cassel) - support optional regulator for HiSilicon STB (Shawn Guo) - use regulator bulk API for Qualcomm apq8064 (Srinivas Kandagatla) - support power supplies for Qualcomm msm8996 (Srinivas Kandagatla) * tag 'pci-v4.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (123 commits) MAINTAINERS: Add John Garry as maintainer for HiSilicon LPC driver HISI LPC: Add ACPI support ACPI / scan: Do not enumerate Indirect IO host children ACPI / scan: Rename acpi_is_serial_bus_slave() for more general use HISI LPC: Support the LPC host on Hip06/Hip07 with DT bindings of: Add missing I/O range exception for indirect-IO devices PCI: Apply the new generic I/O management on PCI IO hosts PCI: Add fwnode handler as input param of pci_register_io_range() PCI: Remove __weak tag from pci_register_io_range() MAINTAINERS: Add missing /drivers/pci/cadence directory entry fm10k: Report PCIe link properties with pcie_print_link_status() net/mlx5e: Use pcie_bandwidth_available() to compute bandwidth net/mlx5: Report PCIe link properties with pcie_print_link_status() net/mlx4_core: Report PCIe link properties with pcie_print_link_status() PCI: Add pcie_print_link_status() to log link speed and whether it's limited PCI: Add pcie_bandwidth_available() to compute bandwidth available to device misc: pci_endpoint_test: Handle 64-bit BARs properly PCI: designware-ep: Make dw_pcie_ep_reset_bar() handle 64-bit BARs properly PCI: endpoint: Make sure that BAR_5 does not have 64-bit flag set when clearing PCI: endpoint: Make epc->ops->clear_bar()/pci_epc_clear_bar() take struct *epf_bar ...
| * | | | PCI: Add decoding for 16 GT/s link speedJay Fang2018-03-211-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PCIe 4.0 defines the 16.0 GT/s link speed. Links can run at that speed without any Linux changes, but previously their sysfs "max_link_speed" and "current_link_speed" files contained "Unknown speed", not the expected "16.0 GT/s". Add decoding for the new 16 GT/s link speed. Signed-off-by: Jay Fang <f.fangjian@huawei.com> [bhelgaas: add PCI_EXP_LNKCAP2_SLS_16_0GB] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Dongdong Liu <liudongdong3@huawei.com>
* | | | | Merge tag 'for-linus-unmerged' of ↵Linus Torvalds2018-04-0626-355/+991
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull rdma updates from Jason Gunthorpe: "Doug and I are at a conference next week so if another PR is sent I expect it to only be bug fixes. Parav noted yesterday that there are some fringe case behavior changes in his work that he would like to fix, and I see that Intel has a number of rc looking patches for HFI1 they posted yesterday. Parav is again the biggest contributor by patch count with his ongoing work to enable container support in the RDMA stack, followed by Leon doing syzkaller inspired cleanups, though most of the actual fixing went to RC. There is one uncomfortable series here fixing the user ABI to actually work as intended in 32 bit mode. There are lots of notes in the commit messages, but the basic summary is we don't think there is an actual 32 bit kernel user of drivers/infiniband for several good reasons. However we are seeing people want to use a 32 bit user space with 64 bit kernel, which didn't completely work today. So in fixing it we required a 32 bit rxe user to upgrade their userspace. rxe users are still already quite rare and we think a 32 bit one is non-existing. - Fix RDMA uapi headers to actually compile in userspace and be more complete - Three shared with netdev pull requests from Mellanox: * 7 patches, mostly to net with 1 IB related one at the back). This series addresses an IRQ performance issue (patch 1), cleanups related to the fix for the IRQ performance problem (patches 2-6), and then extends the fragmented completion queue support that already exists in the net side of the driver to the ib side of the driver (patch 7). * Mostly IB, with 5 patches to net that are needed to support the remaining 10 patches to the IB subsystem. This series extends the current 'representor' framework when the mlx5 driver is in switchdev mode from being a netdev only construct to being a netdev/IB dev construct. The IB dev is limited to raw Eth queue pairs only, but by having an IB dev of this type attached to the representor for a switchdev port, it enables DPDK to work on the switchdev device. * All net related, but needed as infrastructure for the rdma driver - Updates for the hns, i40iw, bnxt_re, cxgb3, cxgb4, hns drivers - SRP performance updates - IB uverbs write path cleanup patch series from Leon - Add RDMA_CM support to ib_srpt. This is disabled by default. Users need to set the port for ib_srpt to listen on in configfs in order for it to be enabled (/sys/kernel/config/target/srpt/discovery_auth/rdma_cm_port) - TSO and Scatter FCS support in mlx4 - Refactor of modify_qp routine to resolve problems seen while working on new code that is forthcoming - More refactoring and updates of RDMA CM for containers support from Parav - mlx5 'fine grained packet pacing', 'ipsec offload' and 'device memory' user API features - Infrastructure updates for the new IOCTL interface, based on increased usage - ABI compatibility bug fixes to fully support 32 bit userspace on 64 bit kernel as was originally intended. See the commit messages for extensive details - Syzkaller bugs and code cleanups motivated by them" * tag 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (199 commits) IB/rxe: Fix for oops in rxe_register_device on ppc64le arch IB/mlx5: Device memory mr registration support net/mlx5: Mkey creation command adjustments IB/mlx5: Device memory support in mlx5_ib net/mlx5: Query device memory capabilities IB/uverbs: Add device memory registration ioctl support IB/uverbs: Add alloc/free dm uverbs ioctl support IB/uverbs: Add device memory capabilities reporting IB/uverbs: Expose device memory capabilities to user RDMA/qedr: Fix wmb usage in qedr IB/rxe: Removed GID add/del dummy routines RDMA/qedr: Zero stack memory before copying to user space IB/mlx5: Add ability to hash by IPSEC_SPI when creating a TIR IB/mlx5: Add information for querying IPsec capabilities IB/mlx5: Add IPsec support for egress and ingress {net,IB}/mlx5: Add ipsec helper IB/mlx5: Add modify_flow_action_esp verb IB/mlx5: Add implementation for create and destroy action_xfrm IB/uverbs: Introduce ESP steering match filter IB/uverbs: Add modify ESP flow_action ...