summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* KVM: async_pf: avoid async pf injection when in guest modeWanpeng Li2017-06-143-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 9bc1f09f6fa76fdf31eb7d6a4a4df43574725f93 upstream. INFO: task gnome-terminal-:1734 blocked for more than 120 seconds. Not tainted 4.12.0-rc4+ #8 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. gnome-terminal- D 0 1734 1015 0x00000000 Call Trace: __schedule+0x3cd/0xb30 schedule+0x40/0x90 kvm_async_pf_task_wait+0x1cc/0x270 ? __vfs_read+0x37/0x150 ? prepare_to_swait+0x22/0x70 do_async_page_fault+0x77/0xb0 ? do_async_page_fault+0x77/0xb0 async_page_fault+0x28/0x30 This is triggered by running both win7 and win2016 on L1 KVM simultaneously, and then gives stress to memory on L1, I can observed this hang on L1 when at least ~70% swap area is occupied on L0. This is due to async pf was injected to L2 which should be injected to L1, L2 guest starts receiving pagefault w/ bogus %cr2(apf token from the host actually), and L1 guest starts accumulating tasks stuck in D state in kvm_async_pf_task_wait() since missing PAGE_READY async_pfs. This patch fixes the hang by doing async pf when executing L1 guest. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* arm: KVM: Allow unaligned accesses at HYPMarc Zyngier2017-06-141-3/+2
| | | | | | | | | | | | | | | commit 33b5c38852b29736f3b472dd095c9a18ec22746f upstream. We currently have the HSCTLR.A bit set, trapping unaligned accesses at HYP, but we're not really prepared to deal with it. Since the rest of the kernel is pretty happy about that, let's follow its example and set HSCTLR.A to zero. Modern CPUs don't really care. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <cdall@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* arm64: KVM: Allow unaligned accesses at EL2Marc Zyngier2017-06-141-2/+3
| | | | | | | | | | | | | | | | | | | commit 78fd6dcf11468a5a131b8365580d0c613bcc02cb upstream. We currently have the SCTLR_EL2.A bit set, trapping unaligned accesses at EL2, but we're not really prepared to deal with it. So far, this has been unnoticed, until GCC 7 started emitting those (in particular 64bit writes on a 32bit boundary). Since the rest of the kernel is pretty happy about that, let's follow its example and set SCTLR_EL2.A to zero. Modern CPUs don't really care. Reported-by: Alexander Graf <agraf@suse.de> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <cdall@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* arm64: KVM: Preserve RES1 bits in SCTLR_EL2Marc Zyngier2017-06-142-4/+10
| | | | | | | | | | | | | | | | commit d68c1f7fd1b7148dab5fe658321d511998969f2d upstream. __do_hyp_init has the rather bad habit of ignoring RES1 bits and writing them back as zero. On a v8.0-8.2 CPU, this doesn't do anything bad, but may end-up being pretty nasty on future revisions of the architecture. Let's preserve those bits so that we don't have to fix this later on. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <cdall@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KVM: cpuid: Fix read/write out-of-bounds vulnerability in cpuid emulationWanpeng Li2017-06-141-9/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a3641631d14571242eec0d30c9faa786cbf52d44 upstream. If "i" is the last element in the vcpu->arch.cpuid_entries[] array, it potentially can be exploited the vulnerability. this will out-of-bounds read and write. Luckily, the effect is small: /* when no next entry is found, the current entry[i] is reselected */ for (j = i + 1; ; j = (j + 1) % nent) { struct kvm_cpuid_entry2 *ej = &vcpu->arch.cpuid_entries[j]; if (ej->function == e->function) { It reads ej->maxphyaddr, which is user controlled. However... ej->flags |= KVM_CPUID_FLAG_STATE_READ_NEXT; After cpuid_entries there is int maxphyaddr; struct x86_emulate_ctxt emulate_ctxt; /* 16-byte aligned */ So we have: - cpuid_entries at offset 1B50 (6992) - maxphyaddr at offset 27D0 (6992 + 3200 = 10192) - padding at 27D4...27DF - emulate_ctxt at 27E0 And it writes in the padding. Pfew, writing the ops field of emulate_ctxt would have been much worse. This patch fixes it by modding the index to avoid the out-of-bounds access. Worst case, i == j and ej->function == e->function, the loop can bail out. Reported-by: Moguofang <moguofang@huawei.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Guofang Mo <moguofang@huawei.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kvm: async_pf: fix rcu_irq_enter() with irqs enabledPaolo Bonzini2017-06-141-1/+1
| | | | | | | | | | | | | | | | | commit bbaf0e2b1c1b4f88abd6ef49576f0efb1734eae5 upstream. native_safe_halt enables interrupts, and you just shouldn't call rcu_irq_enter() with interrupts enabled. Reorder the call with the following local_irq_disable() to respect the invariant. Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* efi: Don't issue error message when booted under XenJuergen Gross2017-06-141-0/+3
| | | | | | | | | | | | | | | | | | | | | commit 1ea34adb87c969b89dfd83f1905a79161e9ada26 upstream. When booted as Xen dom0 there won't be an EFI memmap allocated. Avoid issuing an error message in this case: [ 0.144079] efi: Failed to allocate new EFI memmap Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/20170526113652.21339-2-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* nfsd: Fix up the "supattr_exclcreat" attributesTrond Myklebust2017-06-141-3/+10
| | | | | | | | | | | | | commit b26b78cb726007533d81fdf90a62e915002ef5c8 upstream. If an NFSv4 client asks us for the supattr_exclcreat, then we must not return attributes that are unsupported by this minor version. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Fixes: 75976de6556f ("NFSD: Return word2 bitmask if setting security..,") Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* nfsd4: fix null dereference on replayJ. Bruce Fields2017-06-141-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 9a307403d374b993061f5992a6e260c944920d0b upstream. if we receive a compound such that: - the sessionid, slot, and sequence number in the SEQUENCE op match a cached succesful reply with N ops, and - the Nth operation of the compound is a PUTFH, PUTPUBFH, PUTROOTFH, or RESTOREFH, then nfsd4_sequence will return 0 and set cstate->status to nfserr_replay_cache. The current filehandle will not be set. This will cause us to call check_nfsd_access with first argument NULL. To nfsd4_compound it looks like we just succesfully executed an operation that set a filehandle, but the current filehandle is not set. Fix this by moving the nfserr_replay_cache earlier. There was never any reason to have it after the encode_op label, since the only case where he hit that is when opdesc->op_func sets it. Note that there are two ways we could hit this case: - a client is resending a previously sent compound that ended with one of the four PUTFH-like operations, or - a client is sending a *new* compound that (incorrectly) shares sessionid, slot, and sequence number with a previously sent compound, and the length of the previously sent compound happens to match the position of a PUTFH-like operation in the new compound. The second is obviously incorrect client behavior. The first is also very strange--the only purpose of a PUTFH-like operation is to set the current filehandle to be used by the following operation, so there's no point in having it as the last in a compound. So it's likely this requires a buggy or malicious client to reproduce. Reported-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* drm/amdgpu/ci: disable mclk switching for high refresh rates (v2)Alex Deucher2017-06-141-0/+6
| | | | | | | | | | | | | | | | commit 0a646f331db0eb9efc8d3a95a44872036d441d58 upstream. Even if the vblank period would allow it, it still seems to be problematic on some cards. v2: fix logic inversion (Nils) bug: https://bugs.freedesktop.org/show_bug.cgi?id=96868 Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* crypto: gcm - wait for crypto op not signal safeGilad Ben-Yossef2017-06-141-4/+2
| | | | | | | | | | | | | | | | | commit f3ad587070d6bd961ab942b3fd7a85d00dfc934b upstream. crypto_gcm_setkey() was using wait_for_completion_interruptible() to wait for completion of async crypto op but if a signal occurs it may return before DMA ops of HW crypto provider finish, thus corrupting the data buffer that is kfree'ed in this case. Resolve this by using wait_for_completion() instead. Reported-by: Eric Biggers <ebiggers3@gmail.com> Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* crypto: drbg - wait for crypto op not signal safeGilad Ben-Yossef2017-06-141-3/+2
| | | | | | | | | | | | | | | | | commit a5dfefb1c3f3db81662556393fd9283511e08430 upstream. drbg_kcapi_sym_ctr() was using wait_for_completion_interruptible() to wait for completion of async crypto op but if a signal occurs it may return before DMA ops of HW crypto provider finish, thus corrupting the output buffer. Resolve this by using wait_for_completion() instead. Reported-by: Eric Biggers <ebiggers3@gmail.com> Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KEYS: encrypted: avoid encrypting/decrypting stack buffersEric Biggers2017-06-141-8/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e9ff56ac352446f55141aaef1553cee662b2e310 upstream. Since v4.9, the crypto API cannot (normally) be used to encrypt/decrypt stack buffers because the stack may be virtually mapped. Fix this for the padding buffers in encrypted-keys by using ZERO_PAGE for the encryption padding and by allocating a temporary heap buffer for the decryption padding. Tested with CONFIG_DEBUG_SG=y: keyctl new_session keyctl add user master "abcdefghijklmnop" @s keyid=$(keyctl add encrypted desc "new user:master 25" @s) datablob="$(keyctl pipe $keyid)" keyctl unlink $keyid keyid=$(keyctl add encrypted desc "load $datablob" @s) datablob2="$(keyctl pipe $keyid)" [ "$datablob" = "$datablob2" ] && echo "Success!" Cc: Andy Lutomirski <luto@kernel.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Mimi Zohar <zohar@linux.vnet.ibm.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <james.l.morris@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KEYS: fix freeing uninitialized memory in key_update()Eric Biggers2017-06-141-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 63a0b0509e700717a59f049ec6e4e04e903c7fe2 upstream. key_update() freed the key_preparsed_payload even if it was not initialized first. This would cause a crash if userspace called keyctl_update() on a key with type like "asymmetric" that has a ->preparse() method but not an ->update() method. Possibly it could even be triggered for other key types by racing with keyctl_setperm() to make the KEY_NEED_WRITE check fail (the permission was already checked, so normally it wouldn't fail there). Reproducer with key type "asymmetric", given a valid cert.der: keyctl new_session keyid=$(keyctl padd asymmetric desc @s < cert.der) keyctl setperm $keyid 0x3f000000 keyctl update $keyid data [ 150.686666] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001 [ 150.687601] IP: asymmetric_key_free_kids+0x12/0x30 [ 150.688139] PGD 38a3d067 [ 150.688141] PUD 3b3de067 [ 150.688447] PMD 0 [ 150.688745] [ 150.689160] Oops: 0000 [#1] SMP [ 150.689455] Modules linked in: [ 150.689769] CPU: 1 PID: 2478 Comm: keyctl Not tainted 4.11.0-rc4-xfstests-00187-ga9f6b6b8cd2f #742 [ 150.690916] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 [ 150.692199] task: ffff88003b30c480 task.stack: ffffc90000350000 [ 150.692952] RIP: 0010:asymmetric_key_free_kids+0x12/0x30 [ 150.693556] RSP: 0018:ffffc90000353e58 EFLAGS: 00010202 [ 150.694142] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000004 [ 150.694845] RDX: ffffffff81ee3920 RSI: ffff88003d4b0700 RDI: 0000000000000001 [ 150.697569] RBP: ffffc90000353e60 R08: ffff88003d5d2140 R09: 0000000000000000 [ 150.702483] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001 [ 150.707393] R13: 0000000000000004 R14: ffff880038a4d2d8 R15: 000000000040411f [ 150.709720] FS: 00007fcbcee35700(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000 [ 150.711504] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 150.712733] CR2: 0000000000000001 CR3: 0000000039eab000 CR4: 00000000003406e0 [ 150.714487] Call Trace: [ 150.714975] asymmetric_key_free_preparse+0x2f/0x40 [ 150.715907] key_update+0xf7/0x140 [ 150.716560] ? key_default_cmp+0x20/0x20 [ 150.717319] keyctl_update_key+0xb0/0xe0 [ 150.718066] SyS_keyctl+0x109/0x130 [ 150.718663] entry_SYSCALL_64_fastpath+0x1f/0xc2 [ 150.719440] RIP: 0033:0x7fcbce75ff19 [ 150.719926] RSP: 002b:00007ffd5d167088 EFLAGS: 00000206 ORIG_RAX: 00000000000000fa [ 150.720918] RAX: ffffffffffffffda RBX: 0000000000404d80 RCX: 00007fcbce75ff19 [ 150.721874] RDX: 00007ffd5d16785e RSI: 000000002866cd36 RDI: 0000000000000002 [ 150.722827] RBP: 0000000000000006 R08: 000000002866cd36 R09: 00007ffd5d16785e [ 150.723781] R10: 0000000000000004 R11: 0000000000000206 R12: 0000000000404d80 [ 150.724650] R13: 00007ffd5d16784d R14: 00007ffd5d167238 R15: 000000000040411f [ 150.725447] Code: 83 c4 08 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 85 ff 74 23 55 48 89 e5 53 48 89 fb <48> 8b 3f e8 06 21 c5 ff 48 8b 7b 08 e8 fd 20 c5 ff 48 89 df e8 [ 150.727489] RIP: asymmetric_key_free_kids+0x12/0x30 RSP: ffffc90000353e58 [ 150.728117] CR2: 0000000000000001 [ 150.728430] ---[ end trace f7f8fe1da2d5ae8d ]--- Fixes: 4d8c0250b841 ("KEYS: Call ->free_preparse() even after ->preparse() returns an error") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <james.l.morris@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* KEYS: fix dereferencing NULL payload with nonzero lengthEric Biggers2017-06-141-2/+2
| | | | | | | | | | | | | | | | | | | | commit 5649645d725c73df4302428ee4e02c869248b4c5 upstream. sys_add_key() and the KEYCTL_UPDATE operation of sys_keyctl() allowed a NULL payload with nonzero length to be passed to the key type's ->preparse(), ->instantiate(), and/or ->update() methods. Various key types including asymmetric, cifs.idmap, cifs.spnego, and pkcs7_test did not handle this case, allowing an unprivileged user to trivially cause a NULL pointer dereference (kernel oops) if one of these key types was present. Fix it by doing the copy_from_user() when 'plen' is nonzero rather than when '_payload' is non-NULL, causing the syscall to fail with EFAULT as expected when an invalid buffer is specified. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <james.l.morris@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* crypto: asymmetric_keys - handle EBUSY due to backlog correctlyGilad Ben-Yossef2017-06-141-1/+1
| | | | | | | | | | | | | | | | | commit e68368aed56324e2e38d4f6b044bb8cf82077fc2 upstream. public_key_verify_signature() was passing the CRYPTO_TFM_REQ_MAY_BACKLOG flag to akcipher_request_set_callback() but was not handling correctly the case where a -EBUSY error could be returned from the call to crypto_akcipher_verify() if backlog was used, possibly casuing data corruption due to use-after-free of buffers. Resolve this by handling -EBUSY correctly. Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ptrace: Properly initialize ptracer_cred on forkEric W. Biederman2017-06-142-9/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | commit c70d9d809fdeecedb96972457ee45c49a232d97f upstream. When I introduced ptracer_cred I failed to consider the weirdness of fork where the task_struct copies the old value by default. This winds up leaving ptracer_cred set even when a process forks and the child process does not wind up being ptraced. Because ptracer_cred is not set on non-ptraced processes whose parents were ptraced this has broken the ability of the enlightenment window manager to start setuid children. Fix this by properly initializing ptracer_cred in ptrace_init_task This must be done with a little bit of care to preserve the current value of ptracer_cred when ptrace carries through fork. Re-reading the ptracer_cred from the ptracing process at this point is inconsistent with how PT_PTRACE_CAP has been maintained all of these years. Tested-by: Takashi Iwai <tiwai@suse.de> Fixes: 64b875f7ac8a ("ptrace: Capture the ptracer's creds not PT_PTRACE_CAP") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* serial: ifx6x60: fix use-after-free on module unloadJohan Hovold2017-06-141-1/+1
| | | | | | | | | | | | | | commit 1e948479b3d63e3ac0ecca13cbf4921c7d17c168 upstream. Make sure to deregister the SPI driver before releasing the tty driver to avoid use-after-free in the SPI remove callback where the tty devices are deregistered. Fixes: 72d4724ea54c ("serial: ifx6x60: Add modem power off function in the platform reboot process") Cc: Jun Chen <jun.d.chen@intel.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* arch/sparc: support NR_CPUS = 4096Jane Chu2017-06-142-6/+15
| | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c79a13734d104b5b147d7cb0870276ccdd660dae ] Linux SPARC64 limits NR_CPUS to 4064 because init_cpu_send_mondo_info() only allocates a single page for NR_CPUS mondo entries. Thus we cannot use all 4096 CPUs on some SPARC platforms. To fix, allocate (2^order) pages where order is set according to the size of cpu_list for possible cpus. Since cpu_list_pa and cpu_mondo_block_pa are not used in asm code, there are no imm13 offsets from the base PA that will break because they can only reach one page. Orabug: 25505750 Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Reviewed-by: Atish Patra <atish.patra@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sparc64: delete old wrap codePavel Tatashin2017-06-146-45/+1
| | | | | | | | | | | | | | [ Upstream commit 0197e41ce70511dc3b71f7fefa1a676e2b5cd60b ] The old method that is using xcall and softint to get new context id is deleted, as it is replaced by a method of using per_cpu_secondary_mm without xcall to perform the context wrap. Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sparc64: new context wrapPavel Tatashin2017-06-141-27/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a0582f26ec9dfd5360ea2f35dd9a1b026f8adda0 ] The current wrap implementation has a race issue: it is called outside of the ctx_alloc_lock, and also does not wait for all CPUs to complete the wrap. This means that a thread can get a new context with a new version and another thread might still be running with the same context. The problem is especially severe on CPUs with shared TLBs, like sun4v. I used the following test to very quickly reproduce the problem: - start over 8K processes (must be more than context IDs) - write and read values at a memory location in every process. Very quickly memory corruptions start happening, and what we read back does not equal what we wrote. Several approaches were explored before settling on this one: Approach 1: Move smp_new_mmu_context_version() inside ctx_alloc_lock, and wait for every process to complete the wrap. (Note: every CPU must WAIT before leaving smp_new_mmu_context_version_client() until every one arrives). This approach ends up with deadlocks, as some threads own locks which other threads are waiting for, and they never receive softint until these threads exit smp_new_mmu_context_version_client(). Since we do not allow the exit, deadlock happens. Approach 2: Handle wrap right during mondo interrupt. Use etrap/rtrap to enter into into C code, and issue new versions to every CPU. This approach adds some overhead to runtime: in switch_mm() we must add some checks to make sure that versions have not changed due to wrap while we were loading the new secondary context. (could be protected by PSTATE_IE but that degrades performance as on M7 and older CPUs as it takes 50 cycles for each access). Also, we still need a global per-cpu array of MMs to know where we need to load new contexts, otherwise we can change context to a thread that is going way (if we received mondo between switch_mm() and switch_to() time). Finally, there are some issues with window registers in rtrap() when context IDs are changed during CPU mondo time. The approach in this patch is the simplest and has almost no impact on runtime. We use the array with mm's where last secondary contexts were loaded onto CPUs and bump their versions to the new generation without changing context IDs. If a new process comes in to get a context ID, it will go through get_new_mmu_context() because of version mismatch. But the running processes do not need to be interrupted. And wrap is quicker as we do not need to xcall and wait for everyone to receive and complete wrap. Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sparc64: add per-cpu mm of secondary contextsPavel Tatashin2017-06-142-2/+4
| | | | | | | | | | | | | [ Upstream commit 7a5b4bbf49fe86ce77488a70c5dccfe2d50d7a2d ] The new wrap is going to use information from this array to figure out mm's that currently have valid secondary contexts setup. Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sparc64: redefine first versionPavel Tatashin2017-06-142-4/+4
| | | | | | | | | | | | | | [ Upstream commit c4415235b2be0cc791572e8e7f7466ab8f73a2bf ] CTX_FIRST_VERSION defines the first context version, but also it defines first context. This patch redefines it to only include the first context version. Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sparc64: combine activate_mm and switch_mmPavel Tatashin2017-06-141-20/+1
| | | | | | | | | | | | | | | [ Upstream commit 14d0334c6748ff2aedb3f2f7fdc51ee90a9b54e7 ] The only difference between these two functions is that in activate_mm we unconditionally flush context. However, there is no need to keep this difference after fixing a bug where cpumask was not reset on a wrap. So, in this patch we combine these. Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sparc64: reset mm cpumask after wrapPavel Tatashin2017-06-141-0/+2
| | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 588974857359861891f478a070b1dc7ae04a3880 ] After a wrap (getting a new context version) a process must get a new context id, which means that we would need to flush the context id from the TLB before running for the first time with this ID on every CPU. But, we use mm_cpumask to determine if this process has been running on this CPU before, and this mask is not reset after a wrap. So, there are two possible fixes for this issue: 1. Clear mm cpumask whenever mm gets a new context id 2. Unconditionally flush context every time process is running on a CPU This patch implements the first solution Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sparc: Machine description indices can varyJames Clarke2017-06-142-4/+65
| | | | | | | | | | | | | | [ Upstream commit c982aa9c304bf0b9a7522fd118fed4afa5a0263c ] VIO devices were being looked up by their index in the machine description node block, but this often varies over time as devices are added and removed. Instead, store the ID and look up using the type, config handle and ID. Signed-off-by: James Clarke <jrtc27@jrtc27.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112541 Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sparc64: mm: fix copy_tsb to correctly copy huge page TSBsMike Kravetz2017-06-142-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 654f4807624a657f364417c2a7454f0df9961734 ] When a TSB grows beyond its current capacity, a new TSB is allocated and copy_tsb is called to copy entries from the old TSB to the new. A hash shift based on page size is used to calculate the index of an entry in the TSB. copy_tsb has hard coded PAGE_SHIFT in these calculations. However, for huge page TSBs the value REAL_HPAGE_SHIFT should be used. As a result, when copy_tsb is called for a huge page TSB the entries are placed at the incorrect index in the newly allocated TSB. When doing hardware table walk, the MMU does not match these entries and we end up in the TSB miss handling code. This code will then create and write an entry to the correct index in the TSB. We take a performance hit for the table walk miss and recreation of these entries. Pass a new parameter to copy_tsb that is the page size shift to be used when copying the TSB. Suggested-by: Anthony Yznaga <anthony.yznaga@oracle.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sparc64: Add __multi3 for gcc 7.x and later.David S. Miller2017-06-142-0/+36
| | | | | | | | [ Upstream commit 1b4af13ff2cc6897557bb0b8d9e2fad4fa4d67aa ] Reported-by: Waldemar Brodkorb <wbx@openadk.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: bridge: start hello timer only if device is upNikolay Aleksandrov2017-06-141-1/+2
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit aeb073241fe7a2b932e04e20c60e47718332877f ] When the transition of NO_STP -> KERNEL_STP was fixed by always calling mod_timer in br_stp_start, it introduced a new regression which causes the timer to be armed even when the bridge is down, and since we stop the timers in its ndo_stop() function, they never get disabled if the device is destroyed before it's upped. To reproduce: $ while :; do ip l add br0 type bridge hello_time 100; brctl stp br0 on; ip l del br0; done; CC: Xin Long <lucien.xin@gmail.com> CC: Ivan Vecera <cera@cera.cz> CC: Sebastian Ott <sebott@linux.vnet.ibm.com> Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Fixes: 6d18c732b95c ("bridge: start hello_timer when enabling KERNEL_STP in br_stp_start") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: stmmac: fix completely hung TX when using TSONiklas Cassel2017-06-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 426849e6611f2092553f8d53372ae310818a6292 ] stmmac_tso_allocator can fail to set the Last Descriptor bit on a descriptor that actually was the last descriptor. This happens when the buffer of the last descriptor ends up having a size of exactly TSO_MAX_BUFF_SIZE. When the IP eventually reaches the next last descriptor, which actually has the bit set, the DMA will hang. When the DMA hangs, we get a tx timeout, however, since stmmac does not do a complete reset of the IP in stmmac_tx_timeout, we end up in a state with completely hung TX. Signed-off-by: Niklas Cassel <niklas.cassel@axis.com> Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com> Acked-by: Alexandre TORGUE <alexandre.torgue@st.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: ethoc: enable NAPI before poll may be scheduledMax Filippov2017-06-141-1/+2
| | | | | | | | | | | | | | | | | | [ Upstream commit d220b942a4b6a0640aee78841608f4aa5e8e185e ] ethoc_reset enables device interrupts, ethoc_interrupt may schedule a NAPI poll before NAPI is enabled in the ethoc_open, which results in device being unable to send or receive anything until it's closed and reopened. In case the device is flooded with ingress packets it may be unable to recover at all. Move napi_enable above ethoc_reset in the ethoc_open to fix that. Fixes: a1702857724f ("net: Add support for the OpenCores 10/100 Mbps Ethernet MAC.") Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Reviewed-by: Tobias Klauser <tklauser@distanz.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net/ipv6: Fix CALIPSO causing GPF with datagram supportRichard Haines2017-06-141-1/+5
| | | | | | | | | | | | | | | [ Upstream commit e3ebdb20fddacded2740a333ff66781e0d28b05c ] When using CALIPSO with IPPROTO_UDP it is possible to trigger a GPF as the IP header may have moved. Also update the payload length after adding the CALIPSO option. Signed-off-by: Richard Haines <richard_c_haines@btinternet.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Huw Davies <huw@codeweavers.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* net: ping: do not abuse udp_poll()Eric Dumazet2017-06-144-3/+4
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit 77d4b1d36926a9b8387c6b53eeba42bcaaffcea3 ] Alexander reported various KASAN messages triggered in recent kernels The problem is that ping sockets should not use udp_poll() in the first place, and recent changes in UDP stack finally exposed this old bug. Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind") Fixes: 6d0bfe226116 ("net: ipv6: Add IPv6 support to the ping socket.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Sasha Levin <alexander.levin@verizon.com> Cc: Solar Designer <solar@openwall.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: Lorenzo Colitti <lorenzo@google.com> Acked-By: Lorenzo Colitti <lorenzo@google.com> Tested-By: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ipv6: Fix leak in ipv6_gso_segment().David S. Miller2017-06-141-1/+3
| | | | | | | | | | | | [ Upstream commit e3e86b5119f81e5e2499bea7ea1ebe8ac6aab789 ] If ip6_find_1stfragopt() fails and we return an error we have to free up 'segs' because nobody else is going to. Fixes: 2423496af35d ("ipv6: Prevent overrun when parsing v6 header options") Reported-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* vxlan: fix use-after-free on deletionMark Bloch2017-06-141-6/+13
| | | | | | | | | | | | | | | | | | | | [ Upstream commit a53cb29b0af346af44e4abf13d7e59f807fba690 ] Adding a vxlan interface to a socket isn't symmetrical, while adding is done in vxlan_open() the deletion is done in vxlan_dellink(). This can cause a use-after-free error when we close the vxlan interface before deleting it. We add vxlan_vs_del_dev() to match vxlan_vs_add_dev() and call it from vxlan_stop() to match the call from vxlan_open(). Fixes: 56ef9c909b40 ("vxlan: Move socket initialization to within rtnl scope") Acked-by: Jiri Benc <jbenc@redhat.com> Tested-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Mark Bloch <markb@mellanox.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tcp: disallow cwnd undo when switching congestion controlYuchung Cheng2017-06-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 44abafc4cc094214a99f860f778c48ecb23422fc ] When the sender switches its congestion control during loss recovery, if the recovery is spurious then it may incorrectly revert cwnd and ssthresh to the older values set by a previous congestion control. Consider a congestion control (like BBR) that does not use ssthresh and keeps it infinite: the connection may incorrectly revert cwnd to an infinite value when switching from BBR to another congestion control. This patch fixes it by disallowing such cwnd undo operation upon switching congestion control. Note that undo_marker is not reset s.t. the packets that were incorrectly marked lost would be corrected. We only avoid undoing the cwnd in tcp_undo_cwnd_reduction(). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* cxgb4: avoid enabling napi twice to the same queueGanesh Goudar2017-06-141-0/+4
| | | | | | | | | | | | [ Upstream commit e7519f9926f1d0d11c776eb0475eb098c7760f68 ] Take uld mutex to avoid race between cxgb_up() and cxgb4_register_uld() to enable napi for the same uld queue. Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ipv6: xfrm: Handle errors reported by xfrm6_find_1stfragopt()Ben Hutchings2017-06-142-0/+4
| | | | | | | | | | | | | [ Upstream commit 6e80ac5cc992ab6256c3dae87f7e57db15e1a58c ] xfrm6_find_1stfragopt() may now return an error code and we must not treat it as a length. Fixes: 2423496af35d ("ipv6: Prevent overrun when parsing v6 header options") Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Acked-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* vxlan: eliminate cached dst leakLance Richardson2017-06-141-3/+17
| | | | | | | | | | | | | | | | | | | [ Upstream commit 35cf2845563c1aaa01d27bd34d64795c4ae72700 ] After commit 0c1d70af924b ("net: use dst_cache for vxlan device"), cached dst entries could be leaked when more than one remote was present for a given vxlan_fdb entry, causing subsequent netns operations to block indefinitely and "unregister_netdevice: waiting for lo to become free." messages to appear in the kernel log. Fix by properly releasing cached dst and freeing resources in this case. Fixes: 0c1d70af924b ("net: use dst_cache for vxlan device") Signed-off-by: Lance Richardson <lrichard@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bnx2x: Fix Multi-CosMintz, Yuval2017-06-141-1/+1
| | | | | | | | | | | | | | | | | | [ Upstream commit 3968d38917eb9bd0cd391265f6c9c538d9b33ffa ] Apparently multi-cos isn't working for bnx2x quite some time - driver implements ndo_select_queue() to allow queue-selection for FCoE, but the regular L2 flow would cause it to modulo the fallback's result by the number of queues. The fallback would return a queue matching the needed tc [via __skb_tx_hash()], but since the modulo is by the number of TSS queues where number of TCs is not accounted, transmission would always be done by a queue configured into using TC0. Fixes: ada7c19e6d27 ("bnx2x: use XPS if possible for bnx2x_select_queue instead of pure hash") Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Linux 4.9.31v4.9.31Greg Kroah-Hartman2017-06-071-1/+1
|
* xfs: Fix off-by-in in loop termination in xfs_find_get_desired_pgoff()Jan Kara2017-06-071-1/+1
| | | | | | | | | | | | | | | commit d7fd24257aa60316bf81093f7f909dc9475ae974 upstream. There is an off-by-one error in loop termination conditions in xfs_find_get_desired_pgoff() since 'end' may index a page beyond end of desired range if 'endoff' is page aligned. It doesn't have any visible effects but still it is good to fix it. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: fix unaligned access in xfs_btree_visit_blocksEric Sandeen2017-06-071-1/+1
| | | | | | | | | | | | | | | | commit a4d768e702de224cc85e0c8eac9311763403b368 upstream. This structure copy was throwing unaligned access warnings on sparc64: Kernel unaligned access at TPC[1043c088] xfs_btree_visit_blocks+0x88/0xe0 [xfs] xfs_btree_copy_ptrs does a memcpy, which avoids it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: avoid mount-time deadlock in CoW extent recoveryDarrick J. Wong2017-06-071-12/+31
| | | | | | | | | | | | | | | commit 3ecb3ac7b950ff8f6c6a61e8b7b0d6e3546429a0 upstream. If a malicious user corrupts the refcount btree to cause a cycle between different levels of the tree, the next mount attempt will deadlock in the CoW recovery routine while grabbing buffer locks. We can use the ability to re-grab a buffer that was previous locked to a transaction to avoid deadlocks, so do that here. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: xfs_trans_alloc_emptyChristoph Hellwig2017-06-072-0/+24
| | | | | | | | | | | This is a partial cherry-pick of commit e89c041338 ("xfs: implement the GETFSMAP ioctl"), which also adds this helper, and a great example of why feature patches should be properly split into their parts. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [hch: split from the larger patch for -stable] Signed-off-by: Christoph Hellwig <hch@lst.de>
* xfs: bad assertion for delalloc an extent that start at i_sizeZorro Lang2017-06-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit 892d2a5f705723b2cb488bfb38bcbdcf83273184 upstream. By run fsstress long enough time enough in RHEL-7, I find an assertion failure (harder to reproduce on linux-4.11, but problem is still there): XFS: Assertion failed: (iflags & BMV_IF_DELALLOC) != 0, file: fs/xfs/xfs_bmap_util.c The assertion is in xfs_getbmap() funciton: if (map[i].br_startblock == DELAYSTARTBLOCK && --> map[i].br_startoff <= XFS_B_TO_FSB(mp, XFS_ISIZE(ip))) ASSERT((iflags & BMV_IF_DELALLOC) != 0); When map[i].br_startoff == XFS_B_TO_FSB(mp, XFS_ISIZE(ip)), the startoff is just at EOF. But we only need to make sure delalloc extents that are within EOF, not include EOF. Signed-off-by: Zorro Lang <zlang@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: BMAPX shouldn't barf on inline-format directoriesDarrick J. Wong2017-06-071-2/+6
| | | | | | | | | | | | | | | | commit 6eadbf4c8ba816c10d1c97bed9aa861d9fd17809 upstream. When we're fulfilling a BMAPX request, jump out early if the data fork is in local format. This prevents us from hitting a debugging check in bmapi_read and barfing errors back to userspace. The on-disk extent count check later isn't sufficient for IF_DELALLOC mode because da extents are in memory and not on disk. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: fix indlen accounting error on partial delalloc conversionBrian Foster2017-06-071-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 0daaecacb83bc6b656a56393ab77a31c28139bc7 upstream. The delalloc -> real block conversion path uses an incorrect calculation in the case where the middle part of a delalloc extent is being converted. This is documented as a rare situation because XFS generally attempts to maximize contiguity by converting as much of a delalloc extent as possible. If this situation does occur, the indlen reservation for the two new delalloc extents left behind by the conversion of the middle range is calculated and compared with the original reservation. If more blocks are required, the delta is allocated from the global block pool. This delta value can be characterized as the difference between the new total requirement (temp + temp2) and the currently available reservation minus those blocks that have already been allocated (startblockval(PREV.br_startblock) - allocated). The problem is that the current code does not account for previously allocated blocks correctly. It subtracts the current allocation count from the (new - old) delta rather than the old indlen reservation. This means that more indlen blocks than have been allocated end up stashed in the remaining extents and free space accounting is broken as a result. Fix up the calculation to subtract the allocated block count from the original extent indlen and thus correctly allocate the reservation delta based on the difference between the new total requirement and the unused blocks from the original reservation. Also remove a bogus assert that contradicts the fact that the new indlen reservation can be larger than the original indlen reservation. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: fix use-after-free in xfs_finish_page_writebackEryu Guan2017-06-071-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 161f55efba5ddccc690139fae9373cafc3447a97 upstream. Commit 28b783e47ad7 ("xfs: bufferhead chains are invalid after end_page_writeback") fixed one use-after-free issue by pre-calculating the loop conditionals before calling bh->b_end_io() in the end_io processing loop, but it assigned 'next' pointer before checking end offset boundary & breaking the loop, at which point the bh might be freed already, and caused use-after-free. This is caught by KASAN when running fstests generic/127 on sub-page block size XFS. [ 2517.244502] run fstests generic/127 at 2017-04-27 07:30:50 [ 2747.868840] ================================================================== [ 2747.876949] BUG: KASAN: use-after-free in xfs_destroy_ioend+0x3d3/0x4e0 [xfs] at addr ffff8801395ae698 ... [ 2747.918245] Call Trace: [ 2747.920975] dump_stack+0x63/0x84 [ 2747.924673] kasan_object_err+0x21/0x70 [ 2747.928950] kasan_report+0x271/0x530 [ 2747.933064] ? xfs_destroy_ioend+0x3d3/0x4e0 [xfs] [ 2747.938409] ? end_page_writeback+0xce/0x110 [ 2747.943171] __asan_report_load8_noabort+0x19/0x20 [ 2747.948545] xfs_destroy_ioend+0x3d3/0x4e0 [xfs] [ 2747.953724] xfs_end_io+0x1af/0x2b0 [xfs] [ 2747.958197] process_one_work+0x5ff/0x1000 [ 2747.962766] worker_thread+0xe4/0x10e0 [ 2747.966946] kthread+0x2d3/0x3d0 [ 2747.970546] ? process_one_work+0x1000/0x1000 [ 2747.975405] ? kthread_create_on_node+0xc0/0xc0 [ 2747.980457] ? syscall_return_slowpath+0xe6/0x140 [ 2747.985706] ? do_page_fault+0x30/0x80 [ 2747.989887] ret_from_fork+0x2c/0x40 [ 2747.993874] Object at ffff8801395ae690, in cache buffer_head size: 104 [ 2748.001155] Allocated: [ 2748.003782] PID = 8327 [ 2748.006411] save_stack_trace+0x1b/0x20 [ 2748.010688] save_stack+0x46/0xd0 [ 2748.014383] kasan_kmalloc+0xad/0xe0 [ 2748.018370] kasan_slab_alloc+0x12/0x20 [ 2748.022648] kmem_cache_alloc+0xb8/0x1b0 [ 2748.027024] alloc_buffer_head+0x22/0xc0 [ 2748.031399] alloc_page_buffers+0xd1/0x250 [ 2748.035968] create_empty_buffers+0x30/0x410 [ 2748.040730] create_page_buffers+0x120/0x1b0 [ 2748.045493] __block_write_begin_int+0x17a/0x1800 [ 2748.050740] iomap_write_begin+0x100/0x2f0 [ 2748.055308] iomap_zero_range_actor+0x253/0x5c0 [ 2748.060362] iomap_apply+0x157/0x270 [ 2748.064347] iomap_zero_range+0x5a/0x80 [ 2748.068624] iomap_truncate_page+0x6b/0xa0 [ 2748.073227] xfs_setattr_size+0x1f7/0xa10 [xfs] [ 2748.078312] xfs_vn_setattr_size+0x68/0x140 [xfs] [ 2748.083589] xfs_file_fallocate+0x4ac/0x820 [xfs] [ 2748.088838] vfs_fallocate+0x2cf/0x780 [ 2748.093021] SyS_fallocate+0x48/0x80 [ 2748.097006] do_syscall_64+0x18a/0x430 [ 2748.101186] return_from_SYSCALL_64+0x0/0x6a [ 2748.105948] Freed: [ 2748.108189] PID = 8327 [ 2748.110816] save_stack_trace+0x1b/0x20 [ 2748.115093] save_stack+0x46/0xd0 [ 2748.118788] kasan_slab_free+0x73/0xc0 [ 2748.122969] kmem_cache_free+0x7a/0x200 [ 2748.127247] free_buffer_head+0x41/0x80 [ 2748.131524] try_to_free_buffers+0x178/0x250 [ 2748.136316] xfs_vm_releasepage+0x2e9/0x3d0 [xfs] [ 2748.141563] try_to_release_page+0x100/0x180 [ 2748.146325] invalidate_inode_pages2_range+0x7da/0xcf0 [ 2748.152087] xfs_shift_file_space+0x37d/0x6e0 [xfs] [ 2748.157557] xfs_collapse_file_space+0x49/0x120 [xfs] [ 2748.163223] xfs_file_fallocate+0x2a7/0x820 [xfs] [ 2748.168462] vfs_fallocate+0x2cf/0x780 [ 2748.172642] SyS_fallocate+0x48/0x80 [ 2748.176629] do_syscall_64+0x18a/0x430 [ 2748.180810] return_from_SYSCALL_64+0x0/0x6a Fixed it by checking on offset against end & breaking out first, dereference bh only if there're still bufferheads to process. Signed-off-by: Eryu Guan <eguan@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: reserve enough blocks to handle btree splits when remappingDarrick J. Wong2017-06-073-9/+37
| | | | | | | | | | | | | | | | | | | | | | | commit fe0be23e68200573de027de9b8cc2b27e7fce35e upstream. In xfs_reflink_end_cow, we erroneously reserve only enough blocks to handle adding 1 extent. This is problematic if we fragment free space, have to do CoW, and then have to perform multiple bmap btree expansions. Furthermore, the BUI recovery routine doesn't reserve /any/ blocks to handle btree splits, so log recovery fails after our first error causes the filesystem to go down. Therefore, refactor the transaction block reservation macros until we have a macro that works for our deferred (re)mapping activities, and fix both problems by using that macro. With 1k blocks we can hit this fairly often in g/187 if the scratch fs is big enough. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>