summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* openvswitch: Remove padding from packet before L3+ conntrack processingEd Swierk2018-04-261-0/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 9382fe71c0058465e942a633869629929102843d ] IPv4 and IPv6 packets may arrive with lower-layer padding that is not included in the L3 length. For example, a short IPv4 packet may have up to 6 bytes of padding following the IP payload when received on an Ethernet device with a minimum packet length of 64 bytes. Higher-layer processing functions in netfilter (e.g. nf_ip_checksum(), and help() in nf_conntrack_ftp) assume skb->len reflects the length of the L3 header and payload, rather than referring back to ip_hdr->tot_len or ipv6_hdr->payload_len, and get confused by lower-layer padding. In the normal IPv4 receive path, ip_rcv() trims the packet to ip_hdr->tot_len before invoking netfilter hooks. In the IPv6 receive path, ip6_rcv() does the same using ipv6_hdr->payload_len. Similarly in the br_netfilter receive path, br_validate_ipv4() and br_validate_ipv6() trim the packet to the L3 length before invoking netfilter hooks. Currently in the OVS conntrack receive path, ovs_ct_execute() pulls the skb to the L3 header but does not trim it to the L3 length before calling nf_conntrack_in(NF_INET_PRE_ROUTING). When nf_conntrack_proto_tcp encounters a packet with lower-layer padding, nf_ip_checksum() fails causing a "nf_ct_tcp: bad TCP checksum" log message. While extra zero bytes don't affect the checksum, the length in the IP pseudoheader does. That length is based on skb->len, and without trimming, it doesn't match the length the sender used when computing the checksum. In ovs_ct_execute(), trim the skb to the L3 length before higher-layer processing. Signed-off-by: Ed Swierk <eswierk@skyportsystems.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm/fadvise: discard partial page if endbyte is also EOFshidao.ytt2018-04-261-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a7ab400d6fe73d0119fdc234e9982a6f80faea9f ] During our recent testing with fadvise(FADV_DONTNEED), we find that if given offset/length is not page-aligned, the last page will not be discarded. The tool we use is vmtouch (https://hoytech.com/vmtouch/), we map a 10KB-sized file into memory and then try to run this tool to evict the whole file mapping, but the last single page always remains staying in the memory: $./vmtouch -e test_10K Files: 1 Directories: 0 Evicted Pages: 3 (12K) Elapsed: 2.1e-05 seconds $./vmtouch test_10K Files: 1 Directories: 0 Resident Pages: 1/3 4K/12K 33.3% Elapsed: 5.5e-05 seconds However when we test with an older kernel, say 3.10, this problem is gone. So we wonder if this is a regression: $./vmtouch -e test_10K Files: 1 Directories: 0 Evicted Pages: 3 (12K) Elapsed: 8.2e-05 seconds $./vmtouch test_10K Files: 1 Directories: 0 Resident Pages: 0/3 0/12K 0% <-- partial page also discarded Elapsed: 5e-05 seconds After digging a little bit into this problem, we find it seems not a regression. Not discarding partial page is likely to be on purpose according to commit 441c228f817f ("mm: fadvise: document the fadvise(FADV_DONTNEED) behaviour for partial pages") written by Mel Gorman. He explained why partial pages should be preserved instead of being discarded when using fadvise(FADV_DONTNEED). However, the interesting part is that the actual code did NOT work as the same as it was described, the partial page was still discarded anyway, due to a calculation mistake of `end_index' passed to invalidate_mapping_pages(). This mistake has not been fixed until recently, that's why we fail to reproduce our problem in old kernels. The fix is done in commit 18aba41cbf ("mm/fadvise.c: do not discard partial pages with POSIX_FADV_DONTNEED") by Oleg Drokin. Back to the original testing, our problem becomes that there is a special case that, if the page-unaligned `endbyte' is also the end of file, it is not necessary at all to preserve the last partial page, as we all know no one else will use the rest of it. It should be safe enough if we just discard the whole page. So we add an EOF check in this patch. We also find a poosbile real world issue in mainline kernel. Assume such scenario: A userspace backup application want to backup a huge amount of small files (<4k) at once, the developer might (I guess) want to use fadvise(FADV_DONTNEED) to save memory. However, FADV_DONTNEED won't really happen since the only page mapped is a partial page, and kernel will preserve it. Our patch also fixes this problem, since we know the endbyte is EOF, so we discard it. 
Here is a simple reproducer to reproduce and verify each scenario we described above: test_fadvise.c ============================== #include <sys/mman.h> #include <sys/stat.h> #include <fcntl.h> #include <stdlib.h> #include <string.h> #include <stdio.h> #include <unistd.h> int main(int argc, char **argv) { int i, fd, ret, len; struct stat buf; void *addr; unsigned char *vec; char *strbuf; ssize_t pagesize = getpagesize(); ssize_t filesize; fd = open(argv[1], O_RDWR|O_CREAT, S_IRUSR|S_IWUSR); if (fd < 0) return -1; filesize = strtoul(argv[2], NULL, 10); strbuf = malloc(filesize); memset(strbuf, 42, filesize); write(fd, strbuf, filesize); free(strbuf); fsync(fd); len = (filesize + pagesize - 1) / pagesize; printf("length of pages: %d\n", len); addr = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0); if (addr == MAP_FAILED) return -1; ret = posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED); if (ret < 0) return -1; vec = malloc(len); ret = mincore(addr, filesize, (void *)vec); if (ret < 0) return -1; for (i = 0; i < len; i++) printf("pages[%d]: %x\n", i, vec[i] & 0x1); free(vec); close(fd); return 0; } ============================== Test 1: running on kernel with commit 18aba41cbf reverted: [root@caspar ~]# uname -r 4.15.0-rc6.revert+ [root@caspar ~]# ./test_fadvise file1 1024 length of pages: 1 pages[0]: 0 # <-- partial page discarded [root@caspar ~]# ./test_fadvise file2 8192 length of pages: 2 pages[0]: 0 pages[1]: 0 [root@caspar ~]# ./test_fadvise file3 10240 length of pages: 3 pages[0]: 0 pages[1]: 0 pages[2]: 0 # <-- partial page discarded Test 2: running on mainline kernel: [root@caspar ~]# uname -r 4.15.0-rc6+ [root@caspar ~]# ./test_fadvise test1 1024 length of pages: 1 pages[0]: 1 # <-- partial and the only page not discarded [root@caspar ~]# ./test_fadvise test2 8192 length of pages: 2 pages[0]: 0 pages[1]: 0 [root@caspar ~]# ./test_fadvise test3 10240 length of pages: 3 pages[0]: 0 pages[1]: 0 pages[2]: 1 # <-- partial page not discarded Test 3: running on kernel with this patch: [root@caspar ~]# uname -r 4.15.0-rc6.patched+ [root@caspar ~]# ./test_fadvise test1 1024 length of pages: 1 pages[0]: 0 # <-- partial page and EOF, discarded [root@caspar ~]# ./test_fadvise test2 8192 length of pages: 2 pages[0]: 0 pages[1]: 0 [root@caspar ~]# ./test_fadvise test3 10240 length of pages: 3 pages[0]: 0 pages[1]: 0 pages[2]: 0 # <-- partial page and EOF, discarded [akpm@linux-foundation.org: tweak code comment] Link: http://lkml.kernel.org/r/5222da9ee20e1695eaabb69f631f200d6e6b8876.1515132470.git.jinli.zjl@alibaba-inc.com Signed-off-by: shidao.ytt <shidao.ytt@alibaba-inc.com> Signed-off-by: Caspar Zhang <jinli.zjl@alibaba-inc.com> Reviewed-by: Oliver Yang <zhiche.yy@alibaba-inc.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm: pin address_space before dereferencing it while isolating an LRU pageMel Gorman2018-04-261-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 69d763fc6d3aee787a3e8c8c35092b4f4960fa5d ] Minchan Kim asked the following question -- what locks protects address_space destroying when race happens between inode trauncation and __isolate_lru_page? Jan Kara clarified by describing the race as follows CPU1 CPU2 truncate(inode) __isolate_lru_page() ... truncate_inode_page(mapping, page); delete_from_page_cache(page) spin_lock_irqsave(&mapping->tree_lock, flags); __delete_from_page_cache(page, NULL) page_cache_tree_delete(..) ... mapping = page_mapping(page); page->mapping = NULL; ... spin_unlock_irqrestore(&mapping->tree_lock, flags); page_cache_free_page(mapping, page) put_page(page) if (put_page_testzero(page)) -> false - inode now has no pages and can be freed including embedded address_space if (mapping && !mapping->a_ops->migratepage) - we've dereferenced mapping which is potentially already free. The race is theoretically possible but unlikely. Before the delete_from_page_cache, truncate_cleanup_page is called so the page is likely to be !PageDirty or PageWriteback which gets skipped by the only caller that checks the mappping in __isolate_lru_page. Even if the race occurs, a substantial amount of work has to happen during a tiny window with no preemption but it could potentially be done using a virtual machine to artifically slow one CPU or halt it during the critical window. This patch should eliminate the race with truncation by try-locking the page before derefencing mapping and aborting if the lock was not acquired. There was a suggestion from Huang Ying to use RCU as a side-effect to prevent mapping being freed. However, I do not like the solution as it's an unconventional means of preserving a mapping and it's not a context where rcu_read_lock is obviously protecting rcu data. Link: http://lkml.kernel.org/r/20180104102512.2qos3h5vqzeisrek@techsingularity.net Fixes: c82449352854 ("mm: compaction: make isolate_lru_page() filter-aware again") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Minchan Kim <minchan@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm: thp: use down_read_trylock() in khugepaged to avoid long blockYang Shi2018-04-261-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 3b454ad35043dfbd3b5d2bb92b0991d6342afb44 ] In the current design, khugepaged needs to acquire mmap_sem before scanning an mm. But in some corner cases, khugepaged may scan a process which is modifying its memory mapping, so khugepaged blocks in uninterruptible state. But the process might hold the mmap_sem for a long time when modifying a huge memory space and it may trigger the below khugepaged hung issue: INFO: task khugepaged:270 blocked for more than 120 seconds. Tainted: G E 4.9.65-006.ali3000.alios7.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. khugepaged D 0 270 2 0x00000000  ffff883f3deae4c0 0000000000000000 ffff883f610596c0 ffff883f7d359440 ffff883f63818000 ffffc90019adfc78 ffffffff817079a5 d67e5aa8c1860a64 0000000000000246 ffff883f7d359440 ffffc90019adfc88 ffff883f610596c0 Call Trace: schedule+0x36/0x80 rwsem_down_read_failed+0xf0/0x150 call_rwsem_down_read_failed+0x18/0x30 down_read+0x20/0x40 khugepaged+0x476/0x11d0 kthread+0xe6/0x100 ret_from_fork+0x25/0x30 So it sounds pointless to just block khugepaged waiting for the semaphore so replace down_read() with down_read_trylock() to move to scan the next mm quickly instead of just blocking on the semaphore so that other processes can get more chances to install THP. Then khugepaged can come back to scan the skipped mm when it has finished the current round full_scan. And it appears that the change can improve khugepaged efficiency a little bit. Below is the test result when running LTP on a 24 cores 4GB memory 2 nodes NUMA VM: pristine w/ trylock full_scan 197 187 pages_collapsed 21 26 thp_fault_alloc 40818 44466 thp_fault_fallback 18413 16679 thp_collapse_alloc 21 150 thp_collapse_alloc_failed 14 16 thp_file_alloc 369 369 [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: tweak comment] [arnd@arndb.de: avoid uninitialized variable use] Link: http://lkml.kernel.org/r/20171215125129.2948634-1-arnd@arndb.de Link: http://lkml.kernel.org/r/1513281203-54878-1-git-send-email-yang.s@alibaba-inc.com Signed-off-by: Yang Shi <yang.s@alibaba-inc.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sparc64: update pmdp_invalidate() to return old pmd valueNitin Gupta2018-04-262-6/+19
| | | | | | | | | | | | | | | | | | | [ Upstream commit a8e654f01cb725d0bfd741ebca1bf4c9337969cc ] It's required to avoid losing dirty and accessed bits. [akpm@linux-foundation.org: add a `do' to the do-while loop] Link: http://lkml.kernel.org/r/20171213105756.69879-9-kirill.shutemov@linux.intel.com Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Miller <davem@davemloft.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* asm-generic: provide generic_pmdp_establish()Kirill A. Shutemov2018-04-261-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c58f0bb77ed8bf93dfdde762b01cb67eebbdfc29 ] Patch series "Do not lose dirty bit on THP pages", v4. Vlastimil noted that pmdp_invalidate() is not atomic and we can lose dirty and access bits if CPU sets them after pmdp dereference, but before set_pmd_at(). The bug can lead to data loss, but the race window is tiny and I haven't seen any reports that suggested that it happens in reality. So I don't think it worth sending it to stable. Unfortunately, there's no way to address the issue in a generic way. We need to fix all architectures that support THP one-by-one. All architectures that have THP supported have to provide atomic pmdp_invalidate() that returns previous value. If generic implementation of pmdp_invalidate() is used, architecture needs to provide atomic pmdp_estabish(). pmdp_estabish() is not used out-side generic implementation of pmdp_invalidate() so far, but I think this can change in the future. This patch (of 12): This is an implementation of pmdp_establish() that is only suitable for an architecture that doesn't have hardware dirty/accessed bits. In this case we can't race with CPU which sets these bits and non-atomic approach is fine. Link: http://lkml.kernel.org/r/20171213105756.69879-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Daney <david.daney@cavium.com> Cc: David Miller <davem@davemloft.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Nitin Gupta <nitin.m.gupta@oracle.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm/mempolicy: add nodes_empty check in SYSC_migrate_pagesYisheng Xie2018-04-261-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 0486a38bcc4749808edbc848f1bcf232042770fc ] As in manpage of migrate_pages, the errno should be set to EINVAL when none of the node IDs specified by new_nodes are on-line and allowed by the process's current cpuset context, or none of the specified nodes contain memory. However, when test by following case: new_nodes = 0; old_nodes = 0xf; ret = migrate_pages(pid, old_nodes, new_nodes, MAX); The ret will be 0 and no errno is set. As the new_nodes is empty, we should expect EINVAL as documented. To fix the case like above, this patch check whether target nodes AND current task_nodes is empty, and then check whether AND node_states[N_MEMORY] is empty. Link: http://lkml.kernel.org/r/1510882624-44342-4-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Chris Salls <salls@cs.ucsb.edu> Cc: Christopher Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Tan Xiaojun <tanxiaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm/mempolicy: fix the check of nodemask from userYisheng Xie2018-04-261-3/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 56521e7a02b7b84a5e72691a1fb15570e6055545 ] As Xiaojun reported the ltp of migrate_pages01 will fail on arm64 system which has 4 nodes[0...3], all have memory and CONFIG_NODES_SHIFT=2: migrate_pages01 0 TINFO : test_invalid_nodes migrate_pages01 14 TFAIL : migrate_pages_common.c:45: unexpected failure - returned value = 0, expected: -1 migrate_pages01 15 TFAIL : migrate_pages_common.c:55: call succeeded unexpectedly In this case the test_invalid_nodes of migrate_pages01 will call: SYSC_migrate_pages as: migrate_pages(0, , {0x0000000000000001}, 64, , {0x0000000000000010}, 64) = 0 The new nodes specifies one or more node IDs that are greater than the maximum supported node ID, however, the errno is not set to EINVAL as expected. As man pages of set_mempolicy[1], mbind[2], and migrate_pages[3] mentioned, when nodemask specifies one or more node IDs that are greater than the maximum supported node ID, the errno should set to EINVAL. However, get_nodes only check whether the part of bits [BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES), maxnode) is zero or not, and remain [MAX_NUMNODES, BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES) unchecked. This patch is to check the bits of [MAX_NUMNODES, maxnode) in get_nodes to let migrate_pages set the errno to EINVAL when nodemask specifies one or more node IDs that are greater than the maximum supported node ID, which follows the manpage's guide. [1] http://man7.org/linux/man-pages/man2/set_mempolicy.2.html [2] http://man7.org/linux/man-pages/man2/mbind.2.html [3] http://man7.org/linux/man-pages/man2/migrate_pages.2.html Link: http://lkml.kernel.org/r/1510882624-44342-3-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Reported-by: Tan Xiaojun <tanxiaojun@huawei.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Chris Salls <salls@cs.ucsb.edu> Cc: Christopher Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ocfs2: return error when we attempt to access a dirty bh in jbd2piaojun2018-04-261-11/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit d984187e3a1ad7d12447a7ab2c43ce3717a2b5b3 ] We should not reuse the dirty bh in jbd2 directly due to the following situation: 1. When removing extent rec, we will dirty the bhs of extent rec and truncate log at the same time, and hand them over to jbd2. 2. The bhs are submitted to jbd2 area successfully. 3. The write-back thread of device help flush the bhs to disk but encounter write error due to abnormal storage link. 4. After a while the storage link become normal. Truncate log flush worker triggered by the next space reclaiming found the dirty bh of truncate log and clear its 'BH_Write_EIO' and then set it uptodate in __ocfs2_journal_access(): ocfs2_truncate_log_worker ocfs2_flush_truncate_log __ocfs2_flush_truncate_log ocfs2_replay_truncate_records ocfs2_journal_access_di __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodata. 5. Then jbd2 will flush the bh of truncate log to disk, but the bh of extent rec is still in error state, and unfortunately nobody will take care of it. 6. At last the space of extent rec was not reduced, but truncate log flush worker have given it back to globalalloc. That will cause duplicate cluster problem which could be identified by fsck.ocfs2. Sadly we can hardly revert this but set fs read-only in case of ruining atomicity and consistency of space reclaim. Link: http://lkml.kernel.org/r/5A6E8092.8090701@huawei.com Fixes: acf8fdbe6afb ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access") Signed-off-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Changwei Ge <ge.changwei@h3c.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ocfs2/acl: use 'ip_xattr_sem' to protect getting extended attributepiaojun2018-04-262-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 16c8d569f5704a84164f30ff01b29879f3438065 ] The race between *set_acl and *get_acl will cause getting incomplete xattr data as below: processA processB ocfs2_set_acl ocfs2_xattr_set __ocfs2_xattr_set_handle ocfs2_get_acl_nolock ocfs2_xattr_get_nolock: processB may get incomplete xattr data if processA hasn't set_acl done. So we should use 'ip_xattr_sem' to protect getting extended attribute in ocfs2_get_acl_nolock(), as other processes could be changing it concurrently. Link: http://lkml.kernel.org/r/5A5DDCFF.7030001@huawei.com Signed-off-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Alex Chen <alex.chen@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ocfs2: return -EROFS to mount.ocfs2 if inode block is invalidpiaojun2018-04-261-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 025bcbde3634b2c9b316f227fed13ad6ad6817fb ] If metadata is corrupted such as 'invalid inode block', we will get failed by calling 'mount()' and then set filesystem readonly as below: ocfs2_mount ocfs2_initialize_super ocfs2_init_global_system_inodes ocfs2_iget ocfs2_read_locked_inode ocfs2_validate_inode_block ocfs2_error ocfs2_handle_error ocfs2_set_ro_flag(osb, 0); // set readonly In this situation we need return -EROFS to 'mount.ocfs2', so that user can fix it by fsck. And then mount again. In addition, 'mount.ocfs2' should be updated correspondingly as it only return 1 for all errno. And I will post a patch for 'mount.ocfs2' too. Link: http://lkml.kernel.org/r/5A4302FA.2010606@huawei.com Signed-off-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Reviewed-by: Changwei Ge <ge.changwei@h3c.com> Reviewed-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* fs/dax.c: release PMD lock even when there is no PMD support in DAXJan H. Schönherr2018-04-261-1/+1
| | | | | | | | | | | | | | | | | | | | [ Upstream commit ee190ca6516bc8257e3d36187ca6f0f71a9ec477 ] follow_pte_pmd() can theoretically return after having acquired a PMD lock, even when DAX was not compiled with CONFIG_FS_DAX_PMD. Release the PMD lock unconditionally. Link: http://lkml.kernel.org/r/20180118133839.20587-1-jschoenh@amazon.de Fixes: f729c8c9b24f ("dax: wrprotect pmd_t in dax_mapping_entry_mkclean") Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when ↵Vitaly Kuznetsov2018-04-262-2/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | running nested [ Upstream commit d391f1207067268261add0485f0f34503539c5b0 ] I was investigating an issue with seabios >= 1.10 which stopped working for nested KVM on Hyper-V. The problem appears to be in handle_ept_violation() function: when we do fast mmio we need to skip the instruction so we do kvm_skip_emulated_instruction(). This, however, depends on VM_EXIT_INSTRUCTION_LEN field being set correctly in VMCS. However, this is not the case. Intel's manual doesn't mandate VM_EXIT_INSTRUCTION_LEN to be set when EPT MISCONFIG occurs. While on real hardware it was observed to be set, some hypervisors follow the spec and don't set it; we end up advancing IP with some random value. I checked with Microsoft and they confirmed they don't fill VM_EXIT_INSTRUCTION_LEN on EPT MISCONFIG. Fix the issue by doing instruction skip through emulator when running nested. Fixes: 68c3b4d1676d870f0453c31d5a52e7e65c7448ae Suggested-by: Radim Krčmář <rkrcmar@redhat.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kvm: Map PFN-type memory regions as writable (if possible)KarimAllah Ahmed2018-04-261-2/+5
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit a340b3e229b24a56f1c7f5826b15a3af0f4b13e5 ] For EPT-violations that are triggered by a read, the pages are also mapped with write permissions (if their memory region is also writable). That would avoid getting yet another fault on the same page when a write occurs. This optimization only happens when you have a "struct page" backing the memory region. So also enable it for memory regions that do not have a "struct page". Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tcp_nv: fix potential integer overflow in tcpnv_ackedGustavo A. R. Silva2018-04-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit e4823fbd229bfbba368b40cdadb8f4eeb20604cc ] Add suffix ULL to constant 80000 in order to avoid a potential integer overflow and give the compiler complete information about the proper arithmetic to use. Notice that this constant is used in a context that expects an expression of type u64. The current cast to u64 effectively applies to the whole expression as an argument of type u64 to be passed to div64_u64, but it does not prevent it from being evaluated using 32-bit arithmetic instead of 64-bit arithmetic. Also, once the expression is properly evaluated using 64-bit arithmentic, there is no need for the parentheses and the external cast to u64. Addresses-Coverity-ID: 1357588 ("Unintentional integer overflow") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* netfilter: x_tables: fix pointer leaks to userspaceDmitry Vyukov2018-04-265-2/+5
| | | | | | | | | | | | | | | | | | | [ Upstream commit 1e98ffea5a8935ec040ab72299e349cb44b8defd ] Several netfilter matches and targets put kernel pointers into info objects, but don't set usersize in descriptors. This leads to kernel pointer leaks if a match/target is set and then read back to userspace. Properly set usersize for these matches/targets. Found with manual code inspection. Fixes: ec2318904965 ("xtables: extend matches and targets with .usersize") Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/hyperv: Check for required priviliges in hyperv_init()Vitaly Kuznetsov2018-04-261-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 89a8f6d4904c8cf3ff8fee9fdaff392a6bbb8bf6 ] In hyperv_init() its presumed that it always has access to VP index and hypercall MSRs while according to the specification it should be checked if it's allowed to access the corresponding MSRs before accessing them. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: kvm@vger.kernel.org Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com> Cc: Roman Kagan <rkagan@virtuozzo.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: devel@linuxdriverproject.org Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Cathy Avery <cavery@redhat.com> Cc: Mohammed Gamal <mmorsy@redhat.com> Link: https://lkml.kernel.org/r/20180124132337.30138-2-vkuznets@redhat.com Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* gianfar: prevent integer wrapping in the rx handlerAndy Spencer2018-04-261-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 202a0a70e445caee1d0ec7aae814e64b1189fa4d ] When the frame check sequence (FCS) is split across the last two frames of a fragmented packet, part of the FCS gets counted twice, once when subtracting the FCS, and again when subtracting the previously received data. For example, if 1602 bytes are received, and the first fragment contains the first 1600 bytes (including the first two bytes of the FCS), and the second fragment contains the last two bytes of the FCS: 'skb->len == 1600' from the first fragment size = lstatus & BD_LENGTH_MASK; # 1602 size -= ETH_FCS_LEN; # 1598 size -= skb->len; # -2 Since the size is unsigned, it wraps around and causes a BUG later in the packet handling, as shown below: kernel BUG at ./include/linux/skbuff.h:2068! Oops: Exception in kernel mode, sig: 5 [#1] ... NIP [c021ec60] skb_pull+0x24/0x44 LR [c01e2fbc] gfar_clean_rx_ring+0x498/0x690 Call Trace: [df7edeb0] [c01e2c1c] gfar_clean_rx_ring+0xf8/0x690 (unreliable) [df7edf20] [c01e33a8] gfar_poll_rx_sq+0x3c/0x9c [df7edf40] [c023352c] net_rx_action+0x21c/0x274 [df7edf90] [c0329000] __do_softirq+0xd8/0x240 [df7edff0] [c000c108] call_do_irq+0x24/0x3c [c0597e90] [c00041dc] do_IRQ+0x64/0xc4 [c0597eb0] [c000d920] ret_from_except+0x0/0x18 --- interrupt: 501 at arch_cpu_idle+0x24/0x5c Change the size to a signed integer and then trim off any part of the FCS that was received prior to the last fragment. Fixes: 6c389fc931bc ("gianfar: fix size of scatter-gathered frames") Signed-off-by: Andy Spencer <aspencer@spacex.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ntb_transport: Fix bug with max_mw_size parameterLogan Gunthorpe2018-04-261-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit cbd27448faff4843ac4b66cc71445a10623ff48d ] When using the max_mw_size parameter of ntb_transport to limit the size of the Memory windows, communication cannot be established and the queues freeze. This is because the mw_size that's reported to the peer is correctly limited but the size used locally is not. So the MW is initialized with a buffer smaller than the window but the TX side is using the full window. This means the TX side will be writing to a region of the window that points nowhere. This is easily fixed by applying the same limit to tx_size in ntb_transport_init_queue(). Fixes: e26a5843f7f5 ("NTB: Split ntb_hw_intel and ntb_transport drivers") Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Acked-by: Allen Hubbe <Allen.Hubbe@dell.com> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Jon Mason <jdmason@kudzu.us> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* RDMA/mlx5: Avoid memory leak in case of XRCD dealloc failureLeon Romanovsky2018-04-261-4/+1
| | | | | | | | | | | | | | | | | | [ Upstream commit b081808a66345ba725b77ecd8d759bee874cd937 ] Failure in XRCD FW deallocation command leaves memory leaked and returns error to the user which he can't do anything about it. This patch changes behavior to always free memory and always return success to the user. Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters") Reviewed-by: Majd Dibbiny <majd@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* powerpc/numa: Ensure nodes initialized for hotplugMichael Bringmann2018-04-261-10/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit ea05ba7c559c8e5a5946c3a94a2a266e9a6680a6 ] This patch fixes some problems encountered at runtime with configurations that support memory-less nodes, or that hot-add CPUs into nodes that are memoryless during system execution after boot. The problems of interest include: * Nodes known to powerpc to be memoryless at boot, but to have CPUs in them are allowed to be 'possible' and 'online'. Memory allocations for those nodes are taken from another node that does have memory until and if memory is hot-added to the node. * Nodes which have no resources assigned at boot, but which may still be referenced subsequently by affinity or associativity attributes, are kept in the list of 'possible' nodes for powerpc. Hot-add of memory or CPUs to the system can reference these nodes and bring them online instead of redirecting the references to one of the set of nodes known to have memory at boot. Note that this software operates under the context of CPU hotplug. We are not doing memory hotplug in this code, but rather updating the kernel's CPU topology (i.e. arch_update_cpu_topology / numa_update_cpu_topology). We are initializing a node that may be used by CPUs or memory before it can be referenced as invalid by a CPU hotplug operation. CPU hotplug operations are protected by a range of APIs including cpu_maps_update_begin/cpu_maps_update_done, cpus_read/write_lock / cpus_read/write_unlock, device locks, and more. Memory hotplug operations, including try_online_node, are protected by mem_hotplug_begin/mem_hotplug_done, device locks, and more. In the case of CPUs being hot-added to a previously memoryless node, the try_online_node operation occurs wholly within the CPU locks with no overlap. Using HMC hot-add/hot-remove operations, we have been able to add and remove CPUs to any possible node without failures. HMC operations involve a degree self-serialization, though. Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* powerpc/numa: Use ibm,max-associativity-domains to discover possible nodesMichael Bringmann2018-04-261-3/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a346137e9142b039fd13af2e59696e3d40c487ef ] On powerpc systems which allow 'hot-add' of CPU or memory resources, it may occur that the new resources are to be inserted into nodes that were not used for these resources at bootup. In the kernel, any node that is used must be defined and initialized. These empty nodes may occur when, * Dedicated vs. shared resources. Shared resources require information such as the VPHN hcall for CPU assignment to nodes. Associativity decisions made based on dedicated resource rules, such as associativity properties in the device tree, may vary from decisions made using the values returned by the VPHN hcall. * memoryless nodes at boot. Nodes need to be defined as 'possible' at boot for operation with other code modules. Previously, the powerpc code would limit the set of possible nodes to those which have memory assigned at boot, and were thus online. Subsequent add/remove of CPUs or memory would only work with this subset of possible nodes. * memoryless nodes with CPUs at boot. Due to the previous restriction on nodes, nodes that had CPUs but no memory were being collapsed into other nodes that did have memory at boot. In practice this meant that the node assignment presented by the runtime kernel differed from the affinity and associativity attributes presented by the device tree or VPHN hcalls. Nodes that might be known to the pHyp were not 'possible' in the runtime kernel because they did not have memory at boot. This patch ensures that sufficient nodes are defined to support configuration requirements after boot, as well as at boot. This patch set fixes a couple of problems. * Nodes known to powerpc to be memoryless at boot, but to have CPUs in them are allowed to be 'possible' and 'online'. Memory allocations for those nodes are taken from another node that does have memory until and if memory is hot-added to the node. * Nodes which have no resources assigned at boot, but which may still be referenced subsequently by affinity or associativity attributes, are kept in the list of 'possible' nodes for powerpc. Hot-add of memory or CPUs to the system can reference these nodes and bring them online instead of redirecting to one of the set of nodes that were known to have memory at boot. This patch extracts the value of the lowest domain level (number of allocable resources) from the device tree property "ibm,max-associativity-domains" to use as the maximum number of nodes to setup as possibly available in the system. This new setting will override the instruction: nodes_and(node_possible_map, node_possible_map, node_online_map); presently seen in the function arch/powerpc/mm/numa.c:initmem_init(). If the "ibm,max-associativity-domains" property is not present at boot, no operation will be performed to define or enable additional nodes, or enable the above 'nodes_and()'. Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com> Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* samples/bpf: Partially fixes the bpf.o buildMickaël Salaün2018-04-261-1/+4
| | | | | | | | | | | | | | | | | | [ Upstream commit c25ef6a5e62fa212d298ce24995ce239f29b5f96 ] Do not build lib/bpf/bpf.o with this Makefile but use the one from the library directory. This avoid making a buggy bpf.o file (e.g. missing symbols). This patch is useful if some code (e.g. Landlock tests) needs both the bpf.o (from tools/lib/bpf) and the bpf_load.o (from samples/bpf). Signed-off-by: Mickaël Salaün <mic@digikod.net> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* i40e: fix reported mask for ntuple filtersJacob Keller2018-04-261-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 40339af33c703bacb336493157d43c86a8bf2fed ] In commit 36777d9fa24c ("i40e: check current configured input set when adding ntuple filters") some code was added to report the input set mask for a given filter when reporting it to the user. This code is necessary so that the reported filter correctly displays that it is or is not masking certain fields. Unfortunately the code was incorrect. Development error accidentally swapped the mask values for the IPv4 addresses with the L4 port numbers. The port numbers are only 16bits wide while IPv4 addresses are 32 bits. Unfortunately we assigned only 16 bits to the IPv4 address masks. Additionally we assigned 32bit value 0xFFFFFFF to the TCP port numbers. This second part does not matter as the value would be truncated to 16bits regardless, but it is unnecessary. Fix the reported masks to properly report that the entire field is masked. Fixes: 36777d9fa24c ("i40e: check current configured input set when adding ntuple filters") Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* i40e: program fragmented IPv4 filter input setJacob Keller2018-04-262-0/+13
| | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 02b4016bfe43d2d5ed043be7ffa56cda6a4d1100 ] When implementing support for IP_USER_FLOW filters, we correctly programmed a filter for both the non fragmented IPv4/Other filter, as well as the fragmented IPv4 filters. However, we did not properly program the input set for fragmented IPv4 PCTYPE. This meant that the filters would almost certainly not match, unless the user specified all of the flow types. Add support to program the fragmented IPv4 filter input set. Since we always program these filters together, we'll assume that the two input sets must match, and will thus always program the input sets to the same value. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ixgbe: don't set RXDCTL.RLPML for 82599Emil Tantilov2018-04-261-2/+6
| | | | | | | | | | | | | | | | | | | | [ Upstream commit 2bafa8fac19a31ca72ae1a3e48df35f73661dbed ] commit 2de6aa3a666e ("ixgbe: Add support for padding packet") Uses RXDCTL.RLPML to limit the maximum frame size on Rx when using build_skb. Unfortunately that register does not work on 82599. Added an explicit check to avoid setting this register on 82599 MAC. Extended the comment related to the setting of RXDCTL.RLPML to better explain its purpose. Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* jffs2: Fix use-after-free bug in jffs2_iget()'s error handling pathJake Daryll Obina2018-04-261-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 5bdd0c6f89fba430e18d636493398389dadc3b17 ] If jffs2_iget() fails for a newly-allocated inode, jffs2_do_clear_inode() can get called twice in the error handling path, the first call in jffs2_iget() itself and the second through iget_failed(). This can result to a use-after-free error in the second jffs2_do_clear_inode() call, such as shown by the oops below wherein the second jffs2_do_clear_inode() call was trying to free node fragments that were already freed in the first jffs2_do_clear_inode() call. [ 78.178860] jffs2: error: (1904) jffs2_do_read_inode_internal: CRC failed for read_inode of inode 24 at physical location 0x1fc00c [ 78.178914] Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b6b7b [ 78.185871] pgd = ffffffc03a567000 [ 78.188794] [6b6b6b6b6b6b6b7b] *pgd=0000000000000000, *pud=0000000000000000 [ 78.194968] Internal error: Oops: 96000004 [#1] PREEMPT SMP ... [ 78.513147] PC is at rb_first_postorder+0xc/0x28 [ 78.516503] LR is at jffs2_kill_fragtree+0x28/0x90 [jffs2] [ 78.520672] pc : [<ffffff8008323d28>] lr : [<ffffff8000eb1cc8>] pstate: 60000105 [ 78.526757] sp : ffffff800cea38f0 [ 78.528753] x29: ffffff800cea38f0 x28: ffffffc01f3f8e80 [ 78.532754] x27: 0000000000000000 x26: ffffff800cea3c70 [ 78.536756] x25: 00000000dc67c8ae x24: ffffffc033d6945d [ 78.540759] x23: ffffffc036811740 x22: ffffff800891a5b8 [ 78.544760] x21: 0000000000000000 x20: 0000000000000000 [ 78.548762] x19: ffffffc037d48910 x18: ffffff800891a588 [ 78.552764] x17: 0000000000000800 x16: 0000000000000c00 [ 78.556766] x15: 0000000000000010 x14: 6f2065646f6e695f [ 78.560767] x13: 6461657220726f66 x12: 2064656c69616620 [ 78.564769] x11: 435243203a6c616e x10: 7265746e695f6564 [ 78.568771] x9 : 6f6e695f64616572 x8 : ffffffc037974038 [ 78.572774] x7 : bbbbbbbbbbbbbbbb x6 : 0000000000000008 [ 78.576775] x5 : 002f91d85bd44a2f x4 : 0000000000000000 [ 78.580777] x3 : 0000000000000000 x2 : 000000403755e000 [ 78.584779] x1 : 6b6b6b6b6b6b6b6b x0 : 6b6b6b6b6b6b6b6b ... [ 79.038551] [<ffffff8008323d28>] rb_first_postorder+0xc/0x28 [ 79.042962] [<ffffff8000eb5578>] jffs2_do_clear_inode+0x88/0x100 [jffs2] [ 79.048395] [<ffffff8000eb9ddc>] jffs2_evict_inode+0x3c/0x48 [jffs2] [ 79.053443] [<ffffff8008201ca8>] evict+0xb0/0x168 [ 79.056835] [<ffffff8008202650>] iput+0x1c0/0x200 [ 79.060228] [<ffffff800820408c>] iget_failed+0x30/0x3c [ 79.064097] [<ffffff8000eba0c0>] jffs2_iget+0x2d8/0x360 [jffs2] [ 79.068740] [<ffffff8000eb0a60>] jffs2_lookup+0xe8/0x130 [jffs2] [ 79.073434] [<ffffff80081f1a28>] lookup_slow+0x118/0x190 [ 79.077435] [<ffffff80081f4708>] walk_component+0xfc/0x28c [ 79.081610] [<ffffff80081f4dd0>] path_lookupat+0x84/0x108 [ 79.085699] [<ffffff80081f5578>] filename_lookup+0x88/0x100 [ 79.089960] [<ffffff80081f572c>] user_path_at_empty+0x58/0x6c [ 79.094396] [<ffffff80081ebe14>] vfs_statx+0xa4/0x114 [ 79.098138] [<ffffff80081ec44c>] SyS_newfstatat+0x58/0x98 [ 79.102227] [<ffffff800808354c>] __sys_trace_return+0x0/0x4 [ 79.106489] Code: d65f03c0 f9400001 b40000e1 aa0103e0 (f9400821) The jffs2_do_clear_inode() call in jffs2_iget() is unnecessary since iget_failed() will eventually call jffs2_do_clear_inode() if needed, so just remove it. 
Fixes: 5451f79f5f81 ("iget: stop JFFS2 from using iget() and read_inode()") Reviewed-by: Richard Weinberger <richard@nod.at> Signed-off-by: Jake Daryll Obina <jake.obina@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* RDMA/uverbs: Use an unambiguous errno for method not supportedJason Gunthorpe2018-04-261-6/+13
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit 3624a8f02568f08aef299d3b117f2226f621177d ] Returning EOPNOTSUPP is problematic because it can also be returned by the method function, and we use it in quite a few places in drivers these days. Instead, dedicate EPROTONOSUPPORT to indicate that the ioctl framework is enabled but the requested object and method are not supported by the kernel. No other case will return this code, and it lets userspace know to fall back to write(). grep says we do not use it today in drivers/infiniband subsystem. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* crypto: artpec6 - remove select on non-existing CRYPTO_SHA384Corentin LABBE2018-04-261-1/+0
| | | | | | | | | | | | | [ Upstream commit 980b4c95e78e4113cb7b9f430f121dab1c814b6c ] Since CRYPTO_SHA384 does not exists, Kconfig should not select it. Anyway, all SHA384 stuff is in CRYPTO_SHA512 which is already selected. Fixes: a21eb94fc4d3i ("crypto: axis - add ARTPEC-6/7 crypto accelerator driver") Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* device property: Define type of PROPERTY_ENRTY_*() macrosAndy Shevchenko2018-04-261-5/+5
| | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c505cbd45f6e9c539d57dd171d95ec7e5e9f9cd0 ] Some of the drivers may use the macro at runtime flow, like struct property_entry p[10]; ... p[index++] = PROPERTY_ENTRY_U8("u8 property", u8_data); In that case and absence of the data type compiler fails the build: drivers/char/ipmi/ipmi_dmi.c:79:29: error: Expected ; at end of statement drivers/char/ipmi/ipmi_dmi.c:79:29: error: got { Acked-by: Corey Minyard <cminyard@mvista.com> Cc: Corey Minyard <minyard@acm.org> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tty: serial: exar: Relocate sleep wake-up handlingAaron Sierra2018-04-262-30/+30
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c7e1b4059075c9e8eed101d7cc5da43e95eb5e18 ] Exar sleep wake-up handling has been done on a per-channel basis by virtue of INT0 being accessible from each channel's address space. I believe this was initially done out of necessity, but now that Exar devices have their own driver, we can do things more efficiently by registering a dedicated INT0 handler at the PCI device level. I see this change providing the following benefits: 1. If more than one port is active, eliminates the redundant bus cycles for reading INT0 on every interrupt. 2. This note associated with hooking in the per-channel handler in 8250_port.c is resolved: /* Fixme: probably not the best place for this */ Cc: Matt Schulte <matts@commtech-fastcom.com> Signed-off-by: Aaron Sierra <asierra@xes-inc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/hyperv: Stop suppressing X86_FEATURE_PCIDVitaly Kuznetsov2018-04-261-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 617ab45c9a8900e64a78b43696c02598b8cad68b ] When hypercall-based TLB flush was enabled for Hyper-V guests PCID feature was deliberately suppressed as a precaution: back then PCID was never exposed to Hyper-V guests and it wasn't clear what will happen if some day it becomes available. The day came and PCID/INVPCID features are already exposed on certain Hyper-V hosts. >From TLFS (as of 5.0b) it is unclear how TLB flush hypercalls combine with PCID. In particular the usage of PCID is per-cpu based: the same mm gets different CR3 values on different CPUs. If the hypercall does exact matching this will fail. However, this is not the case. David Zhang explains: "In practice, the AddressSpace argument is ignored on any VM that supports PCIDs. Architecturally, the AddressSpace argument must match the CR3 with PCID bits stripped out (i.e., the low 12 bits of AddressSpace should be 0 in long mode). The flush hypercalls flush all PCIDs for the specified AddressSpace." With this, PCID can be enabled. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: David Zhang <dazhan@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: devel@linuxdriverproject.org Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Aditya Bhandari <adityabh@microsoft.com> Link: https://lkml.kernel.org/r/20180124103629.29980-1-vkuznets@redhat.com Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* fm10k: fix "failed to kill vid" message for VFNgai-Mint Kwan2018-04-261-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit cf315ea596ec26d7aa542a9ce354990875a920c0 ] When a VF is under PF VLAN assignment: ip link set <pf> vf <#> vlan <vid> This will remove all previous entries in the VLAN table including those generated by VLAN interfaces created on the VF. The issue arises when the VF is under PF VLAN assignment and one or more of these VLAN interfaces of the VF are deleted. When deleting these VLAN interfaces, the following message will be generated in "dmesg": failed to kill vid 0081/<vid> for device <vf> This is due to the fact that "ndo_vlan_rx_kill_vid" exits with an error. The handler for this ndo is "fm10k_update_vid". Any calls to this function while under PF VLAN management will exit prematurely and, thus, it will generate the failure message. Additionally, since "fm10k_update_vid" exits prematurely, none of the VLAN update is performed. So, even though the actual VLAN interfaces of the VF will be deleted, the active_vlans bitmask is not cleared. When the VF is no longer under PF VLAN assignment, the driver mistakenly restores the previous entries of the VLAN table based on an unsynchronized list of active VLANs. The solution to this issue involves checking the VLAN update action type before exiting "fm10k_update_vid". If the VLAN update action type is to "add", this action will not be permitted while the VF is under PF VLAN assignment and the VLAN update is abandoned like before. However, if the VLAN update action type is to "kill", then we need to also clear the active_vlans bitmask. However, we don't need to actually queue any messages to the PF, because the MAC and VLAN tables have already been cleared, and the PF would silently ignore these requests anyways. Signed-off-by: Ngai-Mint Kwan <ngai-mint.kwan@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* igb: Clear TXSTMP when ptp_tx_work() is timeoutDaniel Hua2018-04-261-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 3a53285228165225a7f76c7d5ff1ddc0213ce0e4 ] Problem description: After ethernet cable connect and disconnect for several iterations on a device with i210, tx timestamp will stop being put into the socket. Steps to reproduce: 1. Setup a device with i210 and wire it to an 802.1AS capable switch (Extreme Networks Summit x440 is used in our case) 2. Have the gptp daemon running on the device and make sure it is synced with the switch 3. Have the switch disable and enable the port, wait until the device gets resynced with the switch 4. Iterate step 3 until the device is not able to get resynced 5. Review the log in dmesg and you will see the warning message "igb : clearing Tx timestamp hang" Root cause: If ptp_tx_work() gets scheduled just before the port gets disabled, a LINK DOWN event will be processed before ptp_tx_work(), which may cause a timeout in ptp_tx_work(). In the timeout logic, the TSYNCTXCTL's TXTT bit (Transmit timestamp valid bit) is not cleared, so no new timestamp is loaded into the TXSTMP register. Consequently, no new interrupt is triggered by the TSICR.TXTS bit and no more Tx timestamps are sent to the socket. Signed-off-by: Daniel Hua <daniel.hua@ni.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
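A hedged sketch of the timeout branch in igb_ptp_tx_work(): it assumes, per the i210 behaviour described above, that reading the Tx timestamp registers is what clears the TXTT valid bit so a new timestamp can be latched; the constant and field names follow the usual igb conventions but should be treated as illustrative:

    struct e1000_hw *hw = &adapter->hw;

    if (time_is_before_jiffies(adapter->ptp_tx_start + IGB_PTP_TX_TIMEOUT)) {
            /* Drop the stale skb and clear the in-progress state */
            dev_kfree_skb_any(adapter->ptp_tx_skb);
            adapter->ptp_tx_skb = NULL;
            clear_bit_unlock(__IGB_PTP_TX_IN_PROGRESS, &adapter->state);
            /* Reading TXSTMPH is assumed to clear TSYNCTXCTL.TXTT, so the
             * hardware can latch the next timestamp and raise TSICR.TXTS.
             */
            rd32(E1000_TXSTMPH);
            adapter->tx_hwtstamp_timeouts++;
            dev_warn(&adapter->pdev->dev, "clearing Tx timestamp hang\n");
            return;
    }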
* igb: Allow to remove administratively set MAC on VFsCorinna Vinschen2018-04-261-11/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 177132df5e45b134c147f419f567a3b56aafaf2b ] Before libvirt modifies the MAC address and vlan tag for an SRIOV VF for use by a virtual machine (either using vfio device assignment or macvtap passthru mode), it saves the current MAC address and vlan tag so that it can reset them to their original value when the guest is done. Libvirt can't leave the VF MAC set to the value used by the now-defunct guest since it may be started again later using a different VF, but it certainly shouldn't just pick any random value, either. So it saves the state of everything prior to using the VF, and resets it to that. The igb driver initializes the MAC addresses of all VFs to 00:00:00:00:00:00, and reports that when asked (via an RTM_GETLINK netlink message, also visible in the list of VFs in the output of "ip link show"). But when libvirt attempts to restore the MAC address back to 00:00:00:00:00:00 (using an RTM_SETLINK netlink message) the kernel responds with "Invalid argument". Forbidding a reset back to the original value leaves the VF MAC at the value set for the now-defunct virtual machine. Especially on a system with NetworkManager enabled, this has very bad consequences, since NetworkManager forces all interfaces to be IFF_UP all the time - if the same virtual machine is restarted using a different VF (or even on a different host), there will be multiple interfaces watching for traffic with the same MAC address. To allow libvirt to revert to the original state, we need a way to remove the administratively set MAC on a VF, to allow normal host operation again, and to reset/overwrite the VF MAC via the VF netdev. This patch implements the outlined scenario by allowing the VF MAC to be set to 00:00:00:00:00:00 via RTM_SETLINK on the PF. igb_ndo_set_vf_mac resets the IGB_VF_FLAG_PF_SET_MAC flag to 0, so it's possible to reset the VF MAC back to the original value via the VF netdev. Note: Recent patches to libvirt allow for a workaround if the NIC isn't capable of resetting the administrative MAC back to all 0, but in theory the NIC should allow resetting the MAC in the first place. Signed-off-by: Corinna Vinschen <vinschen@redhat.com> Tested-by: Aaron Brown <arron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
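The shape of the change in igb_ndo_set_vf_mac(), sketched with the flag name taken from the text above; the surrounding validation and the actual MAC filter programming are omitted:

    if (is_zero_ether_addr(mac)) {
            /* Remove the administratively set MAC: clear the flag so the
             * VF may program its own address again via the VF netdev.
             */
            adapter->vf_data[vf].flags &= ~IGB_VF_FLAG_PF_SET_MAC;
            return 0;                      /* sketch: also reset the MAC filter here */
    }

    if (!is_valid_ether_addr(mac))
            return -EINVAL;                /* unchanged rejection path */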
* ASoC: rockchip: Use dummy_dai for rt5514 dsp dailinkJeffy Chen2018-04-261-3/+16
| | | | | | | | | | | | | | | | | [ Upstream commit fde7f9dbc71365230eeb8c8ea97ce9b552c8e5bd ] The rt5514 dsp captures pcm data through spi directly, so we should not use rockchip-i2s as its cpu dai the way other codecs do. Use dummy_dai for the rt5514 dsp dailink to make voice wakeup work again. Reported-by: Jimmy Cheng-Yi Chiang <cychiang@google.com> Fixes: (72cfb0f20c75 ASoC: rockchip: Use codec of_node and dai_name for rt5514 dsp) Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com> Tested-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
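Conceptually the dsp dai_link ends up bound to the standard ASoC dummy components rather than the rockchip-i2s CPU DAI; a rough sketch, where the link and stream names are invented and only the dummy component names are the usual ASoC ones:

    static struct snd_soc_dai_link rt5514_dsp_dai_link = {
            .name           = "RT5514 DSP",          /* illustrative */
            .stream_name    = "Wake on Voice",       /* illustrative */
            .cpu_dai_name   = "snd-soc-dummy-dai",   /* not rockchip-i2s */
            .platform_name  = "snd-soc-dummy",
            /* codec_of_node / codec_dai_name still come from the DT */
    };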
* blk-mq-debugfs: don't allow write on attributes with seq_operations setEryu Guan2018-04-261-1/+5
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit 6b136a24b05c81a24e0b648a4bd938bcd0c4f69e ] Attributes that only implement .seq_ops are read-only; any write to them should be rejected. But currently the kernel would crash when writing to such debugfs entries, e.g. chmod +w /sys/kernel/debug/block/<dev>/requeue_list echo 0 > /sys/kernel/debug/block/<dev>/requeue_list chmod -w /sys/kernel/debug/block/<dev>/requeue_list Fix it by returning -EPERM in blk_mq_debugfs_write() when writing to such attributes. Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
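The fix is essentially a NULL-callback guard; roughly, as a sketch of blk_mq_debugfs_write() rather than a verbatim copy:

    static ssize_t blk_mq_debugfs_write(struct file *file, const char __user *buf,
                                        size_t count, loff_t *ppos)
    {
            struct seq_file *m = file->private_data;
            const struct blk_mq_debugfs_attr *attr = m->private;
            void *data = d_inode(file->f_path.dentry->d_parent)->i_private;

            /* Attributes that only provide .seq_ops have no .write handler;
             * refuse the write instead of crashing on a NULL callback.
             */
            if (!attr->write)
                    return -EPERM;

            return attr->write(data, buf, count, ppos);
    }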
* KVM: s390: vsie: use READ_ONCE to access some SCB fieldsDavid Hildenbrand2018-04-261-19/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit b3ecd4aa8632a86428605ab73393d14779019d82 ] Another VCPU might try to modify the SCB while we are creating the shadow SCB. In general this is no problem - unless the compiler decides to not load values once, but e.g. twice. For us, this is only relevant when checking/working with such values. E.g. the prefix value, the mso, state of transactional execution and addresses of satellite blocks. E.g. if we blindly forward values (e.g. general purpose registers or execution controls after masking), we don't care. Leaving unpin_blocks() untouched for now, will handle it separately. The worst thing right now that I can see would be a missed prefix un/remap (mso, prefix, tx) or using wrong guest addresses. Nothing critical, but let's try to avoid unpredictable behavior. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20180116171526.12343-2-david@redhat.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
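The pattern being applied is the usual READ_ONCE() check-then-use discipline; a generic sketch, where the field and helper names are placeholders rather than the real SCB layout:

    /* Load the guest-writable field exactly once so the compiler cannot
     * re-read it between the validity check and the use.
     */
    u32 prefix = READ_ONCE(scb_o->prefix);      /* placeholder field name */

    if (prefix_is_valid(vcpu, prefix))          /* check ...              */
            scb_s->prefix = prefix;             /* ... and use the same value */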
* platform/x86: thinkpad_acpi: suppress warning about palm detectionDavid Herrmann2018-04-261-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 587d8628fb71c3bfae29fb2bbe84c1478c59bac8 ] This patch prevents the thinkpad_acpi driver from warning about 2 event codes returned for keyboard palm-detection. No behavioral changes, other than suppressing the warning in the kernel log. The events are still forwarded via acpi-netlink channels. We could, optionally, decide to forward the event through a input-switch on the tpacpi input device. However, so far no suitable input-code exists, and no similar drivers report such events. Hence, leave it an acpi event for now. Note that the event-codes are named based on empirical studies. On the ThinkPad X1 5th Gen the sensor can be found underneath the arrow key. Cc: Matthew Thode <mthode@mthode.org> Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* i40evf: ignore link up if not runningAlan Brady2018-04-261-12/+23
| | | | | | | | | | | | | | | [ Upstream commit e0346f9fcb6c636d2f870e6666de8781413f34ea ] If we receive the link status message from PF with link up before queues are actually enabled, it will trigger a TX hang. This fixes the issue by ignoring a link up message if the VF state is not yet in RUNNING state. Signed-off-by: Alan Brady <alan.brady@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
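The core of the fix is a state check before honouring the link-up notification; a simplified sketch, where the RUNNING state name is written as commonly used in i40evf but treated here as an assumption and the placeholder variable stands for the status carried in the PF message:

    /* PF -> VF link event handler (sketch) */
    bool link_up = link_event_says_up;          /* from the PF message */

    if (link_up && adapter->state != __I40EVF_RUNNING)
            return;                             /* queues not enabled yet */

    adapter->link_up = link_up;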
* i40evf: Don't schedule reset_task when device is being removedAvinash Dayanand2018-04-262-1/+9
| | | | | | | | | | | | | | | | | | [ Upstream commit 06aa040f039404a0039a5158cd12f41187487a1f ] When a host disables and enables a PF device, all the associated VFs are removed and added back in. It also generates a PFR which in turn resets all the connected VFs. This behaviour is different from that of a Linux guest on a Linux host. Hence we end up in a situation where there's a PFR and device removal at the same time. And the watchdog doesn't have a clue about this and schedules a reset_task. This patch adds code to send a signal to reset_task that the device is currently being removed. Signed-off-by: Avinash Dayanand <avinash.dayanand@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
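Conceptually the signal is a "being removed" bit that the scheduling path tests before queueing reset_task; a sketch with an assumed bit name:

    /* i40evf_remove(), early on (sketch) */
    set_bit(__I40EVF_IN_REMOVE_TASK, &adapter->crit_section);  /* assumed name */

    /* watchdog / event handling path (sketch) */
    if (!test_bit(__I40EVF_IN_REMOVE_TASK, &adapter->crit_section))
            schedule_work(&adapter->reset_task);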
* bpf: test_maps: cleanup sockmaps when test endsPrashant Bhole2018-04-261-4/+12
| | | | | | | | | | | | | | | | | | [ Upstream commit 783687810e986a15ffbf86c516a1a48ff37f38f7 ] Bug: BPF programs and maps related to the sockmaps test remain in memory even after test_maps ends. This patch fixes it as a short term workaround (the sockmap kernel side needs real fixing) by emptying sockmaps when the test ends. Fixes: 6f6d33f3b3d0f ("bpf: selftests add sockmap tests") Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> [ daniel: Note on workaround. ] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
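The workaround amounts to deleting every entry from the sockmaps before the test exits, e.g. with a helper along these lines; a sketch that assumes the integer keys used elsewhere in test_maps:

    #include <bpf/bpf.h>

    /* Empty a sockmap so its programs and maps can be released when
     * test_maps exits (short-term workaround, see note above).
     */
    static void empty_sockmap(int map_fd, int nr_entries)
    {
            int i;

            for (i = 0; i < nr_entries; i++)
                    bpf_map_delete_elem(map_fd, &i);   /* -ENOENT is fine */
    }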
* block: Set BIO_TRACE_COMPLETION on new bio during splitGoldwyn Rodrigues2018-04-261-1/+1
| | | | | | | | | | | | | [ Upstream commit 20d59023c5ec4426284af492808bcea1f39787ef ] We inadvertently set it again on the source bio, but we need to set it on the new split bio instead. Fixes: fbbaf700e7b1 ("block: trace completion of all bios.") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
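In the bio split path the one-line change looks roughly like this (sketch):

    /* Propagate completion tracing to the bio that will actually be
     * completed: the new split bio, not the source it was carved from.
     */
    if (bio_flagged(bio, BIO_TRACE_COMPLETION))
            bio_set_flag(split, BIO_TRACE_COMPLETION);   /* was set on bio */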
* nfp: fix error return code in nfp_pci_probe()Wei Yongjun2018-04-261-0/+1
| | | | | | | | | | | | | | [ Upstream commit e58decc9c51eb61697aba35ba8eda33f4b80552d ] Fix to return error code -EINVAL instead of 0 when num_vfs is above limit_vfs, as done elsewhere in this function. Fixes: 0dc786219186 ("nfp: handle SR-IOV already enabled when driver is probing") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
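The shape of this kind of one-line fix in the probe error path, sketched with the field names mentioned above; the error label is illustrative:

    if (pf->num_vfs > pf->limit_vfs) {
            err = -EINVAL;          /* previously fell through with err == 0 */
            goto err_fw_unload;     /* illustrative error label */
    }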
* HID: roccat: prevent an out of bounds read in kovaplus_profile_activated()Dan Carpenter2018-04-261-0/+2
| | | | | | | | | | | | | | | | [ Upstream commit 7ad81482cad67cbe1ec808490d1ddfc420c42008 ] We get the "new_profile_index" value from the mouse device when we're handling raw events. Smatch taints it as untrusted data and complains that we need a bounds check. This seems like a reasonable warning; otherwise there is a small read beyond the end of the array. Fixes: 0e70f97f257e ("HID: roccat: Add support for Kova[+] mouse") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Silvan Jegen <s.jegen@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
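The added guard is a plain bounds check on the device-supplied index before it is used to index the profile array; a trimmed sketch of kovaplus_profile_activated():

    static void kovaplus_profile_activated(struct kovaplus_device *kovaplus,
                                           uint new_profile_index)
    {
            /* new_profile_index comes straight from the mouse; reject
             * out-of-range values before indexing profile_settings[].
             */
            if (new_profile_index >= ARRAY_SIZE(kovaplus->profile_settings))
                    return;

            kovaplus->actual_profile = new_profile_index;
            /* remaining per-profile bookkeeping omitted in this sketch */
    }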
* Input: stmfts - set IRQ_NOAUTOEN to the irq flagAndi Shyti2018-04-261-3/+8
| | | | | | | | | | | | | | | | | | | | | | [ Upstream commit cba04cdf437d745fac85220d1d692a9ae23d7004 ] The interrupt is requested before the device is powered on, and in some cases its value is not reliable. On some devices an interrupt is generated as soon as it is requested, before there is a chance to disable the irq. Set the IRQ_NOAUTOEN flag before requesting the irq. This patch mutes the error: stmfts 2-0049: failed to read events: -11 received sometimes during boot time. Signed-off-by: Andi Shyti <andi.shyti@samsung.com> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
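The probe-time pattern this describes, sketched; the handler and device names are assumptions and the power-up step is elided:

    /* Keep the interrupt line disabled until the device is powered up,
     * since its value is not reliable before that point.
     */
    irq_set_status_flags(client->irq, IRQ_NOAUTOEN);

    err = devm_request_threaded_irq(&client->dev, client->irq,
                                    NULL, stmfts_irq_handler,
                                    IRQF_ONESHOT, "stmfts_irq", sdata);
    if (err)
            return err;

    /* ... power on and configure the device ... */

    enable_irq(client->irq);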
* scsi: fas216: fix sense buffer initializationArnd Bergmann2018-04-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 96d5eaa9bb74d299508d811d865c2c41b38b0301 ] While testing with the ARM specific memset() macro removed, I ran into a compiler warning that shows an old bug: drivers/scsi/arm/fas216.c: In function 'fas216_rq_sns_done': drivers/scsi/arm/fas216.c:2014:40: error: argument to 'sizeof' in 'memset' call is the same expression as the destination; did you mean to provide an explicit length? [-Werror=sizeof-pointer-memaccess] It turns out that the definition of the scsi_cmd structure changed back in linux-2.6.25, so now we clear only four bytes (sizeof(pointer)) instead of 96 (SCSI_SENSE_BUFFERSIZE). I did not check whether we actually need to initialize the buffer here, but it's clear that if we do it, we should use the correct size. Fixes: de25deb18016 ("[SCSI] use dynamically allocated sense buffer") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
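For clarity, the before/after of the one-liner; the buffer pointer name is as recalled from the driver and may differ slightly:

    /* Before: sense_buffer became a pointer in 2.6.25, so this clears only
     * sizeof(void *) bytes -- 4 or 8 -- instead of the whole buffer.
     */
    memset(SCpnt->sense_buffer, 0, sizeof(SCpnt->sense_buffer));

    /* After: clear the full 96-byte sense buffer. */
    memset(SCpnt->sense_buffer, 0, SCSI_SENSE_BUFFERSIZE);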
* scsi: devinfo: fix format of the device listXose Vazquez Perez2018-04-261-4/+3
| | | | | | | | | | | | | | | | [ Upstream commit 3f884a0a8bdf28cfd1e9987d54d83350096cdd46 ] Replace "" with NULL for product revision level, and merge TEXEL duplicate entries. Cc: Hannes Reinecke <hare@suse.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com> Cc: SCSI ML <linux-scsi@vger.kernel.org> Signed-off-by: Xose Vazquez Perez <xose.vazquez@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* f2fs: avoid hungtask when GC encrypted block if io_bits is setSheng Yong2018-04-261-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit a9d572c7550044d5b217b5287d99a2e6d34b97b0 ] When io_bits is set, GCing encrypted block may hit the following hungtask. Since io_bits requires aligned block address, f2fs_submit_page_write may return -EAGAIN if new_blkaddr does not satisify io_bits alignment. As a result, the encrypted page will never be writtenback. This patch makes move_data_block aware the EAGAIN error and cancel the writeback. [ 246.751371] INFO: task kworker/u4:4:797 blocked for more than 90 seconds. [ 246.752423] Not tainted 4.15.0-rc4+ #11 [ 246.754176] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 246.755336] kworker/u4:4 D25448 797 2 0x80000000 [ 246.755597] Workqueue: writeback wb_workfn (flush-7:0) [ 246.755616] Call Trace: [ 246.755695] ? __schedule+0x322/0xa90 [ 246.755761] ? blk_init_request_from_bio+0x120/0x120 [ 246.755773] ? pci_mmcfg_check_reserved+0xb0/0xb0 [ 246.755801] ? __radix_tree_create+0x19e/0x200 [ 246.755813] ? delete_node+0x136/0x370 [ 246.755838] schedule+0x43/0xc0 [ 246.755904] io_schedule+0x17/0x40 [ 246.755939] wait_on_page_bit_common+0x17b/0x240 [ 246.755950] ? wake_page_function+0xa0/0xa0 [ 246.755961] ? add_to_page_cache_lru+0x160/0x160 [ 246.755972] ? page_cache_tree_insert+0x170/0x170 [ 246.755983] ? __lru_cache_add+0x96/0xb0 [ 246.756086] __filemap_fdatawait_range+0x14f/0x1c0 [ 246.756097] ? wait_on_page_bit_common+0x240/0x240 [ 246.756120] ? __wake_up_locked_key_bookmark+0x20/0x20 [ 246.756167] ? wait_on_all_pages_writeback+0xc9/0x100 [ 246.756179] ? __remove_ino_entry+0x120/0x120 [ 246.756192] ? wait_woken+0x100/0x100 [ 246.756204] filemap_fdatawait_range+0x9/0x20 [ 246.756216] write_checkpoint+0x18a1/0x1f00 [ 246.756254] ? blk_get_request+0x10/0x10 [ 246.756265] ? cpumask_next_and+0x43/0x60 [ 246.756279] ? f2fs_sync_inode_meta+0x160/0x160 [ 246.756289] ? remove_element.isra.4+0xa0/0xa0 [ 246.756300] ? __put_compound_page+0x40/0x40 [ 246.756310] ? f2fs_sync_fs+0xec/0x1c0 [ 246.756320] ? f2fs_sync_fs+0x120/0x1c0 [ 246.756329] f2fs_sync_fs+0x120/0x1c0 [ 246.756357] ? trace_event_raw_event_f2fs__page+0x260/0x260 [ 246.756393] ? ata_build_rw_tf+0x173/0x410 [ 246.756397] f2fs_balance_fs_bg+0x198/0x390 [ 246.756405] ? drop_inmem_page+0x230/0x230 [ 246.756415] ? ahci_qc_prep+0x1bb/0x2e0 [ 246.756418] ? ahci_qc_issue+0x1df/0x290 [ 246.756422] ? __accumulate_pelt_segments+0x42/0xd0 [ 246.756426] ? f2fs_write_node_pages+0xd1/0x380 [ 246.756429] f2fs_write_node_pages+0xd1/0x380 [ 246.756437] ? sync_node_pages+0x8f0/0x8f0 [ 246.756440] ? update_curr+0x53/0x220 [ 246.756444] ? __accumulate_pelt_segments+0xa2/0xd0 [ 246.756448] ? __update_load_avg_se.isra.39+0x349/0x360 [ 246.756452] ? do_writepages+0x2a/0xa0 [ 246.756456] do_writepages+0x2a/0xa0 [ 246.756460] __writeback_single_inode+0x70/0x490 [ 246.756463] ? check_preempt_wakeup+0x199/0x310 [ 246.756467] writeback_sb_inodes+0x2a2/0x660 [ 246.756471] ? is_empty_dir_inode+0x40/0x40 [ 246.756474] ? __writeback_single_inode+0x490/0x490 [ 246.756477] ? string+0xbf/0xf0 [ 246.756480] ? down_read_trylock+0x35/0x60 [ 246.756484] __writeback_inodes_wb+0x9f/0xf0 [ 246.756488] wb_writeback+0x41d/0x4b0 [ 246.756492] ? writeback_inodes_wb.constprop.55+0x150/0x150 [ 246.756498] ? set_worker_desc+0xf7/0x130 [ 246.756502] ? current_is_workqueue_rescuer+0x60/0x60 [ 246.756511] ? _find_next_bit+0x2c/0xa0 [ 246.756514] ? 
wb_workfn+0x400/0x5d0 [ 246.756518] wb_workfn+0x400/0x5d0 [ 246.756521] ? finish_task_switch+0xdf/0x2a0 [ 246.756525] ? inode_wait_for_writeback+0x30/0x30 [ 246.756529] process_one_work+0x3a7/0x6f0 [ 246.756533] worker_thread+0x82/0x750 [ 246.756537] kthread+0x16f/0x1c0 [ 246.756541] ? trace_event_raw_event_workqueue_work+0x110/0x110 [ 246.756544] ? kthread_create_worker_on_cpu+0xb0/0xb0 [ 246.756548] ret_from_fork+0x1f/0x30 Signed-off-by: Sheng Yong <shengyong1@huawei.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
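What "cancel the writeback" amounts to in move_data_block(), as a hedged sketch; only f2fs_submit_page_write() and the -EAGAIN return are taken from the text above, while the page field and label names are assumptions:

    err = f2fs_submit_page_write(&fio);
    if (err == -EAGAIN) {
            /* io_bits alignment could not be satisfied: give up on moving
             * this block now instead of leaving the encrypted page stuck
             * under writeback forever (the hang shown above).
             */
            err = 0;                                   /* retry in a later GC round */
            end_page_writeback(fio.encrypted_page);    /* field name assumed */
            goto put_page_out;                         /* label is illustrative */
    }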
* RDMA/cma: Check existence of netdevice during port validationParav Pandit2018-04-261-3/+5
| | | | | | | | | | | | | | | | | | [ Upstream commit 00db63c128dd3daf38f481371976c24d32678142 ] If a valid netdevice is not found for RoCE, the GID table should not be searched with a NULL netdevice. Doing so causes the search routines to ignore the netdev argument and may match the wrong GID table entry if the netdev is deleted. Fixes: abae1b71dd37 ("IB/cma: cma_validate_port should verify the port and netdevice") Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
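The essence of the fix in cma_validate_port(), sketched; the surrounding GID-table lookup is omitted and the exact error value is an assumption:

    /* RoCE needs the bound netdevice for the GID lookup; bail out rather
     * than searching the table with ndev == NULL, which would ignore the
     * netdev argument and could match the wrong GID entry.
     */
    ndev = dev_get_by_index(&init_net, bound_if_index);
    if (!ndev)
            return -ENODEV;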