summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* mm: filemap: update find_get_pages_tag() to deal with shadow entriesJohannes Weiner2014-05-063-37/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dave Jones reports the following crash when find_get_pages_tag() runs into an exceptional entry: kernel BUG at mm/filemap.c:1347! RIP: find_get_pages_tag+0x1cb/0x220 Call Trace: find_get_pages_tag+0x36/0x220 pagevec_lookup_tag+0x21/0x30 filemap_fdatawait_range+0xbe/0x1e0 filemap_fdatawait+0x27/0x30 sync_inodes_sb+0x204/0x2a0 sync_inodes_one_sb+0x19/0x20 iterate_supers+0xb2/0x110 sys_sync+0x44/0xb0 ia32_do_call+0x13/0x13 1343 /* 1344 * This function is never used on a shmem/tmpfs 1345 * mapping, so a swap entry won't be found here. 1346 */ 1347 BUG(); After commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees") this comment and BUG() are out of date because exceptional entries can now appear in all mappings - as shadows of recently evicted pages. However, as Hugh Dickins notes, "it is truly surprising for a PAGECACHE_TAG_WRITEBACK (and probably any other PAGECACHE_TAG_*) to appear on an exceptional entry. I expect it comes down to an occasional race in RCU lookup of the radix_tree: lacking absolute synchronization, we might sometimes catch an exceptional entry, with the tag which really belongs with the unexceptional entry which was there an instant before." And indeed, not only is the tree walk lockless, the tags are also read in chunks, one radix tree node at a time. There is plenty of time for page reclaim to swoop in and replace a page that was already looked up as tagged with a shadow entry. Remove the BUG() and update the comment. While reviewing all other lookup sites for whether they properly deal with shadow entries of evicted pages, update all the comments and fix memcg file charge moving to not miss shmem/tmpfs swapcache pages. Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Dave Jones <davej@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/compaction: make isolate_freepages start at pageblock boundaryVlastimil Babka2014-05-061-10/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The compaction freepage scanner implementation in isolate_freepages() starts by taking the current cc->free_pfn value as the first pfn. In a for loop, it scans from this first pfn to the end of the pageblock, and then subtracts pageblock_nr_pages from the first pfn to obtain the first pfn for the next for loop iteration. This means that when cc->free_pfn starts at offset X rather than being aligned on pageblock boundary, the scanner will start at offset X in all scanned pageblock, ignoring potentially many free pages. Currently this can happen when a) zone's end pfn is not pageblock aligned, or b) through zone->compact_cached_free_pfn with CONFIG_HOLES_IN_ZONE enabled and a hole spanning the beginning of a pageblock This patch fixes the problem by aligning the initial pfn in isolate_freepages() to pageblock boundary. This also permits replacing the end-of-pageblock alignment within the for loop with a simple pageblock_nr_pages increment. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Heesub Shin <heesub.shin@samsung.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Dongjun Shin <d.j.shin@samsung.com> Cc: Sunghwan Yun <sunghwan.yun@samsung.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* MAINTAINERS: zswap/zbud: change maintainer email addressSeth Jennings2014-05-061-2/+2
| | | | | | | | sjenning@linux.vnet.ibm.com is no longer a viable entity. Signed-off-by: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page-writeback.c: fix divide by zero in pos_ratio_polynomRik van Riel2014-05-061-3/+3
| | | | | | | | | | | | | | | | | | | | | | It is possible for "limit - setpoint + 1" to equal zero, after getting truncated to a 32 bit variable, and resulting in a divide by zero error. Using the fully 64 bit divide functions avoids this problem. It also will cause pos_ratio_polynom() to return the correct value when (setpoint - limit) exceeds 2^32. Also uninline pos_ratio_polynom, at Andrew's request. Signed-off-by: Rik van Riel <riel@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: ensure hugepage access is denied if hugepages are not supportedNishanth Aravamudan2014-05-063-5/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, I am seeing the following when I `mount -t hugetlbfs /none /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's related to the fact that hugetlbfs is properly not correctly setting itself up in this state?: Unable to handle kernel paging request for data at address 0x00000031 Faulting instruction address: 0xc000000000245710 Oops: Kernel access of bad area, sig: 11 [#1] SMP NR_CPUS=2048 NUMA pSeries .... In KVM guests on Power, in a guest not backed by hugepages, we see the following: AnonHugePages: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 64 kB HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages are not supported at boot-time, but this is only checked in hugetlb_init(). Extract the check to a helper function, and use it in a few relevant places. This does make hugetlbfs not supported (not registered at all) in this environment. I believe this is fine, as there are no valid hugepages and that won't change at runtime. [akpm@linux-foundation.org: use pr_info(), per Mel] [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined] Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* slub: fix memcg_propagate_slab_attrsVladimir Davydov2014-05-061-4/+7
| | | | | | | | | | | | | | | After creating a cache for a memcg we should initialize its sysfs attrs with the values from its parent. That's what memcg_propagate_slab_attrs is for. Currently it's broken - we clearly muddled root-vs-memcg caches there. Let's fix it up. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* drivers/rtc/rtc-pcf8523.c: fix month definitionChris Cui2014-05-061-2/+2
| | | | | | | | | | PCF8523 uses 1-12 to represent month according to datasheet. link: www.nxp.com/documents/data_sheet/PCF8523.pdf. Signed-off-by: Chris Cui <chris.wei.cui@gmail.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of ↵Linus Torvalds2014-05-066-88/+192
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "This adds ctime update in the new cached writeback mode and also fixes/simplifies the mtime update handling. Support for rename flags (aka renameat2) is also added to the userspace API" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: add renameat2 support fuse: clear MS_I_VERSION fuse: clear FUSE_I_CTIME_DIRTY flag on setattr fuse: trust kernel i_ctime only fuse: remove .update_time fuse: allow ctime flushing to userspace fuse: fuse: add time_gran to INIT_OUT fuse: add .write_inode fuse: clean up fsync fuse: fuse: fallocate: use file_update_time() fuse: update mtime on open(O_TRUNC) in atomic_o_trunc mode fuse: update mtime on truncate(2) fuse: do not use uninitialized i_mode fuse: fix mtime update error in fsync fuse: check fallocate mode fuse: add __exit to fuse_ctl_cleanup
| * fuse: add renameat2 supportMiklos Szeredi2014-04-283-8/+58
| | | | | | | | | | | | Support RENAME_EXCHANGE and RENAME_NOREPLACE flags on the userspace ABI. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * fuse: clear MS_I_VERSIONMiklos Szeredi2014-04-281-1/+1
| | | | | | | | | | | | Fuse doesn't support i_version (yet). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * fuse: clear FUSE_I_CTIME_DIRTY flag on setattrMaxim Patlasov2014-04-281-9/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The patch addresses two use-cases when the flag may be safely cleared: 1. fuse_do_setattr() is called with ATTR_CTIME flag set in attr->ia_valid. In this case attr->ia_ctime bears actual value. In-kernel fuse must send it to the userspace server and then assign the value to inode->i_ctime. 2. fuse_do_setattr() is called with ATTR_SIZE flag set in attr->ia_valid, whereas ATTR_CTIME is not set (truncate(2)). In this case in-kernel fuse must sent "now" to the userspace server and then assign the value to inode->i_ctime. In both cases we could clear I_DIRTY_SYNC, but that needs more thought. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * fuse: trust kernel i_ctime onlyMaxim Patlasov2014-04-282-4/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | Let the kernel maintain i_ctime locally: update i_ctime explicitly on truncate, fallocate, open(O_TRUNC), setxattr, removexattr, link, rename, unlink. The inode flag I_DIRTY_SYNC serves as indication that local i_ctime should be flushed to the server eventually. The patch sets the flag and updates i_ctime in course of operations listed above. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * fuse: remove .update_timeMiklos Szeredi2014-04-281-12/+0
| | | | | | | | | | | | This implements updating ctime as well as mtime on file_update_time(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * fuse: allow ctime flushing to userspaceMaxim Patlasov2014-04-284-6/+14
| | | | | | | | | | | | | | | | | | The patch extends fuse_setattr_in, and extends the flush procedure (fuse_flush_times()) called on ->write_inode() to send the ctime as well as mtime. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * fuse: fuse: add time_gran to INIT_OUTMiklos Szeredi2014-04-282-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow userspace fs to specify time granularity. This is needed because with writeback_cache mode the kernel is responsible for generating mtime and ctime, but if the underlying filesystem doesn't support nanosecond granularity then the cache will contain a different value from the one stored on the filesystem resulting in a change of times after a cache flush. Make the default granularity 1s. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * fuse: add .write_inodeMiklos Szeredi2014-04-284-33/+45
| | | | | | | | | | | | | | | | ...and flush mtime from this. This allows us to use the kernel infrastructure for writing out dirty metadata (mtime at this point, but ctime in the next patches and also maybe atime). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * fuse: clean up fsyncMiklos Szeredi2014-04-281-8/+3
| | | | | | | | | | | | | | | | | | Don't need to start I/O twice (once without i_mutex and one within). Also make sure that even if the userspace filesystem doesn't support FSYNC we do all the steps other than sending the message. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * fuse: fuse: fallocate: use file_update_time()Miklos Szeredi2014-04-281-6/+2
| | | | | | | | | | | | in preparation for getting rid of FUSE_I_MTIME_DIRTY. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * fuse: update mtime on open(O_TRUNC) in atomic_o_trunc modeMaxim Patlasov2014-04-281-4/+14
| | | | | | | | | | | | | | | | | | | | | | In case of fc->atomic_o_trunc is set, fuse does nothing in fuse_do_setattr() while handling open(O_TRUNC). Hence, i_mtime must be updated explicitly in fuse_finish_open(). The patch also adds extra locking encompassing open(O_TRUNC) operation to avoid races between the truncation and updating i_mtime. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * fuse: update mtime on truncate(2)Maxim Patlasov2014-04-281-0/+2
| | | | | | | | | | | | | | | | | | | | Handling truncate(2), VFS doesn't set ATTR_MTIME bit in iattr structure; only ATTR_SIZE bit is set. In-kernel fuse must handle the case by setting mtime fields of struct fuse_setattr_in to "now" and set FATTR_MTIME bit even though ATTR_MTIME was not set. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * fuse: do not use uninitialized i_modeMaxim Patlasov2014-04-281-1/+1
| | | | | | | | | | | | | | | | When inode is in I_NEW state, inode->i_mode is not initialized yet. Do not use it before fuse_init_inode() is called. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * fuse: fix mtime update error in fsyncMiklos Szeredi2014-04-281-1/+1
| | | | | | | | | | | | Bad case of shadowing. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * fuse: check fallocate modeMiklos Szeredi2014-04-281-0/+3
| | | | | | | | | | | | | | Don't allow new fallocate modes until we figure out what (if anything) that takes. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * fuse: add __exit to fuse_ctl_cleanupFabian Frederick2014-04-282-2/+2
| | | | | | | | | | | | | | fuse_ctl_cleanup is only called by __exit fuse_exit Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparcLinus Torvalds2014-05-0612-107/+148
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull sparc fixes from David Miller: "I've been auditing the THP support on sparc64 and found several bugs, hopefully most of which are fixed completely here. Also an RT kernel locking fix from Kirill Tkhai" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc64: Give more detailed information in {pgd,pmd}_ERROR() and kill pte_ERROR(). sparc64: Add basic validations to {pud,pmd}_bad(). sparc64: Use 'ILOG2_4MB' instead of constant '22'. sparc64: Fix range check in kern_addr_valid(). sparc64: Fix top-level fault handling bugs. sparc64: Handle 32-bit tasks properly in compute_effective_address(). sparc64: Don't use _PAGE_PRESENT in pte_modify() mask. sparc64: Fix hex values in comment above pte_modify(). sparc64: Fix bugs in get_user_pages_fast() wrt. THP. sparc64: Fix huge PMD invalidation. sparc64: Fix executable bit testing in set_pmd_at() paths. sparc64: Normalize NMI watchdog logging and behavior. sparc64: Make itc_sync_lock raw sparc64: Fix argument sign extension for compat_sys_futex().
| * | sparc64: Give more detailed information in {pgd,pmd}_ERROR() and kill ↵David S. Miller2014-05-031-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pte_ERROR(). pte_ERROR() is not used anywhere, delete it. For pgd_ERROR() and pmd_ERROR(), output something similar to x86, giving the address of the pgd/pmd as well as it's value. Also provide the caller, since these macros are invoked from pgd_clear_bad() and pmd_clear_bad() which provides little context as to what high level operation was occuring when the BAD state was detected. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | sparc64: Add basic validations to {pud,pmd}_bad().David S. Miller2014-05-031-15/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of returning false we should at least check the most basic things, otherwise page table corruptions will be very difficult to debug. PMD and PTE tables are of size PAGE_SIZE, so none of the sub-PAGE_SIZE bits should be set. We also complement this with a check that the physical address the pud/pmd points to is valid memory. PowerPC was used as a guide while implementating this. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | sparc64: Use 'ILOG2_4MB' instead of constant '22'.David S. Miller2014-05-034-10/+10
| | | | | | | | | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
| * | sparc64: Fix range check in kern_addr_valid().David S. Miller2014-05-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit b2d438348024b75a1ee8b66b85d77f569a5dfed8 ("sparc64: Make PAGE_OFFSET variable."), the MAX_PHYS_ADDRESS_BITS value was increased (to 47). This constant reference to '41UL' was missed. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | sparc64: Fix top-level fault handling bugs.David S. Miller2014-05-031-30/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make get_user_insn() able to cope with huge PMDs. Next, make do_fault_siginfo() more robust when get_user_insn() can't actually fetch the instruction. In particular, use the MMU announced fault address when that happens, instead of calling compute_effective_address() and computing garbage. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | sparc64: Handle 32-bit tasks properly in compute_effective_address().David S. Miller2014-05-031-3/+9
| | | | | | | | | | | | | | | | | | | | | If we have a 32-bit task we must chop off the top 32-bits of the 64-bit value just as the cpu would. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | sparc64: Don't use _PAGE_PRESENT in pte_modify() mask.David S. Miller2014-05-031-4/+4
| | | | | | | | | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
| * | sparc64: Fix hex values in comment above pte_modify().David S. Miller2014-05-031-2/+2
| | | | | | | | | | | | | | | | | | | | | When _PAGE_SPECIAL and _PAGE_PMD_HUGE were added to the mask, the comment was not updated. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | sparc64: Fix bugs in get_user_pages_fast() wrt. THP.David S. Miller2014-05-032-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The large PMD path needs to check _PAGE_VALID not _PAGE_PRESENT, to decide if it needs to bail and return 0. pmd_large() should therefore just check _PAGE_PMD_HUGE. Calls to gup_huge_pmd() are guarded with a check of pmd_large(), so we just need to add a valid bit check. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | sparc64: Fix huge PMD invalidation.David S. Miller2014-05-033-15/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On sparc64 "present" and "valid" are seperate PTE bits, this allows us to naturally distinguish between the user explicitly asking for PROT_NONE with mprotect() and other situations. However we weren't handling this properly in the huge PMD paths. First of all, the page table walker in the TSB miss path only checks for _PAGE_PMD_HUGE. So the generic pmdp_invalidate() would clear _PAGE_PRESENT but the TLB miss paths would still load it into the TLB as a valid huge PMD. Fix this by clearing the valid bit in pmdp_invalidate(), and also checking the valid bit in USER_PGTABLE_CHECK_PMD_HUGE using "brgez" since _PAGE_VALID is bit 63 in both the sun4u and sun4v pte layouts. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | sparc64: Fix executable bit testing in set_pmd_at() paths.David S. Miller2014-05-031-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This code was mistakenly using the exec bit from the PMD in all cases, even when the PMD isn't a huge PMD. If it's not a huge PMD, test the exec bit in the individual ptes down in tlb_batch_pmd_scan(). Signed-off-by: David S. Miller <davem@davemloft.net>
| * | sparc64: Normalize NMI watchdog logging and behavior.David S. Miller2014-05-031-16/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bring this code in line with the perf based generic NMI watchdog in kernel/watchdog.c (which we should convert over to at some point). In particular, don't do anything super fancy when the watchdog triggers, and specifically don't do a do_exit() which only makes things worse. Either panic(), or WARN(). The latter of which will do all of the actions such as give us a stack backtrace. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | sparc64: Make itc_sync_lock rawKirill Tkhai2014-05-021-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One more place where we must not be able to be preempted or to be interrupted in RT. Always actually disable interrupts during synchronization cycle. Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | sparc64: Fix argument sign extension for compat_sys_futex().David S. Miller2014-05-021-1/+1
| | | | | | | | | | | | | | | | | | Only the second argument, 'op', is signed. Signed-off-by: David S. Miller <davem@davemloft.net>
* | | slab: Fix off by one in object max number tests.David Miller2014-05-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If freelist_idx_t is a byte, SLAB_OBJ_MAX_NUM should be 255 not 256, and likewise if freelist_idx_t is a short, then it should be 65535 not 65536. This was leading to all kinds of random crashes on sparc64 where PAGE_SIZE is 8192. One problem shown was that if spinlock debugging was enabled, we'd get deadlocks in copy_pte_range() or do_wp_page() with the same cpu already holding a lock it shouldn't hold, or the lock belonging to a completely unrelated process. Fixes: a41adfaa23df ("slab: introduce byte sized index for the freelist of a slab") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | slab: fix the type of the index on freelist index accessorJoonsoo Kim2014-05-051-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit a41adfaa23df ("slab: introduce byte sized index for the freelist of a slab") changes the size of freelist index and also changes prototype of accessor function to freelist index. And there was a mistake. The mistake is that although it changes the size of freelist index correctly, it changes the size of the index of freelist index incorrectly. With patch, freelist index can be 1 byte or 2 bytes, that means that num of object on on a slab can be more than 255. So we need more than 1 byte for the index to find the index of free object on freelist. But, above patch makes this index type 1 byte, so slab which have more than 255 objects cannot work properly and in consequence of it, the system cannot boot. This issue was reported by Steven King on m68knommu which would use 2 bytes freelist index: https://lkml.org/lkml/2014/4/16/433 To fix is easy. To change the type of the index of freelist index on accessor functions is enough to fix this bug. Although 2 bytes is enough, I use 4 bytes since it have no bad effect and make things more easier. This fix was suggested and tested by Steven in his original report. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-and-acked-by: Steven King <sfking@fdwdc.com> Acked-by: Christoph Lameter <cl@linux.com> Tested-by: James Hogan <james.hogan@imgtec.com> Tested-by: David Miller <davem@davemloft.net> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2014-05-05133-1043/+1496
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) e1000e computes header length incorrectly wrt vlans, fix from Vlad Yasevich. 2) ns_capable() check in sock_diag netlink code, from Andrew Lutomirski. 3) Fix invalid queue pairs handling in virtio_net, from Amos Kong. 4) Checksum offloading busted in sxgbe driver due to incorrect descriptor layout, fix from Byungho An. 5) Fix build failure with SMC_DEBUG set to 2 or larger, from Zi Shen Lim. 6) Fix uninitialized A and X registers in BPF interpreter, from Alexei Starovoitov. 7) Fix arch dependencies of candence driver. 8) Fix netlink capabilities checking tree-wide, from Eric W Biederman. 9) Don't dump IFLA_VF_PORTS if netlink request didn't ask for it in IFLA_EXT_MASK, from David Gibson. 10) IPV6 FIB dump restart doesn't handle table changes that happen meanwhile, causing the code to loop forever or emit dups, fix from Kumar Sandararajan. 11) Memory leak on VF removal in bnx2x, from Yuval Mintz. 12) Bug fixes for new Altera TSE driver from Vince Bridgers. 13) Fix route lookup key in SCTP, from Xugeng Zhang. 14) Use BH blocking spinlocks in SLIP, as per a similar fix to CAN/SLCAN driver. From Oliver Hartkopp. 15) TCP doesn't bump retransmit counters in some code paths, fix from Eric Dumazet. 16) Clamp delayed_ack in tcp_cubic to prevent theoretical divides by zero. Fix from Liu Yu. 17) Fix locking imbalance in error paths of HHF packet scheduler, from John Fastabend. 18) Properly reference the transport module when vsock_core_init() runs, from Andy King. 19) Fix buffer overflow in cdc_ncm driver, from Bjørn Mork. 20) IP_ECN_decapsulate() doesn't see a correct SKB network header in ip_tunnel_rcv(), fix from Ying Cai. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (132 commits) net: macb: Fix race between HW and driver net: macb: Remove 'unlikely' optimization net: macb: Re-enable RX interrupt only when RX is done net: macb: Clear interrupt flags net: macb: Pass same size to DMA_UNMAP as used for DMA_MAP ip_tunnel: Set network header properly for IP_ECN_decapsulate() e1000e: Restrict MDIO Slow Mode workaround to relevant parts e1000e: Fix issue with link flap on 82579 e1000e: Expand workaround for 10Mb HD throughput bug e1000e: Workaround for dropped packets in Gig/100 speeds on 82579 net/mlx4_core: Don't issue PCIe speed/width checks for VFs net/mlx4_core: Load the Eth driver first net/mlx4_core: Fix slave id computation for single port VF net/mlx4_core: Adjust port number in qp_attach wrapper when detaching net: cdc_ncm: fix buffer overflow Altera TSE: ALTERA_TSE should depend on HAS_DMA vsock: Make transport the proto owner net: sched: lock imbalance in hhf qdisc net: mvmdio: Check for a valid interrupt instead of an error net phy: Check for aneg completion before setting state to PHY_RUNNING ...
| * | | net: macb: Fix race between HW and driverSoren Brinkmann2014-05-051-10/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Under "heavy" RX load, the driver cannot handle the descriptors fast enough. In detail, when a descriptor is consumed, its used flag is cleared and once the RX budget is consumed all descriptors with a cleared used flag are prepared to receive more data. Under load though, the HW may constantly receive more data and use those descriptors with a cleared used flag before they are actually prepared for next usage. The head and tail pointers into the RX-ring should always be valid and we can omit clearing and checking of the used flag. Signed-off-by: Soren Brinkmann <soren.brinkmann@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: macb: Remove 'unlikely' optimizationSoren Brinkmann2014-05-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Coverage data suggests that the unlikely case of receiving data while the receive handler is running may not be that unlikely. Coverage data after running iperf for a while: 91320: 891: work_done = bp->macbgem_ops.mog_rx(bp, budget); 91320: 892: if (work_done < budget) { 2362: 893: napi_complete(napi); -: 894: -: 895: /* Packets received while interrupts were disabled */ 4724: 896: status = macb_readl(bp, RSR); 2362: 897: if (unlikely(status)) { 762: 898: if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE) 762: 899: macb_writel(bp, ISR, MACB_BIT(RCOMP)); -: 900: napi_reschedule(napi); -: 901: } else { 1600: 902: macb_writel(bp, IER, MACB_RX_INT_FLAGS); -: 903: } -: 904: } Signed-off-by: Soren Brinkmann <soren.brinkmann@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: macb: Re-enable RX interrupt only when RX is doneSoren Brinkmann2014-05-051-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When data is received during the driver processing received data the NAPI is re-scheduled. In that case the RX interrupt should not be re-enabled. Signed-off-by: Soren Brinkmann <soren.brinkmann@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: macb: Clear interrupt flagsSoren Brinkmann2014-05-051-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A few interrupt flags were not cleared in the ISR, resulting in a sytem trapped in the ISR in cases one of those interrupts occurred. Clear all flags to avoid such situations. Signed-off-by: Soren Brinkmann <soren.brinkmann@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: macb: Pass same size to DMA_UNMAP as used for DMA_MAPSoren Brinkmann2014-05-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just as commit "net: macb: DMA-unmap full rx-buffer" (48330e08fa168395b9fd9f369f06cca1df204361), pass the size that was used for mapping the memory also to the unmap routine to avoid warnings from the DMA_API. Signed-off-by: Soren Brinkmann <soren.brinkmann@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | ip_tunnel: Set network header properly for IP_ECN_decapsulate()Ying Cai2014-05-051-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In ip_tunnel_rcv(), set skb->network_header to inner IP header before IP_ECN_decapsulate(). Without the fix, IP_ECN_decapsulate() takes outer IP header as inner IP header, possibly causing error messages or packet drops. Note that this skb_reset_network_header() call was in this spot when the original feature for checking consistency of ECN bits through tunnels was added in eccc1bb8d4b4 ("tunnel: drop packet if ECN present with not-ECT"). It was only removed from this spot in 3d7b46cd20e3 ("ip_tunnel: push generic protocol handling to ip_tunnel module."). Fixes: 3d7b46cd20e3 ("ip_tunnel: push generic protocol handling to ip_tunnel module.") Reported-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Ying Cai <ycai@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | Merge branch 'master' of ↵David S. Miller2014-05-053-29/+46
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net Jeff Kirsher says: ==================== Intel Wired LAN Driver Updates This series contains updates to e1000e only. David provides four fixes for e1000e, first is a workaround for a hardware erratum on 82579 devices which experienced packet loss in gigabit and 100 speeds when interconnect between the PHY and MAC is exiting K1 power saving state. Second expands the scope of a workaround to include i217 and i218 parts as well to address over aggressive transmit behavior when connecting at 10Mbs half-duplex. Next is to resolve a reported link flap issue on 82579 parts which was root caused as an interoperability problem between 82579 and at least some Broadcom PHYs in the Energy Efficient Ethernet wake mechanism. Lastly, restricts the workaround of putting the PHY into MDIO slow mode to access the PHY id to relevant parts since this issue has been fixed on the newer hardware. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | e1000e: Restrict MDIO Slow Mode workaround to relevant partsDavid Ertman2014-05-051-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It has been determined that the workaround of putting the PHY into MDIO slow mode to access the PHY id is not necessary with Lynx Point and newer parts. The issue that necessitated the workaround has been fixed on the newer hardware. We will maintains, as a last ditch attempt, the conversion to MDIO Slow Mode in the failure branch when attempting to access the PHY id so as to cover all contingencies. Signed-off-by: Dave Ertman <davidx.m.ertman@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>