summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* sched: Add 'flags' argument to sched_{set,get}attr() syscallsPeter Zijlstra2014-02-212-7/+10
| | | | | | | | | | | | | Because of a recent syscall design debate; its deemed appropriate for each syscall to have a flags argument for future extension; without immediately requiring new syscalls. Cc: juri.lelli@gmail.com Cc: Ingo Molnar <mingo@redhat.com> Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140214161929.GL27965@twins.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* sched: Fix information leak in sys_sched_getattr()Vegard Nossum2014-02-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | We're copying the on-stack structure to userspace, but forgot to give the right number of bytes to copy. This allows the calling process to obtain up to PAGE_SIZE bytes from the stack (and possibly adjacent kernel memory). This fix copies only as much as we actually have on the stack (attr->size defaults to the size of the struct) and leaves the rest of the userspace-provided buffer untouched. Found using kmemcheck + trinity. Fixes: d50dde5a10f30 ("sched: Add new scheduler syscalls to support an extended scheduling parameters ABI") Cc: Dario Faggioli <raistlin@linux.it> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392585857-10725-1-git-send-email-vegard.nossum@oracle.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* sched,numa: add cond_resched to task_numa_workRik van Riel2014-02-211-0/+2
| | | | | | | | | | | | | | | | Normally task_numa_work scans over a fairly small amount of memory, but it is possible to run into a large unpopulated part of virtual memory, with no pages mapped. In that case, task_numa_work can run for a while, and it may make sense to reschedule as required. Cc: akpm@linux-foundation.org Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Xing Gang <gang.xing@hp.com> Tested-by: Chegu Vinod <chegu_vinod@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392761566-24834-2-git-send-email-riel@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* sched/core: Make dl_b->lock IRQ safeJuri Lelli2014-02-211-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix this lockdep warning: [ 44.804600] ========================================================= [ 44.805746] [ INFO: possible irq lock inversion dependency detected ] [ 44.805746] 3.14.0-rc2-test+ #14 Not tainted [ 44.805746] --------------------------------------------------------- [ 44.805746] bash/3674 just changed the state of lock: [ 44.805746] (&dl_b->lock){+.....}, at: [<ffffffff8106ad15>] sched_rt_handler+0x132/0x248 [ 44.805746] but this lock was taken by another, HARDIRQ-safe lock in the past: [ 44.805746] (&rq->lock){-.-.-.} and interrupts could create inverse lock ordering between them. [ 44.805746] [ 44.805746] other info that might help us debug this: [ 44.805746] Possible interrupt unsafe locking scenario: [ 44.805746] [ 44.805746] CPU0 CPU1 [ 44.805746] ---- ---- [ 44.805746] lock(&dl_b->lock); [ 44.805746] local_irq_disable(); [ 44.805746] lock(&rq->lock); [ 44.805746] lock(&dl_b->lock); [ 44.805746] <Interrupt> [ 44.805746] lock(&rq->lock); by making dl_b->lock acquiring always IRQ safe. Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392107067-19907-3-git-send-email-juri.lelli@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* sched/core: Fix sched_rt_global_validateJuri Lelli2014-02-211-1/+2
| | | | | | | | | | | | Don't compare sysctl_sched_rt_runtime against sysctl_sched_rt_period if the former is equal to RUNTIME_INF, otherwise disabling -rt bandwidth management (with CONFIG_RT_GROUP_SCHED=n) fails. Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392107067-19907-2-git-send-email-juri.lelli@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* sched/deadline: Fix overflow to handle period==0 and deadline!=0Steven Rostedt2014-02-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While debugging the crash with the bad nr_running accounting, I hit another bug where, after running my sched deadline test, I was getting failures to take a CPU offline. It was giving me a -EBUSY error. Adding a bunch of trace_printk()s around, I found that the cpu notifier that called sched_cpu_inactive() was returning a failure. The overflow value was coming up negative? Talking this over with Juri, the problem is that the total_bw update was suppose to be made by dl_overflow() which, during my tests, seemed to not be called. Adding more trace_printk()s, it wasn't that it wasn't called, but it exited out right away with the check of new_bw being equal to p->dl.dl_bw. The new_bw calculates the ratio between period and runtime. The bug is that if you set a deadline, you do not need to set a period if you plan on the period being equal to the deadline. That is, if period is zero and deadline is not, then the system call should set the period to be equal to the deadline. This is done elsewhere in the code. The fix is easy, check if period is set, and if it is not, then use the deadline. Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140219135335.7e74abd4@gandalf.local.home Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* sched/deadline: Fix bad accounting of nr_runningJuri Lelli2014-02-211-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rostedt writes: My test suite was locking up hard when enabling mmiotracer. This was due to the mmiotracer placing all but one CPU offline. I found this out when I was able to reproduce the bug with just my stress-cpu-hotplug test. This bug baffled me because it would not always trigger, and would only trigger on the first run after boot up. The stress-cpu-hotplug test would crash hard the first run, or never crash at all. But a new reboot may cause it to crash on the first run again. I spent all week bisecting this, as I couldn't find a consistent reproducer. I finally narrowed it down to the sched deadline patches, and even more peculiar, to the commit that added the sched deadline boot up self test to the latency tracer. Then it dawned on me to what the bug was. All it took was to run a task under sched deadline to screw up the CPU hot plugging. This explained why it would lock up only on the first run of the stress-cpu-hotplug test. The bug happened when the boot up self test of the schedule latency tracer would test a deadline task. The deadline task would corrupt something that would cause CPU hotplug to fail. If it didn't corrupt it, the stress test would always work (there's no other sched deadline tasks that would run to cause problems). If it did corrupt on boot up, the first test would lockup hard. I proved this theory by running my deadline test program on another box, and then run the stress-cpu-hotplug test, and it would now consistently lock up. I could run stress-cpu-hotplug over and over with no problem, but once I ran the deadline test, the next run of the stress-cpu-hotplug would lock hard. After adding lots of tracing to the code, I found the cause. The function tracer showed that migrate_tasks() was stuck in an infinite loop, where rq->nr_running never equaled 1 to break out of it. When I added a trace_printk() to see what that number was, it was 335 and never decrementing! Looking at the deadline code I found: static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags) { dequeue_dl_entity(&p->dl); dequeue_pushable_dl_task(rq, p); } static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags) { update_curr_dl(rq); __dequeue_task_dl(rq, p, flags); dec_nr_running(rq); } And this: if (dl_runtime_exceeded(rq, dl_se)) { __dequeue_task_dl(rq, curr, 0); if (likely(start_dl_timer(dl_se, curr->dl.dl_boosted))) dl_se->dl_throttled = 1; else enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH); if (!is_leftmost(curr, &rq->dl)) resched_task(curr); } Notice how we call __dequeue_task_dl() and in the else case we call enqueue_task_dl()? Also notice that dequeue_task_dl() has underscores where enqueue_task_dl() does not. The enqueue_task_dl() calls inc_nr_running(rq), but __dequeue_task_dl() does not. This is where we get nr_running out of sync. [snip] Another point where nr_running can get out of sync is when the dl_timer fires: dl_se->dl_throttled = 0; if (p->on_rq) { enqueue_task_dl(rq, p, ENQUEUE_REPLENISH); if (task_has_dl_policy(rq->curr)) check_preempt_curr_dl(rq, p, 0); else resched_task(rq->curr); This patch does two things: - correctly accounts for throttled tasks (that are now considered !running); - fixes the bug, updating nr_running from {inc,dec}_dl_tasks(), since we risk to update it twice in some situations (e.g., a task is dequeued while it has exceeded its budget). Cc: mingo@redhat.com Cc: torvalds@linux-foundation.org Cc: akpm@linux-foundation.org Reported-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392884379-13744-1-git-send-email-juri.lelli@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* Merge tag 'pci-v3.14-fixes-1' of ↵Linus Torvalds2014-02-206-22/+152
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: "The most interesting thing here is the change to enable INTx (by clearing PCI_COMMAND_INTX_DISABLE) if the BIOS left INTx disabled. Apparently the Baytrail BIOS does this, which means EHCI doesn't work. Also, fix an AHCI MSI regression and other issues with the recent MSI changes. This also adds pci_enable_msi_exact() and pci_enable_msix_exact(), which aren't regression fixes, but will keep us from touching drivers twice (once to stop using the deprecated pci_enable_msi(), etc., and again to use the *_exact() variants). There's also a minor MVEBU fix. Summary: MSI: - Fix AHCI single-MSI fallback (Alexander Gordeev) - Fix populate_msi_sysfs() error paths (Greg Kroah-Hartman) - Fix htmldocs problem (Masanari Iida) - Add pci_enable_msi_exact() and pci_enable_msix_exact() (Alexander Gordeev) - Update documentation (Alexander Gordeev) Miscellaneous: - mvebu: expose device ID & revision via lspci (Andrew Lunn) - Enable INTx if the BIOS left them disabled (Bjorn Helgaas)" * tag 'pci-v3.14-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: ahci: Fix broken fallback to single MSI mode PCI: Enable INTx if BIOS left them disabled PCI/MSI: Add pci_enable_msi_exact() and pci_enable_msix_exact() PCI/MSI: Fix cut-and-paste errors in documentation PCI/MSI: Add pci_enable_msi() documentation back PCI/MSI: Fix pci_msix_vec_count() htmldocs failure PCI/MSI: Fix leak of msi_attrs PCI/MSI: Check kmalloc() return value, fix leak of name PCI: mvebu: Use Device ID and revision from underlying endpoint
| * ahci: Fix broken fallback to single MSI modeAlexander Gordeev2014-02-141-1/+3
| | | | | | | | | | | | | | | | | | | | | | Commit 7b92b4f61ec4 ("PCI/MSI: Remove pci_enable_msi_block_auto()") introduced a regression: if multiple MSI initialization fails, the code falls back to INTx rather than to single MSI. Fixes: 7b92b4f61ec4 ("PCI/MSI: Remove pci_enable_msi_block_auto()") Signed-off-by: Alexander Gordeev <agordeev@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Tejun Heo <tj@kernel.org>
| * PCI: Enable INTx if BIOS left them disabledBjorn Helgaas2014-02-141-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some firmware leaves the Interrupt Disable bit set even if the device uses INTx interrupts. Clear Interrupt Disable so we get those interrupts. Based on the report mentioned below, if the user selects the "EHCI only" option in the Intel Baytrail BIOS, the EHCI device is handed off to the OS with the PCI_COMMAND_INTX_DISABLE bit set. Link: http://lkml.kernel.org/r/20140114181721.GC12126@xanatos Link: https://bugzilla.kernel.org/show_bug.cgi?id=70601 Reported-by: Chris Cheng <chris.cheng@atrustcorp.com> Reported-and-tested-by: Jamie Chen <jamie.chen@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> CC: stable@vger.kernel.org CC: Sarah Sharp <sarah.a.sharp@linux.intel.com>
| * PCI/MSI: Add pci_enable_msi_exact() and pci_enable_msix_exact()Alexander Gordeev2014-02-132-2/+104
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The new functions are special cases for pci_enable_msi_range() and pci_enable_msix_range() when a particular number of MSI or MSI-X is needed. By contrast with pci_enable_msi_range() and pci_enable_msix_range() functions, pci_enable_msi_exact() and pci_enable_msix_exact() return zero in case of success, which indicates MSI or MSI-X interrupts have been successfully allocated. Signed-off-by: Alexander Gordeev <agordeev@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * PCI/MSI: Fix cut-and-paste errors in documentationAlexander Gordeev2014-02-131-6/+6
| | | | | | | | | | | | | | | | | | Function pci_enable_msi_range() is used in examples where pci_enable_msix_range() should have been used instead. Reported-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Alexander Gordeev <agordeev@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * PCI/MSI: Add pci_enable_msi() documentation backAlexander Gordeev2014-02-131-3/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We deprecated pci_enable_msi() in 302a2523c277 ("PCI/MSI: Add pci_enable_msi_range() and pci_enable_msix_range()"). But we changed our minds after noticing that: - pci_enable_msi() doesn't have confusing return values like pci_enable_msi_block() and pci_enable_msix() did, and - pci_enable_msi() has a hundred or so callers that we don't want to change. This adds back the pci_enable_msi() documentation. [bhelgaas: changelog] Signed-off-by: Alexander Gordeev <agordeev@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * PCI/MSI: Fix pci_msix_vec_count() htmldocs failureMasanari Iida2014-02-131-1/+0
| | | | | | | | | | | | | | | | | | | | An empty line in msi.c caused "make htmldocs" failure: Warning(/home/iida/Repo/linux-next//drivers/pci/msi.c:962): bad line: Fixes: ff1aa430a2fa ("PCI/MSI: Add pci_msix_vec_count()") Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * PCI/MSI: Fix leak of msi_attrsGreg Kroah-Hartman2014-02-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Coverity reported that I forgot to clean up some allocated memory on the error path in populate_msi_sysfs(), so this patch fixes that. Thanks to Dave Jones for pointing out where the error was, I obviously can't read code this morning... Found by Coverity (CID 1163317). Fixes: 1c51b50c2995 ("PCI/MSI: Export MSI mode using attributes, not kobjects") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Dave Jones <davej@redhat.com>
| * PCI/MSI: Check kmalloc() return value, fix leak of nameGreg Kroah-Hartman2014-02-131-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | Coverity reported that I forgot to check the return value of kmalloc() when creating the MSI attribute name, so fix that up and properly free it if there is an error when allocating the msi_dev_attr variable. Found by Coverity (CID 1163315 and 1163316). Fixes: 1c51b50c2995 ("PCI/MSI: Export MSI mode using attributes, not kobjects") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * PCI: mvebu: Use Device ID and revision from underlying endpointAndrew Lunn2014-02-121-9/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Marvell SoCs place the SoC number into the PCIe endpoint device ID. The SoC stepping is placed into the PCIe revision. The old plat-orion PCIe driver allowed this information to be seen in user space with a simple lspci command. The new driver places a virtual PCI-PCI bridge on top of these endpoints. It has its own hard coded PCI device ID. Thus it is no longer possible to see what the SoC is using lspci. When initializing the PCI-PCI bridge, set its device ID and revision from the underlying endpoint, thus restoring this functionality. Debian would like to use this in order to aid installing the correct DTB file. Fixes: 45361a4fe4464 ("pci: PCIe driver for Marvell Armada 370/XP systems") Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Jason Cooper <jason@lakedaemon.net> Cc: stable@vger.kernel.org # v3.11+
* | Merge branch 'for-3.14-fixes' of ↵Linus Torvalds2014-02-206-8/+35
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata Pull libata fixes from Tejun Heo: "Various device specific fixes. Nothing too interesting" * 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: ahci: disable NCQ on Samsung pci-e SSDs on macbooks ata: sata_mv: Cleanup only the initialized ports sata_sil: apply MOD15WRITE quirk to TOSHIBA MK2561GSYN ata: enable quirk from jmicron JMB350 for JMB394 ATA: SATA_MV: Add missing Kconfig select statememnt ata: pata_imx: Check the return value from clk_prepare_enable()
| * | ahci: disable NCQ on Samsung pci-e SSDs on macbooksLevente Kurusa2014-02-181-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Samsung's pci-e SSDs with device ID 0x1600 which are found on some macbooks time out on NCQ commands. Blacklist NCQ on the device so that the affected machines can at least boot. Original-patch-by: Levente Kurusa <levex@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=60731 Cc: stable@vger.kernel.org
| * | ata: sata_mv: Cleanup only the initialized portsEzequiel Garcia2014-02-161-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an error occurs in the port initialization loop, currently the driver tries to cleanup all the ports. This results in a NULL pointer dereference if the ports were only partially initialized. Fix this by updating only the number of initialized ports (either with failure or successfully), before jumping to the error path and looping over that number in the cleanup loop. Cc: Andrew Lunn <andrew@lunn.ch> Reported-by: Mikael Pettersson <mikpelinux@gmail.com> Signed-off-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
| * | sata_sil: apply MOD15WRITE quirk to TOSHIBA MK2561GSYNTejun Heo2014-02-031-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's a bit odd to see a newer device showing mod15write; however, the reported behavior is highly consistent and other factors which could contribute seem to have been verified well enough. Also, both sata_sil itself and the drive are fairly outdated at this point making the risk of this change fairly low. It is possible, probably likely, that other drive models in the same family have the same problem; however, for now, let's just add the specific model which was tested. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: matson <lists-matsonpa@luxsci.me> References: http://lkml.kernel.org/g/201401211912.s0LJCk7F015058@rs103.luxsci.com Cc: stable@vger.kernel.org
| * | ata: enable quirk from jmicron JMB350 for JMB394Denis V. Lunev2014-01-311-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Without the patch the kernel generates the following error. ata11.15: SATA link up 1.5 Gbps (SStatus 113 SControl 310) ata11.15: Port Multiplier vendor mismatch '0x197b' != '0x123' ata11.15: PMP revalidation failed (errno=-19) ata11.15: failed to recover PMP after 5 tries, giving up This patch helps to bypass this error and the device becomes functional. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: <linux-ide@vger.kernel.org> Cc: stable@vger.kernel.org
| * | ATA: SATA_MV: Add missing Kconfig select statememntAndrew Lunn2014-01-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | SATA_MV depends on GENERIC_PHY. So if SATA_MV is built in, GENERIC_PHY cannot be modular. Fixes build error found by kbuild test robot. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | ata: pata_imx: Check the return value from clk_prepare_enable()Fabio Estevam2014-01-291-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | clk_prepare_enable() may fail, so let's check its return value and propagate it in the case of error. Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* | | Merge branch 'for-3.14-fixes' of ↵Linus Torvalds2014-02-207-28/+40
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Quite a few fixes this time. Three locking fixes, all marked for -stable. A couple error path fixes and some misc fixes. Hugh found a bug in memcg offlining sequence and we thought we could fix that from cgroup core side but that turned out to be insufficient and got reverted. A different fix has been applied to -mm" * 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: update cgroup_enable_task_cg_lists() to grab siglock Revert "cgroup: use an ordered workqueue for cgroup destruction" cgroup: protect modifications to cgroup_idr with cgroup_mutex cgroup: fix locking in cgroup_cfts_commit() cgroup: fix error return from cgroup_create() cgroup: fix error return value in cgroup_mount() cgroup: use an ordered workqueue for cgroup destruction nfs: include xattr.h from fs/nfs/nfs3proc.c cpuset: update MAINTAINERS entry arm, pm, vmpressure: add missing slab.h includes
| * | | cgroup: update cgroup_enable_task_cg_lists() to grab siglockTejun Heo2014-02-181-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, there's nothing preventing cgroup_enable_task_cg_lists() from missing set PF_EXITING and race against cgroup_exit(). Depending on the timing, cgroup_exit() may finish with the task still linked on css_set leading to list corruption. Fix it by grabbing siglock in cgroup_enable_task_cg_lists() so that PF_EXITING is guaranteed to be visible. This whole on-demand cg_list optimization is extremely fragile and has ample possibility to lead to bugs which can cause things like once-a-year oops during boot. I'm wondering whether the better approach would be just adding "cgroup_disable=all" handling which disables the whole cgroup rather than tempting fate with this on-demand craziness. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org
| * | | Revert "cgroup: use an ordered workqueue for cgroup destruction"Tejun Heo2014-02-121-6/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit ab3f5faa6255a0eb4f832675507d9e295ca7e9ba. Explanation from Hugh: It's because more thorough testing, by others here, found that it wasn't always solving the problem: so I asked Tejun privately to hold off from sending it in, until we'd worked out why not. Most of our testing being on a v3,11-based kernel, it was perfectly possible that the problem was merely our own e.g. missing Tejun's 8a2b75384444 ("workqueue: fix ordered workqueues in NUMA setups"). But that turned out not to be enough to fix it either. Then Filipe pointed out how percpu_ref_kill_and_confirm() uses call_rcu_sched() before we ever get to put the offline on to the workqueue: by the time we get to the workqueue, the ordering has already been lost. So, thanks for the Acks, but I'm afraid that this ordered workqueue solution is just not good enough: we should simply forget that patch and provide a different answer." Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com>
| * | | cgroup: protect modifications to cgroup_idr with cgroup_mutexLi Zefan2014-02-112-16/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Setup cgroupfs like this: # mount -t cgroup -o cpuacct xxx /cgroup # mkdir /cgroup/sub1 # mkdir /cgroup/sub2 Then run these two commands: # for ((; ;)) { mkdir /cgroup/sub1/tmp && rmdir /mnt/sub1/tmp; } & # for ((; ;)) { mkdir /cgroup/sub2/tmp && rmdir /mnt/sub2/tmp; } & After seconds you may see this warning: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 25243 at lib/idr.c:527 sub_remove+0x87/0x1b0() idr_remove called for id=6 which is not allocated. ... Call Trace: [<ffffffff8156063c>] dump_stack+0x7a/0x96 [<ffffffff810591ac>] warn_slowpath_common+0x8c/0xc0 [<ffffffff81059296>] warn_slowpath_fmt+0x46/0x50 [<ffffffff81300aa7>] sub_remove+0x87/0x1b0 [<ffffffff810f3f02>] ? css_killed_work_fn+0x32/0x1b0 [<ffffffff81300bf5>] idr_remove+0x25/0xd0 [<ffffffff810f2bab>] cgroup_destroy_css_killed+0x5b/0xc0 [<ffffffff810f4000>] css_killed_work_fn+0x130/0x1b0 [<ffffffff8107cdbc>] process_one_work+0x26c/0x550 [<ffffffff8107eefe>] worker_thread+0x12e/0x3b0 [<ffffffff81085f96>] kthread+0xe6/0xf0 [<ffffffff81570bac>] ret_from_fork+0x7c/0xb0 ---[ end trace 2d1577ec10cf80d0 ]--- It's because allocating/removing cgroup ID is not properly synchronized. The bug was introduced when we converted cgroup_ida to cgroup_idr. While synchronization is already done inside ida_simple_{get,remove}(), users are responsible for concurrent calls to idr_{alloc,remove}(). tj: Refreshed on top of b58c89986a77 ("cgroup: fix error return from cgroup_create()"). Fixes: 4e96ee8e981b ("cgroup: convert cgroup_ida to cgroup_idr") Cc: <stable@vger.kernel.org> #3.12+ Reported-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | cgroup: fix locking in cgroup_cfts_commit()Tejun Heo2014-02-081-9/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_cfts_commit() walks the cgroup hierarchy that the target subsystem is attached to and tries to apply the file changes. Due to the convolution with inode locking, it can't keep cgroup_mutex locked while iterating. It currently holds only RCU read lock around the actual iteration and then pins the found cgroup using dget(). Unfortunately, this is incorrect. Although the iteration does check cgroup_is_dead() before invoking dget(), there's nothing which prevents the dentry from going away inbetween. Note that this is different from the usual css iterations where css_tryget() is used to pin the css - css_tryget() tests whether the css can be pinned and fails if not. The problem can be solved by simply holding cgroup_mutex instead of RCU read lock around the iteration, which actually reduces LOC. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org
| * | | cgroup: fix error return from cgroup_create()Tejun Heo2014-02-081-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_create() was returning 0 after allocation failures. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org
| * | | cgroup: fix error return value in cgroup_mount()Tejun Heo2014-02-081-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When cgroup_mount() fails to allocate an id for the root, it didn't set ret before jumping to unlock_drop ending up returning 0 after a failure. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org
| * | | cgroup: use an ordered workqueue for cgroup destructionHugh Dickins2014-02-071-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sometimes the cleanup after memcg hierarchy testing gets stuck in mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0. There may turn out to be several causes, but a major cause is this: the workitem to offline parent can get run before workitem to offline child; parent's mem_cgroup_reparent_charges() circles around waiting for the child's pages to be reparented to its lrus, but it's holding cgroup_mutex which prevents the child from reaching its mem_cgroup_reparent_charges(). Just use an ordered workqueue for cgroup_destroy_wq. tj: Committing as the temporary fix until the reverse dependency can be removed from memcg. Comment updated accordingly. Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction") Suggested-by: Filipe Brandenburger <filbranden@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | nfs: include xattr.h from fs/nfs/nfs3proc.cTejun Heo2014-02-031-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fs/nfs/nfs3proc.c is making use of xattr but was getting linux/xattr.h indirectly through linux/cgroup.h, which will soon drop the inclusion of xattr.h. Explicitly include linux/xattr.h from nfs3proc.c so that compilation doesn't fail when linux/cgroup.h drops linux/xattr.h. As the following cgroup changes will depend on these changes, it probably would be easier to route this through cgroup branch. Would that be okay? Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com> Cc: linux-nfs@vger.kernel.org
| * | | cpuset: update MAINTAINERS entryLi Zefan2014-02-031-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add mailing list and tree tag to the entry. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | arm, pm, vmpressure: add missing slab.h includesTejun Heo2014-02-033-0/+3
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | arch/arm/mach-tegra/pm.c, kernel/power/console.c and mm/vmpressure.c were somehow getting slab.h indirectly through cgroup.h which in turn was getting it indirectly through xattr.h. A scheduled cgroup change drops xattr.h inclusion from cgroup.h and breaks compilation of these three files. Add explicit slab.h includes to the three files. A pending cgroup patch depends on this change and it'd be great if this can be routed through cgroup/for-3.14-fixes branch. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Stephen Warren <swarren@wwwdotorg.org> Cc: Thierry Reding <thierry.reding@gmail.com> Cc: linux-tegra@vger.kernel.org Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: linux-pm@vger.kernel.org Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: cgroups@vger.kernel.org
* | | Merge branch 'for-3.14-fixes' of ↵Linus Torvalds2014-02-202-4/+8
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: "Two workqueue fixes. One for an unlikely but possible critical bug during kworker shutdown and the other to make lockdep names a bit more descriptive" * 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: ensure @task is valid across kthread_stop() workqueue: add args to workqueue lockdep name
| * | | workqueue: ensure @task is valid across kthread_stop()Lai Jiangshan2014-02-181-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a kworker should die, the kworkre is notified through WORKER_DIE flag instead of kthread_should_stop(). This, IIRC, is primarily to keep the test synchronized inside worker_pool lock. WORKER_DIE is first set while holding pool->lock, the lock is dropped and kthread_stop() is called. Unfortunately, this means that there's a slight chance that the target kworker may see WORKER_DIE before kthread_stop() finishes and exits and frees the target task before or during kthread_stop(). Fix it by pinning the target task before setting WORKER_DIE and putting it after kthread_stop() is done. tj: Improved patch description and comment. Moved pinning above WORKER_DIE for better signify what it's protecting. CC: stable@vger.kernel.org Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | workqueue: add args to workqueue lockdep nameLi Zhong2014-02-141-4/+1
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tommi noticed a 'funny' lock class name: "%s#5" from a lock acquired in process_one_work(). Maybe #fmt plus #args could be used as the lock_name to give some more information for some fmt string like the above. __builtin_constant_p() check is removed (as there seems no good way to check all the variables in args list). However, by removing the check, it only adds two additional "s for those constants. Some lockdep name examples printed out after the change: lockdep name wq->name "events_long" events_long "%s"("khelper") khelper "xfs-data/%s"mp->m_fsname xfs-data/dm-3 Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* | | Merge branch 'fixes-for-v3.14' of ↵Linus Torvalds2014-02-202-2/+4
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.linaro.org/people/mszyprowski/linux-dma-mapping Pull DMA-mapping fixes from Marek Szyprowski: "This contains fixes for incorrect atomic test in dma-mapping subsystem for ARM and x86 architecture" * 'fixes-for-v3.14' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping: x86: dma-mapping: fix GFP_ATOMIC macro usage ARM: dma-mapping: fix GFP_ATOMIC macro usage
| * | | x86: dma-mapping: fix GFP_ATOMIC macro usageMarek Szyprowski2014-02-111-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | GFP_ATOMIC is not a single gfp flag, but a macro which expands to the other flags, where meaningful is the LACK of __GFP_WAIT flag. To check if caller wants to perform an atomic allocation, the code must test for a lack of the __GFP_WAIT flag. This patch fixes the issue introduced in v3.5-rc1. CC: stable@vger.kernel.org Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
| * | | ARM: dma-mapping: fix GFP_ATOMIC macro usageMarek Szyprowski2014-02-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | GFP_ATOMIC is not a single gfp flag, but a macro which expands to the other flags and LACK of __GFP_WAIT flag. To check if caller wanted to perform an atomic allocation, the code must test __GFP_WAIT flag presence. This patch fixes the issue introduced in v3.6-rc5 Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> CC: stable@vger.kernel.org
* | | | user_namespace.c: Remove duplicated word in commentBrian Campbell2014-02-201-1/+1
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Brian Campbell <brian.campbell@editshare.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | Merge tag 'nfs-for-3.14-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2014-02-198-21/+60
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client bugfixes from Trond Myklebust: "Highlights include stable fixes for the following bugs: - General performance regression due to NFS_INO_INVALID_LABEL being set when the server doesn't support labeled NFS - Hang in the RPC code due to a socket out-of-buffer race - Infinite loop when trying to establish the NFSv4 lease - Use-after-free bug in the RPCSEC gss code. - nfs4_select_rw_stateid is returning with a non-zero error value on success Other bug fixes: - Potential memory scribble in the RPC bi-directional RPC code - Pipe version reference leak - Use the correct net namespace in the new NFSv4 migration code" * tag 'nfs-for-3.14-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS fix error return in nfs4_select_rw_stateid NFSv4: Use the correct net namespace in nfs4_update_server SUNRPC: Fix a pipe_version reference leak SUNRPC: Ensure that gss_auth isn't freed before its upcall messages SUNRPC: Fix potential memory scribble in xprt_free_bc_request() SUNRPC: Fix races in xs_nospace() SUNRPC: Don't create a gss auth cache unless rpc.gssd is running NFS: Do not set NFS_INO_INVALID_LABEL unless server supports labeled NFS
| * | | | NFS fix error return in nfs4_select_rw_stateidAndy Adamson2014-02-191-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Do not return an error when nfs4_copy_delegation_stateid succeeds. Signed-off-by: Andy Adamson <andros@netapp.com> Link: http://lkml.kernel.org/r/1392737765-41942-1-git-send-email-andros@netapp.com Fixes: ef1820f9be27b (NFSv4: Don't try to recover NFSv4 locks when...) Cc: NeilBrown <neilb@suse.de> Cc: stable@vger.kernel.org # 3.12+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | NFSv4: Use the correct net namespace in nfs4_update_serverTrond Myklebust2014-02-173-10/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to use the same net namespace that was used to resolve the hostname and sockaddr arguments. Fixes: 32e62b7c3ef09 (NFS: Add nfs4_update_server) Cc: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | SUNRPC: Fix a pipe_version reference leakTrond Myklebust2014-02-161-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In gss_alloc_msg(), if the call to gss_encode_v1_msg() fails, we want to release the reference to the pipe_version that was obtained earlier in the function. Fixes: 9d3a2260f0f4b (SUNRPC: Fix buffer overflow checking in...) Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | SUNRPC: Ensure that gss_auth isn't freed before its upcall messagesTrond Myklebust2014-02-161-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a race in which the RPC client is shutting down while the gss daemon is processing a downcall. If the RPC client manages to shut down before the gss daemon is done, then the struct gss_auth used in gss_release_msg() may have already been freed. Link: http://lkml.kernel.org/r/1392494917.71728.YahooMailNeo@web140002.mail.bf1.yahoo.com Reported-by: John <da_audiophile@yahoo.com> Reported-by: Borislav Petkov <bp@alien8.de> Cc: stable@vger.kernel.org # 3.12+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | SUNRPC: Fix potential memory scribble in xprt_free_bc_request()Trond Myklebust2014-02-111-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The call to xprt_free_allocation() will call list_del() on req->rq_bc_pa_list, which is not attached to a list. This patch moves the list_del() out of xprt_free_allocation() and into those callers that need it. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | SUNRPC: Fix races in xs_nospace()Trond Myklebust2014-02-111-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a send failure occurs due to the socket being out of buffer space, we call xs_nospace() in order to have the RPC task wait until the socket has drained enough to make it worth while trying again. The current patch fixes a race in which the socket is drained before we get round to setting up the machinery in xs_nospace(), and which is reported to cause hangs. Link: http://lkml.kernel.org/r/20140210170315.33dfc621@notabene.brown Fixes: a9a6b52ee1ba (SUNRPC: Don't start the retransmission timer...) Reported-by: Neil Brown <neilb@suse.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | SUNRPC: Don't create a gss auth cache unless rpc.gssd is runningTrond Myklebust2014-02-101-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An infinite loop is caused when nfs4_establish_lease() fails with -EACCES. This causes nfs4_handle_reclaim_lease_error() to sleep a bit and resets the NFS4CLNT_LEASE_EXPIRED bit. This in turn causes nfs4_state_manager() to try and reestablished the lease, again, again, again... The problem is a valid RPCSEC_GSS client is being created when rpc.gssd is not running. Link: http://lkml.kernel.org/r/1392066375-16502-1-git-send-email-steved@redhat.com Fixes: 0ea9de0ea6a4 (sunrpc: turn warn_gssd() log message into a dprintk()) Reported-by: Steve Dickson <steved@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>