summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* cpuset: schedule hotplug propagation from cpuset_attach() if the cpuset is emptyTejun Heo2013-01-071-0/+14
| | | | | | | | | | | | | | | | | | | | | | cpuset is scheduled to be decoupled from cgroup_lock which will make hotplug handling race with task migration. cpus or mems will be allowed to go offline between ->can_attach() and ->attach(). If hotplug takes down all cpus or mems of a cpuset while attach is in progress, ->attach() may end up putting tasks into an empty cpuset. This patchset makes ->attach() schedule hotplug propagation if the cpuset is empty after attaching is complete. This will move the tasks to the nearest ancestor which can execute and the end result would be as if hotplug handling happened after the tasks finished attaching. cpuset_write_resmask() now also flushes cpuset_propagate_hotplug_wq to wait for propagations scheduled directly by cpuset_attach(). This currently doesn't make any functional difference as everything is protected by cgroup_mutex but enables decoupling the locking. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cpuset: pin down cpus and mems while a task is being attachedTejun Heo2013-01-071-2/+26
| | | | | | | | | | | | | | | | | | | | | cpuset is scheduled to be decoupled from cgroup_lock which will make configuration updates race with task migration. Any config update will be allowed to happen between ->can_attach() and ->attach(). If such config update removes either all cpus or mems, by the time ->attach() is called, the condition verified by ->can_attach(), that the cpuset is capable of hosting the tasks, is no longer true. This patch adds cpuset->attach_in_progress which is incremented from ->can_attach() and decremented when the attach operation finishes either successfully or not. validate_change() treats cpusets w/ non-zero ->attach_in_progress like cpusets w/ tasks and refuses to remove all cpus or mems from it. This currently doesn't make any functional difference as everything is protected by cgroup_mutex but enables decoupling the locking. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cpuset: make CPU / memory hotplug propagation asynchronousTejun Heo2013-01-071-6/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | cpuset_hotplug_workfn() has been invoking cpuset_propagate_hotplug() directly to propagate hotplug updates to !root cpusets; however, this has the following problems. * cpuset locking is scheduled to be decoupled from cgroup_mutex, cgroup_mutex will be unexported, and cgroup_attach_task() will do cgroup locking internally, so propagation can't synchronously move tasks to a parent cgroup while walking the hierarchy. * We can't use cgroup generic tree iterator because propagation to each cpuset may sleep. With propagation done asynchronously, we can lose the rather ugly cpuset specific iteration. Convert cpuset_propagate_hotplug() to cpuset_propagate_hotplug_workfn() and execute it from newly added cpuset->hotplug_work. The work items are run on an ordered workqueue, so the propagation order is preserved. cpuset_hotplug_workfn() schedules all propagations while holding cgroup_mutex and waits for completion without cgroup_mutex. Each in-flight propagation holds a reference to the cpuset->css. This patch doesn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cpuset: drop async_rebuild_sched_domains()Tejun Heo2013-01-071-60/+16
| | | | | | | | | | | | | | | | In general, we want to make cgroup_mutex one of the outermost locks and be able to use get_online_cpus() and friends from cgroup methods. With cpuset hotplug made async, get_online_cpus() can now be nested inside cgroup_mutex. Currently, cpuset avoids nesting get_online_cpus() inside cgroup_mutex by bouncing sched_domain rebuilding to a work item. As such nesting is allowed now, remove the workqueue bouncing code and always rebuild sched_domains synchronously. This also nests sched_domains_mutex inside cgroup_mutex, which is intended and should be okay. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cpuset: don't nest cgroup_mutex inside get_online_cpus()Tejun Heo2013-01-071-4/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CPU / memory hotplug path currently grabs cgroup_mutex from hotplug event notifications. We want to separate cpuset locking from cgroup core and make cgroup_mutex outer to hotplug synchronization so that, among other things, mechanisms which depend on get_online_cpus() can be used from cgroup callbacks. In general, we want to keep cgroup_mutex the outermost lock to minimize locking interactions among different controllers. Convert cpuset_handle_hotplug() to cpuset_hotplug_workfn() and schedule it from the hotplug notifications. As the function can already handle multiple mixed events without any input, converting it to a work function is mostly trivial; however, one complication is that cpuset_update_active_cpus() needs to update sched domains synchronously to reflect an offlined cpu to avoid confusing the scheduler. This is worked around by falling back to the the default single sched domain synchronously before scheduling the actual hotplug work. This makes sched domain rebuilt twice per CPU hotplug event but the operation isn't that heavy and a lot of the second operation would be noop for systems w/ single sched domain, which is the common case. This decouples cpuset hotplug handling from the notification callbacks and there can be an arbitrary delay between the actual event and updates to cpusets. Scheduler and mm can handle it fine but moving tasks out of an empty cpuset may race against writes to the cpuset restoring execution resources which can lead to confusing behavior. Flush hotplug work item from cpuset_write_resmask() to avoid such confusions. v2: Synchronous sched domain rebuilding using the fallback sched domain added. This fixes various issues caused by confused scheduler putting tasks on a dead CPU, including the one reported by Li Zefan. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cpuset: reorganize CPU / memory hotplug handlingTejun Heo2013-01-071-117/+104
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reorganize hotplug path to prepare for async hotplug handling. * Both CPU and memory hotplug handlings are collected into a single function - cpuset_handle_hotplug(). It doesn't take any argument but compares the current setttings of top_cpuset against what's actually available to determine what happened. This function directly updates top_cpuset. If there are CPUs or memory nodes which are taken down, cpuset_propagate_hotplug() in invoked on all !root cpusets. * cpuset_propagate_hotplug() is responsible for updating the specified cpuset so that it doesn't include any resource which isn't available to top_cpuset. If no CPU or memory is left after update, all tasks are moved to the nearest ancestor with both resources. * update_tasks_cpumask() and update_tasks_nodemask() are now always called after cpus or mems masks are updated even if the cpuset doesn't have any task. This is for brevity and not expected to have any measureable effect. * cpu_active_mask and N_HIGH_MEMORY are read exactly once per cpuset_handle_hotplug() invocation, all cpusets share the same view of what resources are available, and cpuset_handle_hotplug() can handle multiple resources going up and down. These properties will allow async operation. The reorganization, while drastic, is equivalent and shouldn't cause any behavior difference. This will enable making hotplug handling async and remove get_online_cpus() -> cgroup_mutex nesting. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cpuset: cleanup cpuset[_can]_attach()Tejun Heo2013-01-071-17/+18
| | | | | | | | | | | | | | | | | | | | | | cpuset_can_attach() prepare global variables cpus_attach and cpuset_attach_nodemask_{to|from} which are used by cpuset_attach(). There is no reason to prepare in cpuset_can_attach(). The same information can be accessed from cpuset_attach(). Move the prepartion logic from cpuset_can_attach() to cpuset_attach() and make the global variables static ones inside cpuset_attach(). With this change, there's no reason to keep cpuset_attach_nodemask_{from|to} global. Move them inside cpuset_attach(). Unfortunately, we need to keep cpus_attach global as it can't be allocated from cpuset_attach(). v2: cpus_attach not converted to cpumask_t as per Li Zefan and Rusty Russell. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Rusty Russell <rusty@rustcorp.com.au>
* cpuset: introduce cpuset_for_each_child()Tejun Heo2013-01-071-31/+54
| | | | | | | | | | | | | | | Instead of iterating cgroup->children directly, introduce and use cpuset_for_each_child() which wraps cgroup_for_each_child() and performs online check. As it uses the generic iterator, it requires RCU read locking too. As cpuset is currently protected by cgroup_mutex, non-online cpusets aren't visible to all the iterations and this patch currently doesn't make any functional difference. This will be used to de-couple cpuset locking from cgroup core. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cpuset: introduce CS_ONLINETejun Heo2013-01-071-1/+10
| | | | | | | | | Add CS_ONLINE which is set from css_online() and cleared from css_offline(). This will enable using generic cgroup iterator while allowing decoupling cpuset from cgroup internal locking. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cpuset: introduce ->css_on/offline()Tejun Heo2013-01-071-22/+44
| | | | | | | | | | | | | | | | | | | Add cpuset_css_on/offline() and rearrange css init/exit such that, * Allocation and clearing to the default values happen in css_alloc(). Allocation now uses kzalloc(). * Config inheritance and registration happen in css_online(). * css_offline() undoes what css_online() did. * css_free() frees. This doesn't introduce any visible behavior changes. This will help cleaning up locking. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cpuset: remove fast exit path from remove_tasks_in_empty_cpuset()Tejun Heo2013-01-071-8/+0
| | | | | | | | | | The function isn't that hot, the overhead of missing the fast exit is low, the test itself depends heavily on cgroup internals, and it's gonna be a hindrance when trying to decouple cpuset locking from cgroup core. Remove the fast exit path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cpuset: remove unused cpuset_unlock()Tejun Heo2013-01-071-11/+0
| | | | | Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cgroup: implement cgroup_rightmost_descendant()Tejun Heo2013-01-072-0/+27
| | | | | | | | | | | Implement cgroup_rightmost_descendant() which returns the right most descendant of the specified cgroup. This can be used to skip the cgroup's subtree while iterating with cgroup_for_each_descendant_pre(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com>
* cgroup: remove unused dummy cgroup_fork_callbacks()Tejun Heo2013-01-071-1/+0
| | | | | | | | | 5edee61ede ("cgroup: cgroup_subsys->fork() should be called after the task is added to css_set") removed cgroup_fork_callbacks() but forgot to remove its dummy version for !CONFIG_CGROUPS. Remove it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
* Linux 3.8-rc2v3.8-rc2Linus Torvalds2013-01-021-1/+1
|
* Merge branch 'fixes-for-3.8' of ↵Linus Torvalds2013-01-021-2/+3
|\ | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/linux-leds Pull LED fix from Bryan Wu. * 'fixes-for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/linux-leds: leds: leds-gpio: set devm_gpio_request_one() flags param correctly
| * leds: leds-gpio: set devm_gpio_request_one() flags param correctlyJavier Martinez Canillas2013-01-021-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a99d76f leds: leds-gpio: use gpio_request_one changed the leds-gpio driver to use gpio_request_one() instead of gpio_request() + gpio_direction_output() Unfortunately, it also made a semantic change that breaks the leds-gpio driver. The gpio_request_one() flags parameter was set to: GPIOF_DIR_OUT | (led_dat->active_low ^ state) Since GPIOF_DIR_OUT is 0, the final flags value will just be the XOR'ed value of led_dat->active_low and state. This value were used to distinguish between HIGH/LOW output initial level and call gpio_direction_output() accordingly. With this new semantic gpio_request_one() will take the flags value of 1 as a configuration of input direction (GPIOF_DIR_IN) and will call gpio_direction_input() instead of gpio_direction_output(). int gpio_request_one(unsigned gpio, unsigned long flags, const char *label) { .. if (flags & GPIOF_DIR_IN) err = gpio_direction_input(gpio); else err = gpio_direction_output(gpio, (flags & GPIOF_INIT_HIGH) ? 1 : 0); .. } The right semantic is to evaluate led_dat->active_low ^ state and set the output initial level explicitly. Signed-off-by: Javier Martinez Canillas <javier.martinez@collabora.co.uk> Reported-by: Arnaud Patard <arnaud.patard@rtp-net.org> Tested-by: Ezequiel Garcia <ezequiel.garcia@free-electrons.com> Signed-off-by: Bryan Wu <cooloney@gmail.com>
* | Merge git://www.linux-watchdog.org/linux-watchdogLinus Torvalds2013-01-025-14/+29
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull watchdog fixes from Wim Van Sebroeck: "This fixes some small errors in the new da9055 driver, eliminates a compiler warning and adds DT support for the twl4030_wdt driver (so that we can have multiple watchdogs with DT on the omap platforms)." * git://www.linux-watchdog.org/linux-watchdog: watchdog: twl4030_wdt: add DT support watchdog: omap_wdt: eliminate unused variable and a compiler warning watchdog: da9055: Don't update wdt_dev->timeout in da9055_wdt_set_timeout error path watchdog: da9055: Fix invalid free of devm_ allocated data
| * | watchdog: twl4030_wdt: add DT supportAaro Koskinen2013-01-023-2/+23
| | | | | | | | | | | | | | | | | | | | | | | | Add DT support for twl4030_wdt. This is needed to get twl4030_wdt to probe when booting with DT. Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
| * | watchdog: omap_wdt: eliminate unused variable and a compiler warningAaro Koskinen2013-01-021-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We forgot to delete this in the commit 4f4753d9 (watchdog: omap_wdt: convert to devm_ functions), and as a result the following compilation warning was introduced: drivers/watchdog/omap_wdt.c: In function 'omap_wdt_remove': drivers/watchdog/omap_wdt.c:299:19: warning: unused variable 'res' [-Wunused-variable] Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi> Reviewed-by: Paul Walmsley <paul@pwsan.com> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
| * | watchdog: da9055: Don't update wdt_dev->timeout in da9055_wdt_set_timeout ↵Axel Lin2013-01-021-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | error path Otherwise, WDIOC_GETTIMEOUT returns wrong value if set_timeout fails. This patch also removes unnecessary ret variable in da9055_wdt_ping function. Signed-off-by: Axel Lin <axel.lin@ingics.com> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
| * | watchdog: da9055: Fix invalid free of devm_ allocated dataAxel Lin2013-01-021-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | It is not required to free devm_ allocated data. Since kref_put needs a valid release function, da9055_wdt_release_resources() is not deleted. Signed-off-by: Axel Lin <axel.lin@ingics.com> Signed-off-by: Wim Van Sebroeck <wim@iguana.be>
* | | Merge tag '3.8-pci-fixes' of ↵Linus Torvalds2013-01-026-55/+63
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: "Some fixes for v3.8. They include a fix for the new SR-IOV sysfs management support, an expanded quirk for Ricoh SD card readers, a Stratus DMI quirk fix, and a PME polling fix." * tag '3.8-pci-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: PCI: Reduce Ricoh 0xe822 SD card reader base clock frequency to 50MHz PCI/PM: Do not suspend port if any subordinate device needs PME polling PCI: Add PCIe Link Capability link speed and width names PCI: Work around Stratus ftServer broken PCIe hierarchy (fix DMI check) PCI: Remove spurious error for sriov_numvfs store and simplify flow
| * | | PCI: Reduce Ricoh 0xe822 SD card reader base clock frequency to 50MHzAndy Lutomirski2012-12-262-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Otherwise it fails like this on cards like the Transcend 16GB SDHC card: mmc0: new SDHC card at address b368 mmcblk0: mmc0:b368 SDC 15.0 GiB mmcblk0: error -110 sending status command, retrying mmcblk0: error -84 transferring data, sector 0, nr 8, cmd response 0x900, card status 0xb0 Tested on my Lenovo x200 laptop. [bhelgaas: changelog] Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Chris Ball <cjb@laptop.org> CC: Manoj Iyer <manoj.iyer@canonical.com> CC: stable@vger.kernel.org
| * | | PCI/PM: Do not suspend port if any subordinate device needs PME pollingHuang Ying2012-12-261-1/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ulrich reported that his USB3 cardreader does not work reliably when connected to the USB3 port. It turns out that USB3 controller failed to awaken when plugging in the USB3 cardreader. Further experiments found that the USB3 host controller can only be awakened via polling, not via PME interrupt. But if the PCIe port to which the USB3 host controller is connected is suspended, we cannot poll the controller because its config space is not accessible when the PCIe port is in a low power state. To solve the issue, the PCIe port will not be suspended if any subordinate device needs PME polling. [bhelgaas: use bool consistently rather than mixing int/bool] Reference: http://lkml.kernel.org/r/50841CCC.9030809@uli-eckhardt.de Reported-by: Ulrich Eckhardt <usb@uli-eckhardt.de> Tested-by: Sarah Sharp <sarah.a.sharp@linux.intel.com> Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> CC: stable@vger.kernel.org # v3.6+
| * | | PCI: Add PCIe Link Capability link speed and width namesBjorn Helgaas2012-12-261-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add standard #defines for the Supported Link Speeds field in the PCIe Link Capabilities register. Note that prior to PCIe spec r3.0, these encodings were defined: 0001b 2.5GT/s Link speed supported 0010b 5.0GT/s and 2.5GT/s Link speed supported Starting with spec r3.0, these encodings refer to bits 0 and 1 in the Supported Link Speeds Vector in the Link Capabilities 2 register, and bits 0 and 1 there mean 2.5 GT/s and 5.0 GT/s, respectively. Therefore, code that followed r2.0 and interpreted 0x1 as 2.5GT/s and 0x2 as 5.0GT/s will continue to work, and we can identify a device using the new encodings because it will have a non-zero Link Capabilities 2 register. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | PCI: Work around Stratus ftServer broken PCIe hierarchy (fix DMI check)Myron Stowe2012-12-261-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 284f5f9 was intended to disable the "only_one_child()" optimization on Stratus ftServer systems, but its DMI check is wrong. It looks for DMI_SYS_VENDOR that contains "ftServer", when it should look for DMI_SYS_VENDOR containing "Stratus" and DMI_PRODUCT_NAME containing "ftServer". Tested on Stratus ftServer 6400. Reported-by: Fadeeva Marina <astarta@rat.ru> Reference: https://bugzilla.kernel.org/show_bug.cgi?id=51331 Signed-off-by: Myron Stowe <myron.stowe@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> CC: stable@vger.kernel.org # v3.5+
| * | | PCI: Remove spurious error for sriov_numvfs store and simplify flowBjorn Helgaas2012-12-261-51/+34
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we request "num_vfs" and the driver's sriov_configure() method enables exactly that number ("num_vfs_enabled"), we complain "Invalid value for number of VFs to enable" and return an error. We should silently return success instead. Also, use kstrtou16() since numVFs is defined to be a 16-bit field and rework to simplify control flow. Reported-by: Greg Rose <gregory.v.rose@intel.com> Reference: http://lkml.kernel.org/r/20121214101911.00002f59@unknown Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Donald Dutile <ddutile@redhat.com>
* | | UAPI: Strip _UAPI prefix on header install no matter the whitespaceDavid Howells2013-01-021-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 56c176c9cac9 ("UAPI: strip the _UAPI prefix from header guards during header installation") strips the _UAPI prefix from header guards, but only if there's a single space between the cpp directive and the label. Make it more flexible and able to handle tabs and multiple white space characters. Signed-off-by: David Howells <dhowell@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | UAPI: Remove empty Kbuild filesDavid Howells2013-01-028-8/+0
| | | | | | | | | | | | | | | | | | | | | | | | Empty files can get deleted by the patch program, so remove empty Kbuild files and their links from the parent Kbuilds. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge tag 'ecryptfs-3.8-rc2-fixes' of ↵Linus Torvalds2013-01-023-6/+14
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs Pull ecryptfs fixes from Tyler Hicks: "Two self-explanatory fixes and a third patch which improves performance: when overwriting a full page in the eCryptfs page cache, skip reading in and decrypting the corresponding lower page." * tag 'ecryptfs-3.8-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs: fs/ecryptfs/crypto.c: make ecryptfs_encode_for_filename() static eCryptfs: fix to use list_for_each_entry_safe() when delete items eCryptfs: Avoid unnecessary disk read and data decryption during writing
| * | | fs/ecryptfs/crypto.c: make ecryptfs_encode_for_filename() staticCong Ding2012-12-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | the function ecryptfs_encode_for_filename() is only used in this file Signed-off-by: Cong Ding <dinggnu@gmail.com> Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
| * | | eCryptfs: fix to use list_for_each_entry_safe() when delete itemsWei Yongjun2012-12-181-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since we will be removing items off the list using list_del() we need to use a safer version of the list_for_each_entry() macro aptly named list_for_each_entry_safe(). We should use the safe macro if the loop involves deletions of items. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> [tyhicks: Fixed compiler err - missing list_for_each_entry_safe() param] Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
| * | | eCryptfs: Avoid unnecessary disk read and data decryption during writingLi Wang2012-11-071-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ecryptfs_write_begin grabs a page from page cache for writing. If the page contains invalid data, or data older than the counterpart on the disk, eCryptfs will read out the corresponing data from the disk into the page, decrypt them, then perform writing. However, for this page, if the length of the data to be written into is equal to page size, that means the whole page of data will be overwritten, in which case, it does not matter whatever the data were before, it is beneficial to perform writing directly rather than bothering to read and decrypt first. With this optimization, according to our test on a machine with Intel Core 2 Duo processor, iozone 'write' operation on an existing file with write size being multiple of page size will enjoy a steady 3x speedup. Signed-off-by: Li Wang <wangli@kylinos.com.cn> Signed-off-by: Yunchuan Wen <wenyunchuan@kylinos.com.cn> Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
* | | | Merge branch 'for-linus' of ↵Linus Torvalds2013-01-022-28/+29
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph fixes from Sage Weil: "Two of Alex's patches deal with a race when reseting server connections for open RBD images, one demotes some non-fatal BUGs to WARNs, and my patch fixes a protocol feature bit failure path." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: libceph: fix protocol feature mismatch failure path libceph: WARN, don't BUG on unexpected connection states libceph: always reset osds when kicking libceph: move linger requests sooner in kick_requests()
| * | | | libceph: fix protocol feature mismatch failure pathSage Weil2012-12-271-10/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We should not set con->state to CLOSED here; that happens in ceph_fault() in the caller, where it first asserts that the state is not yet CLOSED. Avoids a BUG when the features don't match. Since the fail_protocol() has become a trivial wrapper, replace calls to it with direct calls to reset_connection(). Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
| * | | | libceph: WARN, don't BUG on unexpected connection statesAlex Elder2012-12-271-6/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A number of assertions in the ceph messenger are implemented with BUG_ON(), killing the system if connection's state doesn't match what's expected. At this point our state model is (evidently) not well understood enough for these assertions to trigger a BUG(). Convert all BUG_ON(con->state...) calls to be WARN_ON(con->state...) so we learn about these issues without killing the machine. We now recognize that a connection fault can occur due to a socket closure at any time, regardless of the state of the connection. So there is really nothing we can assert about the state of the connection at that point so eliminate that assertion. Reported-by: Ugis <ugis22@gmail.com> Tested-by: Ugis <ugis22@gmail.com> Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | | | libceph: always reset osds when kickingAlex Elder2012-12-271-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When ceph_osdc_handle_map() is called to process a new osd map, kick_requests() is called to ensure all affected requests are updated if necessary to reflect changes in the osd map. This happens in two cases: whenever an incremental map update is processed; and when a full map update (or the last one if there is more than one) gets processed. In the former case, the kick_requests() call is followed immediately by a call to reset_changed_osds() to ensure any connections to osds affected by the map change are reset. But for full map updates this isn't done. Both cases should be doing this osd reset. Rather than duplicating the reset_changed_osds() call, move it into the end of kick_requests(). Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | | | libceph: move linger requests sooner in kick_requests()Alex Elder2012-12-271-11/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kick_requests() function is called by ceph_osdc_handle_map() when an osd map change has been indicated. Its purpose is to re-queue any request whose target osd is different from what it was when it was originally sent. It is structured as two loops, one for incomplete but registered requests, and a second for handling completed linger requests. As a special case, in the first loop if a request marked to linger has not yet completed, it is moved from the request list to the linger list. This is as a quick and dirty way to have the second loop handle sending the request along with all the other linger requests. Because of the way it's done now, however, this quick and dirty solution can result in these incomplete linger requests never getting re-sent as desired. The problem lies in the fact that the second loop only arranges for a linger request to be sent if it appears its target osd has changed. This is the proper handling for *completed* linger requests (it avoids issuing the same linger request twice to the same osd). But although the linger requests added to the list in the first loop may have been sent, they have not yet completed, so they need to be re-sent regardless of whether their target osd has changed. The first required fix is we need to avoid calling __map_request() on any incomplete linger request. Otherwise the subsequent __map_request() call in the second loop will find the target osd has not changed and will therefore not re-send the request. Second, we need to be sure that a sent but incomplete linger request gets re-sent. If the target osd is the same with the new osd map as it was when the request was originally sent, this won't happen. This can be fixed through careful handling when we move these requests from the request list to the linger list, by unregistering the request *before* it is registered as a linger request. This works because a side-effect of unregistering the request is to make the request's r_osd pointer be NULL, and *that* will ensure the second loop actually re-sends the linger request. Processing of such a request is done at that point, so continue with the next one once it's been moved. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* | | | | mm: mempolicy: Convert shared_policy mutex to spinlockMel Gorman2013-01-022-21/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sasha was fuzzing with trinity and reported the following problem: BUG: sleeping function called from invalid context at kernel/mutex.c:269 in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main 2 locks held by trinity-main/6361: #0: (&mm->mmap_sem){++++++}, at: [<ffffffff810aa314>] __do_page_fault+0x1e4/0x4f0 #1: (&(&mm->page_table_lock)->rlock){+.+...}, at: [<ffffffff8122f017>] handle_pte_fault+0x3f7/0x6a0 Pid: 6361, comm: trinity-main Tainted: G W 3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74 Call Trace: __might_sleep+0x1c3/0x1e0 mutex_lock_nested+0x29/0x50 mpol_shared_policy_lookup+0x2e/0x90 shmem_get_policy+0x2e/0x30 get_vma_policy+0x5a/0xa0 mpol_misplaced+0x41/0x1d0 handle_pte_fault+0x465/0x6a0 This was triggered by a different version of automatic NUMA balancing but in theory the current version is vunerable to the same problem. do_numa_page -> numa_migrate_prep -> mpol_misplaced -> get_vma_policy -> shmem_get_policy It's very unlikely this will happen as shared pages are not marked pte_numa -- see the page_mapcount() check in change_pte_range() -- but it is possible. To address this, this patch restores sp->lock as originally implemented by Kosaki Motohiro. In the path where get_vma_policy() is called, it should not be calling sp_alloc() so it is not necessary to treat the PTL specially. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | Merge tag 'ext4_for_linus' of ↵Linus Torvalds2013-01-029-58/+152
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bug fixes from Ted Ts'o: "Various bug fixes for ext4. Perhaps the most serious bug fixed is one which could cause file system corruptions when performing file punch operations." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: avoid hang when mounting non-journal filesystems with orphan list ext4: lock i_mutex when truncating orphan inodes ext4: do not try to write superblock on ro remount w/o journal ext4: include journal blocks in df overhead calcs ext4: remove unaligned AIO warning printk ext4: fix an incorrect comment about i_mutex ext4: fix deadlock in journal_unmap_buffer() ext4: split off ext4_journalled_invalidatepage() jbd2: fix assertion failure in jbd2_journal_flush() ext4: check dioread_nolock on remount ext4: fix extent tree corruption caused by hole punch
| * | | | | ext4: avoid hang when mounting non-journal filesystems with orphan listTheodore Ts'o2012-12-271-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When trying to mount a file system which does not contain a journal, but which does have a orphan list containing an inode which needs to be truncated, the mount call with hang forever in ext4_orphan_cleanup() because ext4_orphan_del() will return immediately without removing the inode from the orphan list, leading to an uninterruptible loop in kernel code which will busy out one of the CPU's on the system. This can be trivially reproduced by trying to mount the file system found in tests/f_orphan_extents_inode/image.gz from the e2fsprogs source tree. If a malicious user were to put this on a USB stick, and mount it on a Linux desktop which has automatic mounts enabled, this could be considered a potential denial of service attack. (Not a big deal in practice, but professional paranoids worry about such things, and have even been known to allocate CVE numbers for such problems.) Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Cc: stable@vger.kernel.org
| * | | | | ext4: lock i_mutex when truncating orphan inodesTheodore Ts'o2012-12-271-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit c278531d39 added a warning when ext4_flush_unwritten_io() is called without i_mutex being taken. It had previously not been taken during orphan cleanup since races weren't possible at that point in the mount process, but as a result of this c278531d39, we will now see a kernel WARN_ON in this case. Take the i_mutex in ext4_orphan_cleanup() to suppress this warning. Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Zheng Liu <wenqing.lz@taobao.com> Cc: stable@vger.kernel.org
| * | | | | ext4: do not try to write superblock on ro remount w/o journalMichael Tokarev2012-12-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a journal-less ext4 filesystem is mounted on a read-only block device (blockdev --setro will do), each remount (for other, unrelated, flags, like suid=>nosuid etc) results in a series of scary messages from kernel telling about I/O errors on the device. This is becauese of the following code ext4_remount(): if (sbi->s_journal == NULL) ext4_commit_super(sb, 1); at the end of remount procedure, which forces writing (flushing) of a superblock regardless whenever it is dirty or not, if the filesystem is readonly or not, and whenever the device itself is readonly or not. We only need call ext4_commit_super when the file system had been previously mounted read/write. Thanks to Eric Sandeen for help in diagnosing this issue. Signed-off-By: Michael Tokarev <mjt@tls.msk.ru> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
| * | | | | ext4: include journal blocks in df overhead calcsEric Sandeen2012-12-251-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To more accurately calculate overhead for "bsd" style df reporting, we should count the journal blocks as overhead as well. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Tested-by: Eric Whitney <enwlinux@gmail.com>
| * | | | | ext4: remove unaligned AIO warning printkEric Sandeen2012-12-251-8/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Although I put this in, I now think it was a bad decision. For most users, there is very little to be done in this case. They get the message, once per day, with no real context or proposed action. TBH, it generates support calls when it probably does not need to; the message sounds more dire than the situation really is. Just nuke it. Normal investigation via blktrace or whatnot can reveal poor IO patterns if bad performance is encountered. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * | | | | ext4: fix an incorrect comment about i_mutexAndy Lutomirski2012-12-251-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | i_mutex is not held when ->sync_file is called. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * | | | | ext4: fix deadlock in journal_unmap_buffer()Jan Kara2012-12-253-25/+86
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We cannot wait for transaction commit in journal_unmap_buffer() because we hold page lock which ranks below transaction start. We solve the issue by bailing out of journal_unmap_buffer() and jbd2_journal_invalidatepage() with -EBUSY. Caller is then responsible for waiting for transaction commit to finish and try invalidation again. Since the issue can happen only for page stradding i_size, it is simple enough to manually call jbd2_journal_invalidatepage() for such page from ext4_setattr(), check the return value and wait if necessary. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * | | | | ext4: split off ext4_journalled_invalidatepage()Jan Kara2012-12-252-8/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In data=journal mode we don't need delalloc or DIO handling in invalidatepage and similarly in other modes we don't need the journal handling. So split invalidatepage implementations. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * | | | | jbd2: fix assertion failure in jbd2_journal_flush()Jan Kara2012-12-211-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following race is possible between start_this_handle() and someone calling jbd2_journal_flush(). Process A Process B start_this_handle(). if (journal->j_barrier_count) # false if (!journal->j_running_transaction) { #true read_unlock(&journal->j_state_lock); jbd2_journal_lock_updates() jbd2_journal_flush() write_lock(&journal->j_state_lock); if (journal->j_running_transaction) { # false ... wait for committing trans ... write_unlock(&journal->j_state_lock); ... write_lock(&journal->j_state_lock); if (!journal->j_running_transaction) { # true jbd2_get_transaction(journal, new_transaction); write_unlock(&journal->j_state_lock); goto repeat; # eventually blocks on j_barrier_count > 0 ... J_ASSERT(!journal->j_running_transaction); # fails We fix the race by rechecking j_barrier_count after reacquiring j_state_lock in exclusive mode. Reported-by: yjwsignal@empal.com Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org