| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6
* 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6:
cpuidle/x86/perf: fix power:cpu_idle double end events and throw cpu_idle events from the cpuidle layer
intel_idle: open broadcast clock event
cpuidle: CPUIDLE_FLAG_CHECK_BM is omap3_idle specific
cpuidle: CPUIDLE_FLAG_TLB_FLUSHED is specific to intel_idle
cpuidle: delete unused CPUIDLE_FLAG_SHALLOW, BALANCED, DEEP definitions
SH, cpuidle: delete use of NOP CPUIDLE_FLAGS_SHALLOW
cpuidle: delete NOP CPUIDLE_FLAG_POLL
ACPI: processor_idle: delete use of NOP CPUIDLE_FLAGs
cpuidle: Rename X86 specific idle poll state[0] from C0 to POLL
ACPI, intel_idle: Cleanup idle= internal variables
cpuidle: Make cpuidle_enable_device() call poll_idle_init()
intel_idle: update Sandy Bridge core C-state residency targets
|
| |\ |
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
events from the cpuidle layer
Currently intel_idle and acpi_idle driver show double cpu_idle "exit idle"
events -> this patch fixes it and makes cpu_idle events throwing less complex.
It also introduces cpu_idle events for all architectures which use
the cpuidle subsystem, namely:
- arch/arm/mach-at91/cpuidle.c
- arch/arm/mach-davinci/cpuidle.c
- arch/arm/mach-kirkwood/cpuidle.c
- arch/arm/mach-omap2/cpuidle34xx.c
- arch/drivers/acpi/processor_idle.c (for all cases, not only mwait)
- arch/x86/kernel/process.c (did throw events before, but was a mess)
- drivers/idle/intel_idle.c (did throw events before)
Convention should be:
Fire cpu_idle events inside the current pm_idle function (not somewhere
down the the callee tree) to keep things easy.
Current possible pm_idle functions in X86:
c1e_idle, poll_idle, cpuidle_idle_call, mwait_idle, default_idle
-> this is really easy is now.
This affects userspace:
The type field of the cpu_idle power event can now direclty get
mapped to:
/sys/devices/system/cpu/cpuX/cpuidle/stateX/{name,desc,usage,time,...}
instead of throwing very CPU/mwait specific values.
This change is not visible for the intel_idle driver.
For the acpi_idle driver it should only be visible if the vendor
misses out C-states in his BIOS.
Another (perf timechart) patch reads out cpuidle info of cpu_idle
events from:
/sys/.../cpuidle/stateX/*, then the cpuidle events are mapped
to the correct C-/cpuidle state again, even if e.g. vendors miss
out C-states in their BIOS and for example only export C1 and C3.
-> everything is fine.
Signed-off-by: Thomas Renninger <trenn@suse.de>
CC: Robert Schoene <robert.schoene@tu-dresden.de>
CC: Jean Pihet <j-pihet@ti.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: linux-pm@lists.linux-foundation.org
CC: linux-acpi@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-perf-users@vger.kernel.org
CC: linux-omap@vger.kernel.org
Signed-off-by: Len Brown <len.brown@intel.com>
|
| |\| |
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Intel_idle driver uses CLOCK_EVT_NOTIFY_BROADCAST_ENTER
CLOCK_EVT_NOTIFY_BROADCAST_EXIT
for broadcast clock events. The _ENTER/_EXIT doesn't really open broadcast clock
events, please see processor_idle.c for an example. In some situation, this will
cause boot hang, because some CPUs enters idle but local APIC timer stalls.
Reported-and-tested-by: Yan Zheng <zheng.z.yan@intel.com>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
cc: stable@kernel.org
Signed-off-by: Len Brown <len.brown@intel.com>
|
| | |
| | |
| | |
| | | |
Signed-off-by: Len Brown <len.brown@intel.com>
|
| | |
| | |
| | |
| | | |
Signed-off-by: Len Brown <len.brown@intel.com>
|
| | |
| | |
| | |
| | | |
Signed-off-by: Len Brown <len.brown@intel.com>
|
| | |
| | |
| | |
| | |
| | |
| | | |
set but not checked.
Signed-off-by: Len Brown <len.brown@intel.com>
|
| | |
| | |
| | |
| | |
| | |
| | | |
it serves no purpose
Signed-off-by: Len Brown <len.brown@intel.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
CPUIDLE_FLAG_SHALLOW
CPUIDLE_FLAG_BALANCED
CPUIDLE_FLAG_DEEP
CPUIDLE_FLAG_CHECK_BM
were set by acpi_processor_setup_cpuidle(),
but never used by cpuidle or by acpi_idle.
So stop setting them.
Signed-off-by: Len Brown <len.brown@intel.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
C0 means and is well know as "not idle".
All documentation out there uses this term as "running"/"not idle"
state. Also Linux userspace tools (e.g. cpufreq-aperf and turbostat)
show C0 residency which there is correct, but means something totally
else than cpuidle "POLL" state.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Len Brown <len.brown@intel.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Having four variables for the same thing:
idle_halt, idle_nomwait, force_mwait and boot_option_idle_overrides
is rather confusing and unnecessary complex.
if idle= boot param is passed, only set up one variable:
boot_option_idle_overrides
Introduces following functional changes/fixes:
- intel_idle driver does not register if any idle=xy
boot param is passed.
- processor_idle.c will also not register a cpuidle driver
and get active if idle=halt is passed.
Before a cpuidle driver with one (C1, halt) state got registered
Now the default_idle function will be used which finally uses
the same idle call to enter sleep state (safe_halt()), but
without registering a whole cpuidle driver.
That means idle= param will always avoid cpuidle drivers to register
with one exception (same behavior as before):
idle=nomwait
may still register acpi_idle cpuidle driver, but C1 will not use
mwait, but hlt. This can be a workaround for IO based deeper sleep
states where C1 mwait causes problems.
Signed-off-by: Thomas Renninger <trenn@suse.de>
cc: x86@kernel.org
Signed-off-by: Len Brown <len.brown@intel.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The following scenario is possible with the current cpuidle code and
the ACPI cpuidle driver:
(1) acpi_processor_cst_has_changed() is called,
(2) cpuidle_disable_device() is called,
(3) cpuidle_remove_state_sysfs() is called to remove the (presumably
outdated) states info from sysfs,
(3) acpi_processor_get_power_info() is called, the first entry in the
pr->power.states[] table is filled with zeros,
(4) acpi_processor_setup_cpuidle() is called and it doesn't fill the
first entry in pr->power.states[],
(5) cpuidle_enable_device() is called,
(6) __cpuidle_register_device() is _not_ called, since the device has
already been registered,
(7) Consequently, poll_idle_init() is _not_ called either,
(8) cpuidle_add_state_sysfs() is called to create the sysfs attributes
for the new states and it uses the bogus first table entry from
acpi_processor_get_power_info() for creating state0.
This problem is avoided if cpuidle_enable_device()
unconditionally calls poll_idle_init().
Reported-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
cc: stable@kernel.org
|
| | |
| | |
| | |
| | | |
Signed-off-by: Len Brown <len.brown@intel.com>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-sfi-2.6
* 'sfi-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-sfi-2.6:
SFI: use ioremap_cache() instead of ioremap()
|
| |/ /
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We copied ACPI's oversight of using ioremap() and creating
non-cached table mappings when we should have been using
ioremap_cache().
Signed-off-by: Len Brown <len.brown@intel.com>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin
* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
fs: fix do_last error case when need_reval_dot
nfs: add missing rcu-walk check
fs: hlist UP debug fixup
fs: fix dropping of rcu-walk from force_reval_path
fs: force_reval_path drop rcu-walk before d_invalidate
fs: small rcu-walk documentation fixes
Fixed up trivial conflicts in Documentation/filesystems/porting
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
When open(2) without O_DIRECTORY opens an existing dir, it should return
EISDIR. In do_last(), the variable 'error' is initialized EISDIR, but it
is changed by d_revalidate() which returns any positive to represent
'the target dir is valid.'
Should we keep and return the initialized 'error' in this case.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
| | | |
| | | |
| | | |
| | | | |
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Po-Yu Chuang <ratbert.chuang@gmail.com> noticed that hlist_bl_set_first could
crash on a UP system when LIST_BL_LOCKMASK is 0, because
LIST_BL_BUG_ON(!((unsigned long)h->first & LIST_BL_LOCKMASK));
always evaulates to true.
Fix the expression, and also avoid a dependency between bit spinlock
implementation and list bl code (list code shouldn't know anything
except that bit 0 is set when adding and removing elements). Eventually
if a good use case comes up, we might use this list to store 1 or more
arbitrary bits of data, so it really shouldn't be tied to locking either,
but for now they are helpful for debugging.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
As J. R. Okajima noted, force_reval_path passes in the same dentry to
d_revalidate as the one in the nameidata structure (other callers pass in a
child), so the locking breaks. This can oops with a chrooted nfs mount, for
example. Similarly there can be other problems with revalidating a dentry
which is already in nameidata of the path walk.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
d_revalidate can return in rcu-walk mode even when it returns 0. We can't just
call any old dcache function on rcu-walk dentry (the dentry is unstable, so
even through d_lock can safely be taken, the result may no longer be what we
expect -- careful re-checks would be required). So just drop rcu in this case.
(I missed this conversion when switching to the rcu-walk convention that Linus
suggested)
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
| | | |
| | | |
| | | |
| | | | |
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
* 'stable/gntdev' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/p2m: Fix module linking error.
xen p2m: clear the old pte when adding a page to m2p_override
xen gntdev: use gnttab_map_refs and gnttab_unmap_refs
xen: introduce gnttab_map_refs and gnttab_unmap_refs
xen p2m: transparently change the p2m mappings in the m2p override
xen/gntdev: Fix circular locking dependency
xen/gntdev: stop using "token" argument
xen: gntdev: move use of GNTMAP_contains_pte next to the map_op
xen: add m2p override mechanism
xen: move p2m handling to separate file
xen/gntdev: add VM_PFNMAP to vma
xen/gntdev: allow usermode to map granted pages
xen: define gnttab_set_map_op/unmap_op
Fix up trivial conflict in drivers/xen/Kconfig
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Fixes:
ERROR: "m2p_find_override_pfn" [drivers/block/xen-blkfront.ko] undefined!
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
When adding a page to m2p_override we change the p2m of the page so we
need to also clear the old pte of the kernel linear mapping because it
doesn't correspond anymore.
When we remove the page from m2p_override we restore the original p2m of
the page and we also restore the old pte of the kernel linear mapping.
Before changing the p2m mappings in m2p_add_override and
m2p_remove_override, check that the page passed as argument is valid and
return an error if it is not.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Use gnttab_map_refs and gnttab_unmap_refs to map and unmap the grant
ref, so that we can have a corresponding struct page.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
gnttab_map_refs maps some grant refs and uses the new m2p override to
set a proper m2p mapping for the granted pages.
gnttab_unmap_refs unmaps the granted refs and removes th mappings from
the m2p override.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
In m2p_add_override store the original mfn into page->index and then
change the p2m mapping, setting mfns as FOREIGN_FRAME.
In m2p_remove_override restore the original mapping.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
apply_to_page_range will acquire PTE lock while priv->lock is held,
and mn_invl_range_start tries to acquire priv->lock with PTE already
held. Fix by not holding priv->lock during the entire map operation.
This is safe because map->vma is set nonzero while the lock is held,
which will cause subsequent maps to fail and will cause the unmap
ioctl (and other users of gntdev_del_map) to return -EBUSY until the
area is unmapped. It is similarly impossible for gntdev_vma_close to
be called while the vma is still being created.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
It's the struct page of the L1 pte page. But we can get its mfn
by simply doing an arbitrary_virt_to_machine() on it anyway (which is
the safe conservative choice; since we no longer allow HIGHPTE pages,
we would never expect to be operating on a mapped pte page).
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
This flag controls the meaning of gnttab_map_grant_ref.host_addr and
specifies that the field contains a reference to the pte entry to be
used to perform the mapping. Therefore move the use of this flag to
the point at which we actually use a reference to the pte instead of
something else, splitting up the usage of the flag in this way is
confusing and potentially error prone.
The other flags are all properties of the mapping itself as opposed to
properties of the hypercall arguments and therefore it make sense to
continue to pass them round in map->flags.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>
Cc: Derek G. Murray <Derek.Murray@cl.cam.ac.uk>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Add a simple hashtable based mechanism to override some portions of the
m2p, so that we can find out the pfn corresponding to an mfn of a
granted page. In fact entries corresponding to granted pages in the m2p
hold the original pfn value of the page in the source domain that
granted it.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
These pages are from other domains, so don't have any local PFN.
VM_PFNMAP is the closest concept Linux has to this.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
The gntdev driver allows usermode to map granted pages from other
domains. This is typically used to implement a Xen backend driver
in user mode.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
| | |/ /
| |/| |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Impact: hypercall definitions
These functions populate the gnttab data structures used by the
granttab map and unmap ops and are used in the backend drivers.
Originally xen-unstable.hg 9625:c3bb51c443a7
[ Include Stefano's fix for phys_addr_t ]
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
* 'stable/platform-pci-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen-platform: Fix compile errors if CONFIG_PCI is not enabled.
xen: rename platform-pci module to xen-platform-pci.
xen-platform: use PCI interfaces to request IO and MEM resources.
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
drivers/xen/platform-pci.c:127: error: implicit declaration of function
'pci_request_region'
drivers/xen/platform-pci.c:165: error: implicit declaration of function
'pci_release_region'
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
platform-pci is rather generic for a modular distro style kernel.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
|
| |/ / /
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This is the correct interface to use and something has broken the use
of the previous incorrect interface (which fails because the request
conflicts with the resources assigned for the PCI device itself
instead of nesting like the PCI interfaces do).
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: stable@kernel.org # 2.6.37 only
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
In the current implementation mem_cgroup_end_migration() decides whether
the page migration has succeeded or not by checking "oldpage->mapping".
But if we are tring to migrate a shmem swapcache, the page->mapping of it
is NULL from the begining, so the check would be invalid. As a result,
mem_cgroup_end_migration() assumes the migration has succeeded even if
it's not, so "newpage" would be freed while it's not uncharged.
This patch fixes it by passing mem_cgroup_end_migration() the result of
the page migration.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
In mem_cgroup_alloc() we currently do either kmalloc() or vmalloc() then
followed by memset() to zero the memory. This can be more efficiently
achieved by using kzalloc() and vzalloc(). There's also one situation
where we can use kzalloc_node() - this is what's new in this version of
the patch.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Commit b1dd693e ("memcg: avoid deadlock between move charge and
try_charge()") can cause another deadlock about mmap_sem on task migration
if cpuset and memcg are mounted onto the same mount point.
After the commit, cgroup_attach_task() has sequence like:
cgroup_attach_task()
ss->can_attach()
cpuset_can_attach()
mem_cgroup_can_attach()
down_read(&mmap_sem) (1)
ss->attach()
cpuset_attach()
mpol_rebind_mm()
down_write(&mmap_sem) (2)
up_write(&mmap_sem)
cpuset_migrate_mm()
do_migrate_pages()
down_read(&mmap_sem)
up_read(&mmap_sem)
mem_cgroup_move_task()
mem_cgroup_clear_mc()
up_read(&mmap_sem)
We can cause deadlock at (2) because we've already aquire the mmap_sem at (1).
But the commit itself is necessary to fix deadlocks which have existed
before the commit like:
Ex.1)
move charge | try charge
--------------------------------------+------------------------------
mem_cgroup_can_attach() | down_write(&mmap_sem)
mc.moving_task = current | ..
mem_cgroup_precharge_mc() | __mem_cgroup_try_charge()
mem_cgroup_count_precharge() | prepare_to_wait()
down_read(&mmap_sem) | if (mc.moving_task)
-> cannot aquire the lock | -> true
| schedule()
| -> move charge should wake it up
Ex.2)
move charge | try charge
--------------------------------------+------------------------------
mem_cgroup_can_attach() |
mc.moving_task = current |
mem_cgroup_precharge_mc() |
mem_cgroup_count_precharge() |
down_read(&mmap_sem) |
.. |
up_read(&mmap_sem) |
| down_write(&mmap_sem)
mem_cgroup_move_task() | ..
mem_cgroup_move_charge() | __mem_cgroup_try_charge()
down_read(&mmap_sem) | prepare_to_wait()
-> cannot aquire the lock | if (mc.moving_task)
| -> true
| schedule()
| -> move charge should wake it up
This patch fixes all of these problems by:
1. revert the commit.
2. To fix the Ex.1, we set mc.moving_task after mem_cgroup_count_precharge()
has released the mmap_sem.
3. To fix the Ex.2, we use down_read_trylock() instead of down_read() in
mem_cgroup_move_charge() and, if it has failed to aquire the lock, cancel
all extra charges, wake up all waiters, and retry trylock.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reported-by: Ben Blum <bblum@andrew.cmu.edu>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Paul Menage <menage@google.com>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Adding the number of swap pages to the byte limit of a memory control
group makes no sense. Convert the pages to bytes before adding them.
The only user of this code is the OOM killer, and the way it is used means
that the error results in a higher OOM badness value. Since the cgroup
limit is the same for all tasks in the cgroup, the error should have no
practical impact at the moment.
But let's not wait for future or changing users to trip over it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize the page
accounting and migration code. This reworks the locking scheme of
_update_stat() and _move_account() by adding new lock bit PCG_MOVE_LOCK,
which is always taken under IRQ disable.
1. If pages are being migrated from a memcg, then updates to that
memcg page statistics are protected by grabbing PCG_MOVE_LOCK using
move_lock_page_cgroup(). In an upcoming commit, memcg dirty page
accounting will be updating memcg page accounting (specifically: num
writeback pages) from IRQ context (softirq). Avoid a deadlocking
nested spin lock attempt by disabling irq on the local processor when
grabbing the PCG_MOVE_LOCK.
2. lock for update_page_stat is used only for avoiding race with
move_account(). So, IRQ awareness of lock_page_cgroup() itself is not
a problem. The problem is between mem_cgroup_update_page_stat() and
mem_cgroup_move_account_page().
Trade-off:
* Changing lock_page_cgroup() to always disable IRQ (or
local_bh) has some impacts on performance and I think
it's bad to disable IRQ when it's not necessary.
* adding a new lock makes move_account() slower. Score is
here.
Performance Impact: moving a 8G anon process.
Before:
real 0m0.792s
user 0m0.000s
sys 0m0.780s
After:
real 0m0.854s
user 0m0.000s
sys 0m0.842s
This score is bad but planned patches for optimization can reduce
this impact.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Andrea Righi <arighi@develer.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Replace usage of the mem_cgroup_update_file_mapped() memcg
statistic update routine with two new routines:
* mem_cgroup_inc_page_stat()
* mem_cgroup_dec_page_stat()
As before, only the file_mapped statistic is managed. However, these more
general interfaces allow for new statistics to be more easily added. New
statistics are added with memcg dirty page accounting.
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Document cgroup dirty memory interfaces and statistics.
[akpm@linux-foundation.org: fix use_hierarchy description]
Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|