| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
"Several notable changes this cycle:
- Thread mode was merged. This will be used for cgroup2 support for
CPU and possibly other controllers. Unfortunately, CPU controller
cgroup2 support didn't make this pull request but most contentions
have been resolved and the support is likely to be merged before
the next merge window.
- cgroup.stat now shows the number of descendant cgroups.
- cpuset now can enable the easier-to-configure v2 behavior on v1
hierarchy"
* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
cpuset: Allow v2 behavior in v1 cgroup
cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup
cgroup: remove unneeded checks
cgroup: misc changes
cgroup: short-circuit cset_cgroup_from_root() on the default hierarchy
cgroup: re-use the parent pointer in cgroup_destroy_locked()
cgroup: add cgroup.stat interface with basic hierarchy stats
cgroup: implement hierarchy limits
cgroup: keep track of number of descent cgroups
cgroup: add comment to cgroup_enable_threaded()
cgroup: remove unnecessary empty check when enabling threaded mode
cgroup: update debug controller to print out thread mode information
cgroup: implement cgroup v2 thread support
cgroup: implement CSS_TASK_ITER_THREADED
cgroup: introduce cgroup->dom_cgrp and threaded css_set handling
cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS
cgroup: reorganize cgroup.procs / task write path
cgroup: replace css_set walking populated test with testing cgrp->nr_populated_csets
cgroup: distinguish local and children populated states
cgroup: remove now unused list_head @pending in cgroup_apply_cftypes()
...
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
css_task_iter currently always walks all tasks. With the scheduled
cgroup v2 thread support, the iterator would need to handle multiple
types of iteration. As a preparation, add @flags to
css_task_iter_start() and implement CSS_TASK_ITER_PROCS. If the flag
is not specified, it walks all tasks as before. When asserted, the
iterator only walks the group leaders.
For now, the only user of the flag is cgroup v2 "cgroup.procs" file
which no longer needs to skip non-leader tasks in cgroup_procs_next().
Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
cgroup" but "list all thread group id's with any threads in the
cgroup".
While at it, update cgroup_procs_show() to use task_pid_vnr() instead
of task_tgid_vnr(). As the iteration guarantees that the function
only sees group leaders, this doesn't change the output and will allow
sharing the function for thread iteration.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu updates from Tejun Heo:
"A lot of changes for percpu this time around. percpu inherited the
same area allocator from the original pre-virtual-address-mapped
implementation. This was from the time when percpu allocator wasn't
used all that much and the implementation was focused on simplicity,
with the unfortunate computational complexity of O(number of areas
allocated from the chunk) per alloc / free.
With the increase in percpu usage, we're hitting cases where the lack
of scalability is hurting. The most prominent one right now is bpf
perpcu map creation / destruction which may allocate and free a lot of
entries consecutively and it's likely that the problem will become
more prominent in the future.
To address the issue, Dennis replaced the area allocator with hinted
bitmap allocator which is more consistent. While the new allocator
does perform a bit worse in some cases, it outperforms the old
allocator way more than an order of magnitude in other more common
scenarios while staying mostly flat in CPU overhead and completely
flat in memory consumption"
* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (27 commits)
percpu: update header to contain bitmap allocator explanation.
percpu: update pcpu_find_block_fit to use an iterator
percpu: use metadata blocks to update the chunk contig hint
percpu: update free path to take advantage of contig hints
percpu: update alloc path to only scan if contig hints are broken
percpu: keep track of the best offset for contig hints
percpu: skip chunks if the alloc does not fit in the contig hint
percpu: add first_bit to keep track of the first free in the bitmap
percpu: introduce bitmap metadata blocks
percpu: replace area map allocator with bitmap
percpu: generalize bitmap (un)populated iterators
percpu: increase minimum percpu allocation size and align first regions
percpu: introduce nr_empty_pop_pages to help empty page accounting
percpu: change the number of pages marked in the first_chunk pop bitmap
percpu: combine percpu address checks
percpu: modify base_addr to be region specific
percpu: setup_first_chunk rename schunk/dchunk to chunk
percpu: end chunk area maps page aligned for the populated bitmap
percpu: unify allocation of schunk and dchunk
percpu: setup_first_chunk remove dyn_size and consolidate logic
...
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The other patches contain a lot of information, so adding this
information in a separate patch. It adds my copyright and a brief
explanation of how the bitmap allocator works. There is a minor typo as
well in the prior explanation so that is fixed.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The simple, and expensive, way to find a free area is to iterate over
the entire bitmap until an area is found that fits the allocation size
and alignment. This patch makes use of an iterate that find an area to
check by using the block level contig hints. It will only return an area
that can fit the size and alignment request. If the request can fit
inside a block, it returns the first_free bit to start checking from to
see if it can be fulfilled prior to the contig hint. The pcpu_alloc_area
check has a bound of a block size added in case it is wrong.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The largest free region will either be a block level contig hint or an
aggregate over the left_free and right_free areas of blocks. This is a
much smaller set of free areas that need to be checked than a full
traverse.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The bitmap allocator must keep metadata consistent. The easiest way is
to scan after every allocation for each affected block and the entire
chunk. This is rather expensive.
The free path can take advantage of current contig hints to prevent
scanning within the start and end block. If a scan is needed, it can
be done by scanning backwards from the start and forwards from the end
to identify the entire free area this can be combined with. The blocks
can then be updated by some basic checks rather than complete block
scans.
A chunk scan happens when the freed area makes a page free, a block
free, or spans across blocks. This is necessary as the contig hint at
this point could span across blocks. The check uses the minimum of page
size and the block size to allow for variable sized blocks. There is a
tradeoff here with not updating after every free. It is possible a
contig hint in one block can be merged with the contig hint in the next
block. This means the contig hint can be off by up to a page. However,
if the chunk's contig hint is contained in one block, the contig hint
will be accurate.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Metadata is kept per block to keep track of where the contig hints are.
Scanning can be avoided when the contig hints are not broken. In that
case, left and right contigs have to be managed manually.
This patch changes the allocation path hint updating to only scan when
contig hints are broken.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch makes the contig hint starting offset optimization from the
previous patch as honest as it can be. For both chunk and block starting
offsets, make sure it keeps the starting offset with the best alignment.
The block skip optimization is added in a later patch when the
pcpu_find_block_fit iterator is swapped in.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch adds chunk->contig_bits_start to keep track of the contig
hint's offset and the check to skip the chunk if it does not fit. If
the chunk's contig hint starting offset cannot satisfy an allocation,
the allocator assumes there is enough memory pressure in this chunk to
either use a different chunk or create a new one. This accepts a less
tight packing for a smoother latency curve.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch adds first_bit to keep track of the first free bit in the
bitmap. This hint helps prevent scanning of fully allocated blocks.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch introduces the bitmap metadata blocks and adds the skeleton
of the code that will be used to maintain these blocks. Each chunk's
bitmap is made up of full metadata blocks. These blocks maintain basic
metadata to help prevent scanning unnecssarily to update hints. Full
scanning methods are used for the skeleton and will be replaced in the
coming patches. A number of helper functions are added as well to do
conversion of pages to blocks and manage offsets. Comments will be
updated as the final version of each function is added.
There exists a relationship between PAGE_SIZE, PCPU_BITMAP_BLOCK_SIZE,
the region size, and unit_size. Every chunk's region (including offsets)
is page aligned at the beginning to preserve alignment. The end is
aligned to LCM(PAGE_SIZE, PCPU_BITMAP_BLOCK_SIZE) to ensure that the end
can fit with the populated page map which is by page and every metadata
block is fully accounted for. The unit_size is already page aligned, but
must also be aligned with PCPU_BITMAP_BLOCK_SIZE to ensure full metadata
blocks.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.
This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.
Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.
In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.
Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 310 | 4770
16B | 557 | 1325
64B | 436 | 273
256B | 776 | 131
1024B | 3280 | 122
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 490 | 70
16B | 515 | 75
64B | 610 | 80
256B | 950 | 100
1024B | 3520 | 200
This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.
While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:
Area Map Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 4118 | 4892
16B | 1651 | 1163
64B | 598 | 285
256B | 771 | 158
1024B | 3034 | 160
Bitmap Allocator:
Object Size | Alloc Time (ms) | Free Time (ms)
----------------------------------------------
4B | 481 | 67
16B | 506 | 69
64B | 636 | 75
256B | 892 | 90
1024B | 3262 | 147
The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.
The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.
V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.
Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The area map allocator only used a bitmap for the backing page state.
The new bitmap allocator will use bitmaps to manage the allocation
region in addition to this.
This patch generalizes the bitmap iterators so they can be reused with
the bitmap allocator.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch increases the minimum allocation size of percpu memory to
4-bytes. This change will help minimize the metadata overhead
associated with the bitmap allocator. The assumption is that most
allocations will be of objects or structs greater than 2 bytes with
integers or longs being used rather than shorts.
The first chunk regions are now aligned with the minimum allocation
size. The reserved region is expected to be set as a multiple of the
minimum allocation size. The static region is aligned up and the delta
is removed from the dynamic size. This works because the dynamic size is
increased to be page aligned. If the static size is not minimum
allocation size aligned, then there must be a gap that is added to the
dynamic size. The dynamic size will never be smaller than the set value.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
pcpu_nr_empty_pop_pages is used to ensure there are a handful of free
pages around to serve atomic allocations. A new field, nr_empty_pop_pages,
is added to the pcpu_chunk struct to keep track of the number of empty
pages. This field is needed as the number of empty populated pages is
globally tracked and deltas are used to update in the bitmap allocator.
Pages that contain a hidden area are not considered to be empty. This
new field is exposed in percpu_stats.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The populated bitmap represents the state of the pages the chunk serves.
Prior, the bitmap was marked completely used as the first chunk was
allocated and immutable. This is misleading because the first chunk may
not be completely filled. Additionally, with moving the base_addr up in
the previous patch, the population check no longer corresponds to what
was being checked.
This patch modifies the population map to be only the number of pages
the region serves and to make what it was checking correspond correctly
again. The change is to remove any misunderstanding between the size of
the populated bitmap and the actual size of it. The work function page
iterators now use nr_pages for the check rather than pcpu_unit_pages
because nr_populated is now chunk specific. Without this, the work
function would try to populate the remainder of these chunks despite it
not serving any more than nr_pages when nr_pages is set less than
pcpu_unit_pages.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The percpu address checks for the reserved and dynamic region chunks are
now specific to each region. The address checking logic can be combined
taking advantage of the global references to the dynamic and static
region chunks.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Originally, the first chunk was served by one or two chunks, each
given a region they are responsible for. Despite this, the arithmetic
was based off of the true base_addr of the chunk making it be overly
inclusive.
This patch moves the base_addr of chunks that are responsible for the
first chunk. The base_addr must remain page aligned to keep the
address alignment correct, so it is the beginning of the region served
page aligned down. start_offset holds where the region served begins
from this new base_addr.
The corresponding percpu address checks are modified to be more specific
as a result. The first chunk considers only the dynamic region and both
first chunk and reserved chunk checks ignore the static region. The
static region addresses should never be passed into the allocator. There
is no impact here besides distinguishing the first chunk and making the
checks specific.
The percpu pointer to physical address is left intact as addresses are
not given out in the non-allocated portion of percpu memory.
nr_pages is added to pcpu_chunk to keep track of the size of the entire
region served containing both start_offset and end_offset. This variable
will be used to manage the bitmap allocator.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
There is no need to have the static chunk and dynamic chunk be named
separately as the allocations are sequential. This preemptively solves
the misnomer problem with the base_addrs being moved up in the following
patch. It also removes a ternary operation deciding the first chunk.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The area map allocator manages the first chunk area by hiding all but
the region it is responsible for serving in the area map. To align this
with the populated page bitmap, end_offset is introduced to keep track
of the delta to end page aligned. The area map is appended with the
page aligned end when necessary to be in line with how the bitmap
allocator requires the ending to be aligned with the LCM of PAGE_SIZE
and the size of each bitmap block. percpu_stats is updated to ignore
this region when present.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Create a common allocator for first chunk initialization,
pcpu_alloc_first_chunk. Comments for this function will be added in a
later patch once the bitmap allocator is added.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
There is logic for setting variables in the static chunk init code that
could be consolidated with the dynamic chunk init code. This combines
this logic to setup for combining the allocation paths. reserved_size is
used as the conditional as a dynamic region will always exist.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Prior this variable was used to manage statistics when the first chunk
had a reserved region. The previous patch introduced start_offset to
keep track of the offset by value rather than boolean. Therefore,
has_reserved can be removed.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The reserved chunk arithmetic uses a global variable
pcpu_reserved_chunk_limit that is set in the first chunk init code to
hide a portion of the area map. The bitmap allocator to come will
eventually move the base_addr up and require both the reserved chunk
and static chunk to maintain this offset. pcpu_reserved_chunk_limit is
removed and start_offset is added.
The first chunk that is circulated and is pcpu_first_chunk serves the
dynamic region, the region following the reserved region. The reserved
chunk address check will temporarily use the first chunk to identify its
address range. A following patch will increase the base_addr and remove
this. If there is no reserved chunk, this will check the static region
and return false because those values should never be passed into the
allocator.
Lastly, when linking in the first chunk, make sure to count the right
free region for the number of empty populated pages.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The first chunk is handled as a special case as it is composed of the
static, reserved, and dynamic regions. The code handles each case
individually. The next several patches will merge these code paths and
lay the foundation for the bitmap allocator.
This patch modifies logic to enforce that a dynamic region exists and
changes the area map to account for that. This brings the logic closer
to the dynamic chunk's init logic.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The header comment for percpu memory is a little hard to parse and is
not super clear about how the first chunk is managed. This adds a
little more clarity to the situation.
There is also quite a bit of tricky logic in the pcpu_build_alloc_info.
This adds a restructure of a comment to add a little more information.
Unfortunately, you will still have to piece together a handful of other
comments too, but should help direct you to the meaningful comments.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Percpu memory holds a minimum threshold of pages that are populated
in order to serve atomic percpu memory requests. This change makes it
easier to verify that there are a minimum number of populated pages
lying around.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This makes the debugfs output for percpu_stats a little easier
to read by changing the spacing of the output to be consistent.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
| |/
| |
| |
| |
| |
| |
| | |
Changes the use of a void buffer to an int buffer for clarity.
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Merge updates from Andrew Morton:
- various misc bits
- DAX updates
- OCFS2
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (119 commits)
mm,fork: introduce MADV_WIPEONFORK
x86,mpx: make mpx depend on x86-64 to free up VMA flag
mm: add /proc/pid/smaps_rollup
mm: hugetlb: clear target sub-page last when clearing huge page
mm: oom: let oom_reap_task and exit_mmap run concurrently
swap: choose swap device according to numa node
mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
mm, oom: do not rely on TIF_MEMDIE for memory reserves access
z3fold: use per-cpu unbuddied lists
mm, swap: don't use VMA based swap readahead if HDD is used as swap
mm, swap: add sysfs interface for VMA based swap readahead
mm, swap: VMA based swap readahead
mm, swap: fix swap readahead marking
mm, swap: add swap readahead hit statistics
mm/vmalloc.c: don't reinvent the wheel but use existing llist API
mm/vmstat.c: fix wrong comment
selftests/memfd: add memfd_create hugetlbfs selftest
mm/shmem: add hugetlbfs support to memfd_create()
mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
...
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
in the child process after fork. This differs from MADV_DONTFORK in one
important way.
If a child process accesses memory that was MADV_WIPEONFORK, it will get
zeroes. The address ranges are still valid, they are just empty.
If a child process accesses memory that was MADV_DONTFORK, it will get a
segmentation fault, since those address ranges are no longer valid in
the child after fork.
Since MADV_DONTFORK also seems to be used to allow very large programs
to fork in systems with strict memory overcommit restrictions, changing
the semantics of MADV_DONTFORK might break existing programs.
MADV_WIPEONFORK only works on private, anonymous VMAs.
The use case is libraries that store or cache information, and want to
know that they need to regenerate it in the child process after fork.
Examples of this would be:
- systemd/pulseaudio API checks (fail after fork) (replacing a getpid
check, which is too slow without a PID cache)
- PKCS#11 API reinitialization check (mandated by specification)
- glibc's upcoming PRNG (reseed after fork)
- OpenSSL PRNG (reseed after fork)
The security benefits of a forking server having a re-inialized PRNG in
every child process are pretty obvious. However, due to libraries
having all kinds of internal state, and programs getting compiled with
many different versions of each library, it is unreasonable to expect
calling programs to re-initialize everything manually after fork.
A further complication is the proliferation of clone flags, programs
bypassing glibc's functions to call clone directly, and programs calling
unshare, causing the glibc pthread_atfork hook to not get called.
It would be better to have the kernel take care of this automatically.
The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
MADV_WIPEONFORK.
This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
https://man.openbsd.org/minherit.2
[akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Florian Weimer <fweimer@redhat.com>
Reported-by: Colm MacCártaigh <colm@allcosts.net>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Huge page helps to reduce TLB miss rate, but it has higher cache
footprint, sometimes this may cause some issue. For example, when
clearing huge page on x86_64 platform, the cache footprint is 2M. But
on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M
LLC (last level cache). That is, in average, there are 2.5M LLC for
each core and 1.25M LLC for each thread.
If the cache pressure is heavy when clearing the huge page, and we clear
the huge page from the begin to the end, it is possible that the begin
of huge page is evicted from the cache after we finishing clearing the
end of the huge page. And it is possible for the application to access
the begin of the huge page after clearing the huge page.
To help the above situation, in this patch, when we clear a huge page,
the order to clear sub-pages is changed. In quite some situation, we
can get the address that the application will access after we clear the
huge page, for example, in a page fault handler. Instead of clearing
the huge page from begin to end, we will clear the sub-pages farthest
from the the sub-page to access firstly, and clear the sub-page to
access last. This will make the sub-page to access most cache-hot and
sub-pages around it more cache-hot too. If we cannot know the address
the application will access, the begin of the huge page is assumed to be
the the address the application will access.
With this patch, the throughput increases ~28.3% in vm-scalability
anon-w-seq test case with 72 processes on a 2 socket Xeon E5 v3 2699
system (36 cores, 72 threads). The test case creates 72 processes, each
process mmap a big anonymous memory area and writes to it from the begin
to the end. For each process, other processes could be seen as other
workload which generates heavy cache pressure. At the same time, the
cache miss rate reduced from ~33.4% to ~31.7%, the IPC (instruction per
cycle) increased from 0.56 to 0.74, and the time spent in user space is
reduced ~7.9%
Christopher Lameter suggests to clear bytes inside a sub-page from end
to begin too. But tests show no visible performance difference in the
tests. May because the size of page is small compared with the cache
size.
Thanks Andi Kleen to propose to use address to access to determine the
order of sub-pages to clear.
The hugetlbfs access address could be improved, will do that in another
patch.
[ying.huang@intel.com: improve readability of clear_huge_page()]
Link: http://lkml.kernel.org/r/20170830051842.1397-1-ying.huang@intel.com
Link: http://lkml.kernel.org/r/20170815014618.15842-1-ying.huang@intel.com
Suggested-by: Andi Kleen <andi.kleen@intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This is purely required because exit_aio() may block and exit_mmap() may
never start, if the oom_reap_task cannot start running on a mm with
mm_users == 0.
At the same time if the OOM reaper doesn't wait at all for the memory of
the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would
generate a spurious OOM kill.
If it wasn't because of the exit_aio or similar blocking functions in
the last mmput, it would be enough to change the oom_reap_task() in the
case it finds mm_users == 0, to wait for a timeout or to wait for
__mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the
problem here so the concurrency of exit_mmap and oom_reap_task is
apparently warranted.
It's a non standard runtime, exit_mmap() runs without mmap_sem, and
oom_reap_task runs with the mmap_sem for reading as usual (kind of
MADV_DONTNEED).
The race between the two is solved with a combination of
tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP
(serialized by a dummy down_write/up_write cycle on the same lines of
the ksm_exit method).
If the oom_reap_task() may be running concurrently during exit_mmap,
exit_mmap will wait it to finish in down_write (before taking down mm
structures that would make the oom_reap_task fail with use after free).
If exit_mmap comes first, oom_reap_task() will skip the mm if
MMF_OOM_SKIP is already set and in turn all memory is already freed and
furthermore the mm data structures may already have been taken down by
free_pgtables.
[aarcange@redhat.com: incremental one liner]
Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com
[rientjes@google.com: remove unused mmput_async]
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com
[aarcange@redhat.com: microoptimization]
Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com
Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com
Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: David Rientjes <rientjes@google.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If the system has more than one swap device and swap device has the node
information, we can make use of this information to decide which swap
device to use in get_swap_pages() to get better performance.
The current code uses a priority based list, swap_avail_list, to decide
which swap device to use and if multiple swap devices share the same
priority, they are used round robin. This patch changes the previous
single global swap_avail_list into a per-numa-node list, i.e. for each
numa node, it sees its own priority based list of available swap
devices. Swap device's priority can be promoted on its matching node's
swap_avail_list.
The current swap device's priority is set as: user can set a >=0 value,
or the system will pick one starting from -1 then downwards. The
priority value in the swap_avail_list is the negated value of the swap
device's due to plist being sorted from low to high. The new policy
doesn't change the semantics for priority >=0 cases, the previous
starting from -1 then downwards now becomes starting from -2 then
downwards and -1 is reserved as the promoted value.
Take 4-node EX machine as an example, suppose 4 swap devices are
available, each sit on a different node:
swapA on node 0
swapB on node 1
swapC on node 2
swapD on node 3
After they are all swapped on in the sequence of ABCD.
Current behaviour:
their priorities will be:
swapA: -1
swapB: -2
swapC: -3
swapD: -4
And their position in the global swap_avail_list will be:
swapA -> swapB -> swapC -> swapD
prio:1 prio:2 prio:3 prio:4
New behaviour:
their priorities will be(note that -1 is skipped):
swapA: -2
swapB: -3
swapC: -4
swapD: -5
And their positions in the 4 swap_avail_lists[nid] will be:
swap_avail_lists[0]: /* node 0's available swap device list */
swapA -> swapB -> swapC -> swapD
prio:1 prio:3 prio:4 prio:5
swap_avali_lists[1]: /* node 1's available swap device list */
swapB -> swapA -> swapC -> swapD
prio:1 prio:2 prio:4 prio:5
swap_avail_lists[2]: /* node 2's available swap device list */
swapC -> swapA -> swapB -> swapD
prio:1 prio:2 prio:3 prio:5
swap_avail_lists[3]: /* node 3's available swap device list */
swapD -> swapA -> swapB -> swapC
prio:1 prio:2 prio:3 prio:4
To see the effect of the patch, a test that starts N process, each mmap
a region of anonymous memory and then continually write to it at random
position to trigger both swap in and out is used.
On a 2 node Skylake EP machine with 64GiB memory, two 170GB SSD drives
are used as swap devices with each attached to a different node, the
result is:
runtime=30m/processes=32/total test size=128G/each process mmap region=4G
kernel throughput
vanilla 13306
auto-binding 15169 +14%
runtime=30m/processes=64/total test size=128G/each process mmap region=2G
kernel throughput
vanilla 11885
auto-binding 14879 +25%
[aaron.lu@intel.com: v2]
Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
[akpm@linux-foundation.org: use kmalloc_array()]
Link: http://lkml.kernel.org/r/20170814053130.GD2369@aaronlu.sh.intel.com
Link: http://lkml.kernel.org/r/20170816024439.GA10925@aaronlu.sh.intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Cc: "Chen, Tim C" <tim.c.chen@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
TIF_MEMDIE is set only to the tasks whick were either directly selected
by the OOM killer or passed through mark_oom_victim from the allocator
path. tsk_is_oom_victim is more generic and allows to identify all
tasks (threads) which share the mm with the oom victim.
Please note that the freezer still needs to check TIF_MEMDIE because we
cannot thaw tasks which do not participage in oom_victims counting
otherwise a !TIF_MEMDIE task could interfere after oom_disbale returns.
Link: http://lkml.kernel.org/r/20170810075019.28998-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
victims and then, among other things, to give these threads full access
to memory reserves. There are few shortcomings of this implementation,
though.
First of all and the most serious one is that the full access to memory
reserves is quite dangerous because we leave no safety room for the
system to operate and potentially do last emergency steps to move on.
Secondly this flag is per task_struct while the OOM killer operates on
mm_struct granularity so all processes sharing the given mm are killed.
Giving the full access to all these task_structs could lead to a quick
memory reserves depletion. We have tried to reduce this risk by giving
TIF_MEMDIE only to the main thread and the currently allocating task but
that doesn't really solve this problem while it surely opens up a room
for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the
allocator without access to memory reserves because a particular thread
was not the group leader.
Now that we have the oom reaper and that all oom victims are reapable
after 1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the
kthreads") we can be more conservative and grant only partial access to
memory reserves because there are reasonable chances of the parallel
memory freeing. We still want some access to reserves because we do not
want other consumers to eat up the victim's freed memory. oom victims
will still contend with __GFP_HIGH users but those shouldn't be so
aggressive to starve oom victims completely.
Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to
the half of the reserves. This makes the access to reserves independent
on which task has passed through mark_oom_victim. Also drop any usage
of TIF_MEMDIE from the page allocator proper and replace it by
tsk_is_oom_victim as well which will make page_alloc.c completely
TIF_MEMDIE free finally.
CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
ALLOC_NO_WATERMARKS approach.
There is a demand to make the oom killer memcg aware which will imply
many tasks killed at once. This change will allow such a usecase
without worrying about complete memory reserves depletion.
Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
It's been noted that z3fold doesn't scale well when it's run in a large
number of threads on many cores, which can be easily reproduced with fio
'randrw' test with --numjobs=32. E.g. the result for 1 cluster (4 cores)
is:
Run status group 0 (all jobs):
READ: io=244785MB, aggrb=496883KB/s, minb=15527KB/s, ...
WRITE: io=246735MB, aggrb=500841KB/s, minb=15651KB/s, ...
While for 8 cores (2 clusters) the result is:
Run status group 0 (all jobs):
READ: io=244785MB, aggrb=265942KB/s, minb=8310KB/s, ...
WRITE: io=246735MB, aggrb=268060KB/s, minb=8376KB/s, ...
The bottleneck here is the pool lock which many threads become waiting
upon. To reduce that spin lock contention, z3fold can operate only on
the lists local to the current CPU whenever possible. Due to the nature
of z3fold unbuddied list handling (it only takes the first entry off the
list on a hot path), if the z3fold pool is big enough and balanced well
enough, limiting search to only local unbuddied list doesn't lead to a
significant compression ratio degrade (2.57x vs 2.65x in our
measurements).
This patch also introduces two worker threads: one for async in-page
object layout optimization and one for releasing freed pages. This is
done to speed up z3fold_free() which is often on a hot path.
The fio results for 8-core case are now the following:
Run status group 0 (all jobs):
READ: io=244785MB, aggrb=1568.3MB/s, minb=50182KB/s, ...
WRITE: io=246735MB, aggrb=1580.8MB/s, minb=50582KB/s, ...
So we're in for almost 6x performance increase.
Link: http://lkml.kernel.org/r/20170806181443.f9b65018f8bde25ef990f9e8@gmail.com
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
VMA based swap readahead will readahead the virtual pages that is
continuous in the virtual address space. While the original swap
readahead will readahead the swap slots that is continuous in the swap
device. Although VMA based swap readahead is more correct for the swap
slots to be readahead, it will trigger more small random readings, which
may cause the performance of HDD (hard disk) to degrade heavily, and may
finally exceed the benefit.
To avoid the issue, in this patch, if the HDD is used as swap, the VMA
based swap readahead will be disabled, and the original swap readahead
will be used instead.
Link: http://lkml.kernel.org/r/20170807054038.1843-6-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The sysfs interface to control the VMA based swap readahead is added as
follow,
/sys/kernel/mm/swap/vma_ra_enabled
Enable the VMA based swap readahead algorithm, or use the original
global swap readahead algorithm.
/sys/kernel/mm/swap/vma_ra_max_order
Set the max order of the readahead window size for the VMA based swap
readahead algorithm.
The corresponding ABI documentation is added too.
Link: http://lkml.kernel.org/r/20170807054038.1843-5-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The swap readahead is an important mechanism to reduce the swap in
latency. Although pure sequential memory access pattern isn't very
popular for anonymous memory, the space locality is still considered
valid.
In the original swap readahead implementation, the consecutive blocks in
swap device are readahead based on the global space locality estimation.
But the consecutive blocks in swap device just reflect the order of page
reclaiming, don't necessarily reflect the access pattern in virtual
memory. And the different tasks in the system may have different access
patterns, which makes the global space locality estimation incorrect.
In this patch, when page fault occurs, the virtual pages near the fault
address will be readahead instead of the swap slots near the fault swap
slot in swap device. This avoid to readahead the unrelated swap slots.
At the same time, the swap readahead is changed to work on per-VMA from
globally. So that the different access patterns of the different VMAs
could be distinguished, and the different readahead policy could be
applied accordingly. The original core readahead detection and scaling
algorithm is reused, because it is an effect algorithm to detect the
space locality.
The test and result is as follow,
Common test condition
=====================
Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM) Swap device:
NVMe disk
Micro-benchmark with combined access pattern
============================================
vm-scalability, sequential swap test case, 4 processes to eat 50G
virtual memory space, repeat the sequential memory writing until 300
seconds. The first round writing will trigger swap out, the following
rounds will trigger sequential swap in and out.
At the same time, run vm-scalability random swap test case in
background, 8 processes to eat 30G virtual memory space, repeat the
random memory write until 300 seconds. This will trigger random swap-in
in the background.
This is a combined workload with sequential and random memory accessing
at the same time. The result (for sequential workload) is as follow,
Base Optimized
---- ---------
throughput 345413 KB/s 414029 KB/s (+19.9%)
latency.average 97.14 us 61.06 us (-37.1%)
latency.50th 2 us 1 us
latency.60th 2 us 1 us
latency.70th 98 us 2 us
latency.80th 160 us 2 us
latency.90th 260 us 217 us
latency.95th 346 us 369 us
latency.99th 1.34 ms 1.09 ms
ra_hit% 52.69% 99.98%
The original swap readahead algorithm is confused by the background
random access workload, so readahead hit rate is lower. The VMA-base
readahead algorithm works much better.
Linpack
=======
The test memory size is bigger than RAM to trigger swapping.
Base Optimized
---- ---------
elapsed_time 393.49 s 329.88 s (-16.2%)
ra_hit% 86.21% 98.82%
The score of base and optimized kernel hasn't visible changes. But the
elapsed time reduced and readahead hit rate improved, so the optimized
kernel runs better for startup and tear down stages. And the absolute
value of readahead hit rate is high, shows that the space locality is
still valid in some practical workloads.
Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
In the original implementation, it is possible that the existing pages
in the swap cache (not newly readahead) could be marked as the readahead
pages. This will cause the statistics of swap readahead be wrong and
influence the swap readahead algorithm too.
This is fixed via marking a page as the readahead page only if it is
newly allocated and read from the disk.
When testing with linpack, after the fixing the swap readahead hit rate
increased from ~66% to ~86%.
Link: http://lkml.kernel.org/r/20170807054038.1843-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Patch series "mm, swap: VMA based swap readahead", v4.
The swap readahead is an important mechanism to reduce the swap in
latency. Although pure sequential memory access pattern isn't very
popular for anonymous memory, the space locality is still considered
valid.
In the original swap readahead implementation, the consecutive blocks in
swap device are readahead based on the global space locality estimation.
But the consecutive blocks in swap device just reflect the order of page
reclaiming, don't necessarily reflect the access pattern in virtual
memory space. And the different tasks in the system may have different
access patterns, which makes the global space locality estimation
incorrect.
In this patchset, when page fault occurs, the virtual pages near the
fault address will be readahead instead of the swap slots near the fault
swap slot in swap device. This avoid to readahead the unrelated swap
slots. At the same time, the swap readahead is changed to work on
per-VMA from globally. So that the different access patterns of the
different VMAs could be distinguished, and the different readahead
policy could be applied accordingly. The original core readahead
detection and scaling algorithm is reused, because it is an effect
algorithm to detect the space locality.
In addition to the swap readahead changes, some new sysfs interface is
added to show the efficiency of the readahead algorithm and some other
swap statistics.
This new implementation will incur more small random read, on SSD, the
improved correctness of estimation and readahead target should beat the
potential increased overhead, this is also illustrated in the test
results below. But on HDD, the overhead may beat the benefit, so the
original implementation will be used by default.
The test and result is as follow,
Common test condition
=====================
Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
Swap device: NVMe disk
Micro-benchmark with combined access pattern
============================================
vm-scalability, sequential swap test case, 4 processes to eat 50G
virtual memory space, repeat the sequential memory writing until 300
seconds. The first round writing will trigger swap out, the following
rounds will trigger sequential swap in and out.
At the same time, run vm-scalability random swap test case in
background, 8 processes to eat 30G virtual memory space, repeat the
random memory write until 300 seconds. This will trigger random swap-in
in the background.
This is a combined workload with sequential and random memory accessing
at the same time. The result (for sequential workload) is as follow,
Base Optimized
---- ---------
throughput 345413 KB/s 414029 KB/s (+19.9%)
latency.average 97.14 us 61.06 us (-37.1%)
latency.50th 2 us 1 us
latency.60th 2 us 1 us
latency.70th 98 us 2 us
latency.80th 160 us 2 us
latency.90th 260 us 217 us
latency.95th 346 us 369 us
latency.99th 1.34 ms 1.09 ms
ra_hit% 52.69% 99.98%
The original swap readahead algorithm is confused by the background
random access workload, so readahead hit rate is lower. The VMA-base
readahead algorithm works much better.
Linpack
=======
The test memory size is bigger than RAM to trigger swapping.
Base Optimized
---- ---------
elapsed_time 393.49 s 329.88 s (-16.2%)
ra_hit% 86.21% 98.82%
The score of base and optimized kernel hasn't visible changes. But the
elapsed time reduced and readahead hit rate improved, so the optimized
kernel runs better for startup and tear down stages. And the absolute
value of readahead hit rate is high, shows that the space locality is
still valid in some practical workloads.
This patch (of 5):
The statistics for total readahead pages and total readahead hits are
recorded and exported via the following sysfs interface.
/sys/kernel/mm/swap/ra_hits
/sys/kernel/mm/swap/ra_total
With them, the efficiency of the swap readahead could be measured, so
that the swap readahead algorithm and parameters could be tuned
accordingly.
[akpm@linux-foundation.org: don't display swap stats if CONFIG_SWAP=n]
Link: http://lkml.kernel.org/r/20170807054038.1843-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Although llist provides proper APIs, they are not used. Make them used.
Link: http://lkml.kernel.org/r/1502095374-16112-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Cc: zijun_hu <zijun_hu@htc.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Comment for pagetypeinfo_showblockcount() is mistakenly duplicated from
pagetypeinfo_show_free()'s comment. This commit fixes it.
Link: http://lkml.kernel.org/r/20170809185816.11244-1-sj38.park@gmail.com
Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
Signed-off-by: SeongJae Park <sj38.park@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch came out of discussions in this e-mail thread:
http://lkml.kernel.org/r/1499357846-7481-1-git-send-email-mike.kravetz%40oracle.com
The Oracle JVM team is developing a new garbage collection model. This
new model requires multiple mappings of the same anonymous memory. One
straight forward way to accomplish this is with memfd_create. They can
use the returned fd to create multiple mappings of the same memory.
The JVM today has an option to use (static hugetlb) huge pages. If this
option is specified, they would like to use the same garbage collection
model requiring multiple mappings to the same memory. Using hugetlbfs,
it is possible to explicitly mount a filesystem and specify file paths
in order to get an fd that can be used for multiple mappings. However,
this introduces additional system admin work and coordination.
Ideally they would like to get a hugetlbfs fd without requiring explicit
mounting of a filesystem. Today, mmap and shmget can make use of
hugetlbfs without explicitly mounting a filesystem. The patch adds this
functionality to memfd_create.
Add a new flag MFD_HUGETLB to memfd_create() that will specify the file
to be created resides in the hugetlbfs filesystem. This is the generic
hugetlbfs filesystem not associated with any specific mount point. As
with other system calls that request hugetlbfs backed pages, there is
the ability to encode huge page size in the flag arguments.
hugetlbfs does not support sealing operations, therefore specifying
MFD_ALLOW_SEALING with MFD_HUGETLB will result in EINVAL.
Of course, the memfd_man page would need updating if this type of
functionality moves forward.
Link: http://lkml.kernel.org/r/1502149672-7759-2-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
devm_memremap_pages() records mapped ranges in pgmap_radix with an entry
per section's worth of memory (128MB). The key for each of those
entries is a section number.
This leads to false positives when devm_memremap_pages() is passed a
section-unaligned range as lookups in the misalignment fail to return
NULL. We can close this hole by using the pfn as the key for entries in
the tree. The number of entries required to describe a remapped range
is reduced by leveraging multi-order entries.
In practice this approach usually yields just one entry in the tree if
the size and starting address are of the same power-of-2 alignment.
Previously we always needed nr_entries = mapping_size / 128MB.
Link: https://lists.01.org/pipermail/linux-nvdimm/2016-August/006666.html
Link: http://lkml.kernel.org/r/150215410565.39310.13767886055248249438.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
In pcpu_get_vm_areas(), it checks each range is not overlapped. To make
sure it is, only (N^2)/2 comparison is necessary, while current code
does N^2 times. By starting from the next range, it achieves the goal
and the continue could be removed.
Also,
- the overlap check of two ranges could be done with one clause
- one typo in comment is fixed.
Link: http://lkml.kernel.org/r/20170803063822.48702-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When order is -1 or too big, *1UL << order* will be 0, which will cause
a divide error. Although it seems that all callers of
__fragmentation_index() will only do so with a valid order, the patch
can make it more robust.
Should prevent reoccurrences of
https://bugzilla.kernel.org/show_bug.cgi?id=196555
Link: http://lkml.kernel.org/r/1501751520-2598-1-git-send-email-wen.yang99@zte.com.cn
Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
alloc_gigantic_page doesn't consider movability of the gigantic hugetlb
when scanning eligible ranges for the allocation. As 1GB hugetlb pages
are not movable currently this can break the movable zone assumption
that all allocations are migrateable and as such break memory hotplug.
Reorganize the code and use the standard zonelist allocations scheme
that we use for standard hugetbl pages. htlb_alloc_mask will ensure
that only migratable hugetlb pages will ever see a movable zone.
Link: http://lkml.kernel.org/r/20170803083549.21407-1-mhocko@kernel.org
Fixes: 944d9fec8d7a ("hugetlb: add support for gigantic page allocation at runtime")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|