summaryrefslogtreecommitdiffstats
path: root/include/trace
diff options
context:
space:
mode:
authorLiam R. Howlett <Liam.Howlett@Oracle.com>2022-09-06 19:48:39 +0000
committerAndrew Morton <akpm@linux-foundation.org>2022-09-26 19:46:13 -0700
commit54a611b605901c7d5d05b6b8f5d04a6ceb0962aa (patch)
tree6f0de88b45238a6cb0973140d6c06a0e9d4fca72 /include/trace
parent9832fb87834e2bd925d30020962c81b05948fa7b (diff)
downloadlinux-54a611b605901c7d5d05b6b8f5d04a6ceb0962aa.tar.gz
linux-54a611b605901c7d5d05b6b8f5d04a6ceb0962aa.tar.bz2
linux-54a611b605901c7d5d05b6b8f5d04a6ceb0962aa.zip
Maple Tree: add new data structure
Patch series "Introducing the Maple Tree" The maple tree is an RCU-safe range based B-tree designed to use modern processor cache efficiently. There are a number of places in the kernel that a non-overlapping range-based tree would be beneficial, especially one with a simple interface. If you use an rbtree with other data structures to improve performance or an interval tree to track non-overlapping ranges, then this is for you. The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. With the increased branching factor, it is significantly shorter than the rbtree so it has fewer cache misses. The removal of the linked list between subsequent entries also reduces the cache misses and the need to pull in the previous and next VMA during many tree alterations. The first user that is covered in this patch set is the vm_area_struct, where three data structures are replaced by the maple tree: the augmented rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The long term goal is to reduce or remove the mmap_lock contention. The plan is to get to the point where we use the maple tree in RCU mode. Readers will not block for writers. A single write operation will be allowed at a time. A reader re-walks if stale data is encountered. VMAs would be RCU enabled and this mode would be entered once multiple tasks are using the mm_struct. Davidlor said : Yes I like the maple tree, and at this stage I don't think we can ask for : more from this series wrt the MM - albeit there seems to still be some : folks reporting breakage. Fundamentally I see Liam's work to (re)move : complexity out of the MM (not to say that the actual maple tree is not : complex) by consolidating the three complimentary data structures very : much worth it considering performance does not take a hit. This was very : much a turn off with the range locking approach, which worst case scenario : incurred in prohibitive overhead. Also as Liam and Matthew have : mentioned, RCU opens up a lot of nice performance opportunities, and in : addition academia[1] has shown outstanding scalability of address spaces : with the foundation of replacing the locked rbtree with RCU aware trees. A similar work has been discovered in the academic press https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf Sheer coincidence. We designed our tree with the intention of solving the hardest problem first. Upon settling on a b-tree variant and a rough outline, we researched ranged based b-trees and RCU b-trees and did find that article. So it was nice to find reassurances that we were on the right path, but our design choice of using ranges made that paper unusable for us. This patch (of 70): The maple tree is an RCU-safe range based B-tree designed to use modern processor cache efficiently. There are a number of places in the kernel that a non-overlapping range-based tree would be beneficial, especially one with a simple interface. If you use an rbtree with other data structures to improve performance or an interval tree to track non-overlapping ranges, then this is for you. The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. With the increased branching factor, it is significantly shorter than the rbtree so it has fewer cache misses. The removal of the linked list between subsequent entries also reduces the cache misses and the need to pull in the previous and next VMA during many tree alterations. The first user that is covered in this patch set is the vm_area_struct, where three data structures are replaced by the maple tree: the augmented rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The long term goal is to reduce or remove the mmap_lock contention. The plan is to get to the point where we use the maple tree in RCU mode. Readers will not block for writers. A single write operation will be allowed at a time. A reader re-walks if stale data is encountered. VMAs would be RCU enabled and this mode would be entered once multiple tasks are using the mm_struct. There is additional BUG_ON() calls added within the tree, most of which are in debug code. These will be replaced with a WARN_ON() call in the future. There is also additional BUG_ON() calls within the code which will also be reduced in number at a later date. These exist to catch things such as out-of-range accesses which would crash anyways. Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: David Howells <dhowells@redhat.com> Tested-by: Sven Schnelle <svens@linux.ibm.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Diffstat (limited to 'include/trace')
-rw-r--r--include/trace/events/maple_tree.h123
1 files changed, 123 insertions, 0 deletions
diff --git a/include/trace/events/maple_tree.h b/include/trace/events/maple_tree.h
new file mode 100644
index 000000000000..2be403bdc2bd
--- /dev/null
+++ b/include/trace/events/maple_tree.h
@@ -0,0 +1,123 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM maple_tree
+
+#if !defined(_TRACE_MM_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MM_H
+
+
+#include <linux/tracepoint.h>
+
+struct ma_state;
+
+TRACE_EVENT(ma_op,
+
+ TP_PROTO(const char *fn, struct ma_state *mas),
+
+ TP_ARGS(fn, mas),
+
+ TP_STRUCT__entry(
+ __field(const char *, fn)
+ __field(unsigned long, min)
+ __field(unsigned long, max)
+ __field(unsigned long, index)
+ __field(unsigned long, last)
+ __field(void *, node)
+ ),
+
+ TP_fast_assign(
+ __entry->fn = fn;
+ __entry->min = mas->min;
+ __entry->max = mas->max;
+ __entry->index = mas->index;
+ __entry->last = mas->last;
+ __entry->node = mas->node;
+ ),
+
+ TP_printk("%s\tNode: %p (%lu %lu) range: %lu-%lu",
+ __entry->fn,
+ (void *) __entry->node,
+ (unsigned long) __entry->min,
+ (unsigned long) __entry->max,
+ (unsigned long) __entry->index,
+ (unsigned long) __entry->last
+ )
+)
+TRACE_EVENT(ma_read,
+
+ TP_PROTO(const char *fn, struct ma_state *mas),
+
+ TP_ARGS(fn, mas),
+
+ TP_STRUCT__entry(
+ __field(const char *, fn)
+ __field(unsigned long, min)
+ __field(unsigned long, max)
+ __field(unsigned long, index)
+ __field(unsigned long, last)
+ __field(void *, node)
+ ),
+
+ TP_fast_assign(
+ __entry->fn = fn;
+ __entry->min = mas->min;
+ __entry->max = mas->max;
+ __entry->index = mas->index;
+ __entry->last = mas->last;
+ __entry->node = mas->node;
+ ),
+
+ TP_printk("%s\tNode: %p (%lu %lu) range: %lu-%lu",
+ __entry->fn,
+ (void *) __entry->node,
+ (unsigned long) __entry->min,
+ (unsigned long) __entry->max,
+ (unsigned long) __entry->index,
+ (unsigned long) __entry->last
+ )
+)
+
+TRACE_EVENT(ma_write,
+
+ TP_PROTO(const char *fn, struct ma_state *mas, unsigned long piv,
+ void *val),
+
+ TP_ARGS(fn, mas, piv, val),
+
+ TP_STRUCT__entry(
+ __field(const char *, fn)
+ __field(unsigned long, min)
+ __field(unsigned long, max)
+ __field(unsigned long, index)
+ __field(unsigned long, last)
+ __field(unsigned long, piv)
+ __field(void *, val)
+ __field(void *, node)
+ ),
+
+ TP_fast_assign(
+ __entry->fn = fn;
+ __entry->min = mas->min;
+ __entry->max = mas->max;
+ __entry->index = mas->index;
+ __entry->last = mas->last;
+ __entry->piv = piv;
+ __entry->val = val;
+ __entry->node = mas->node;
+ ),
+
+ TP_printk("%s\tNode %p (%lu %lu) range:%lu-%lu piv (%lu) val %p",
+ __entry->fn,
+ (void *) __entry->node,
+ (unsigned long) __entry->min,
+ (unsigned long) __entry->max,
+ (unsigned long) __entry->index,
+ (unsigned long) __entry->last,
+ (unsigned long) __entry->piv,
+ (void *) __entry->val
+ )
+)
+#endif /* _TRACE_MM_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>