summaryrefslogtreecommitdiffstats
path: root/fs/ext4/ext4.h
diff options
context:
space:
mode:
authorOjaswin Mujoo <ojaswin@linux.ibm.com>2023-03-25 13:43:41 +0530
committerTheodore Ts'o <tytso@mit.edu>2023-04-06 01:13:13 -0400
commit3872778664e36528caf8b27f355e75482f6d562d (patch)
tree2479c17631c20fda6365762d3b9188221f4841b4 /fs/ext4/ext4.h
parenta8e38fd37cff911638ac288adb138265f71e50c0 (diff)
downloadlinux-stable-3872778664e36528caf8b27f355e75482f6d562d.tar.gz
linux-stable-3872778664e36528caf8b27f355e75482f6d562d.tar.bz2
linux-stable-3872778664e36528caf8b27f355e75482f6d562d.zip
ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list
Currently, the kernel uses i_prealloc_list to hold all the inode preallocations. This is known to cause degradation in performance in workloads which perform large number of sparse writes on a single file. This is mainly because functions like ext4_mb_normalize_request() and ext4_mb_use_preallocated() iterate over this complete list, resulting in slowdowns when large number of PAs are present. Patch 27bc446e2 partially fixed this by enforcing a limit of 512 for the inode preallocation list and adding logic to continually trim the list if it grows above the threshold, however our testing revealed that a hardcoded value is not suitable for all kinds of workloads. To optimize this, add an rbtree to the inode and hold the inode preallocations in this rbtree. This will make iterating over inode PAs faster and scale much better than a linked list. Additionally, we also had to remove the LRU logic that was added during trimming of the list (in ext4_mb_release_context()) as it will add extra overhead in rbtree. The discards now happen in the lowest-logical-offset-first order. ** Locking notes ** With the introduction of rbtree to maintain inode PAs, we can't use RCU to walk the tree for searching since it can result in partial traversals which might miss some nodes(or entire subtrees) while discards happen in parallel (which happens under a lock). Hence this patch converts the ei->i_prealloc_lock spin_lock to rw_lock. Almost all the codepaths that read/modify the PA rbtrees are protected by the higher level inode->i_data_sem (except ext4_mb_discard_group_preallocations() and ext4_clear_inode()) IIUC, the only place we need lock protection is when one thread is reading "searching" the PA rbtree (earlier protected under rcu_read_lock()) and another is "deleting" the PAs in ext4_mb_discard_group_preallocations() function (which iterates all the PAs using the grp->bb_prealloc_list and deletes PAs from the tree without taking any inode lock (i_data_sem)). So, this patch converts all rcu_read_lock/unlock() paths for inode list PA to use read_lock() and all places where we were using ei->i_prealloc_lock spinlock will now be using write_lock(). Note that this makes the fast path (searching of the right PA e.g. ext4_mb_use_preallocated() or ext4_mb_normalize_request()), now use read_lock() instead of rcu_read_lock/unlock(). Ths also will now block due to slow discard path (ext4_mb_discard_group_preallocations()) which uses write_lock(). But this is not as bad as it looks. This is because - 1. The slow path only occurs when the normal allocation failed and we can say that we are low on disk space. One can argue this scenario won't be much frequent. 2. ext4_mb_discard_group_preallocations(), locks and unlocks the rwlock for deleting every individual PA. This gives enough opportunity for the fast path to acquire the read_lock for searching the PA inode list. Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/4137bce8f6948fedd8bae134dabae24acfe699c6.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Diffstat (limited to 'fs/ext4/ext4.h')
-rw-r--r--fs/ext4/ext4.h4
1 files changed, 2 insertions, 2 deletions
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9b2cfc32cf78..8cf955b1dca5 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1120,8 +1120,8 @@ struct ext4_inode_info {
/* mballoc */
atomic_t i_prealloc_active;
- struct list_head i_prealloc_list;
- spinlock_t i_prealloc_lock;
+ struct rb_root i_prealloc_node;
+ rwlock_t i_prealloc_lock;
/* extents status tree */
struct ext4_es_tree i_es_tree;