summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* selftests/resctrl: Replace file write with volatile variableIlpo Järvinen2024-02-133-21/+16
| | | | | | | | | | | | | | | | | | | | The fill_buf code prevents compiler optimizating the entire read loop away by writing the final value of the variable into a file. While it achieves the goal, writing into a file requires significant amount of work within the innermost test loop and also error handling. A simpler approach is to take advantage of volatile. Writing through a pointer to a volatile variable is enough to prevent compiler from optimizing the write away, and therefore compiler cannot remove the read loop either. Add a volatile 'value_sink' into resctrl_tests.c and make fill_buf to write into it. As a result, the error handling in fill_buf.c can be simplified. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests/resctrl: Open perf fd before start & add error handlingIlpo Järvinen2024-02-133-11/+23
| | | | | | | | | | | | | | Perf fd (pe_fd) is opened, reset, and enabled during every test the CAT selftest runs. Also, ioctl(pe_fd, ...) calls are not error checked even if ioctl() could return an error. Open perf fd only once before the tests and only reset and enable the counter within the test loop. Add error checking to pe_fd ioctl() calls. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests/resctrl: Move cat_val() to cat_test.c and rename to cat_test()Ilpo Järvinen2024-02-133-87/+90
| | | | | | | | | | | The main CAT test function is called cat_val() and resides in cache.c which is illogical. Rename the function to cat_test() and move it into cat_test.c. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests/resctrl: Convert perf related globals to localsIlpo Järvinen2024-02-131-32/+40
| | | | | | | | | | | | | | | Perf related variables pea_llc_miss, pe_read, and pe_fd are globals in cache.c. Convert them to locals for better scoping and make pea_llc_miss simpler by renaming it to pea. Make close(pe_fd) handling easier to understand by doing it inside cat_val(). Make also sizeof()s use safer way to determine the right struct. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests/resctrl: Improve perf initIlpo Järvinen2024-02-131-10/+6
| | | | | | | | | | | | | | | | | | struct perf_event_attr initialization is spread into perf_event_initialize() and perf_event_attr_initialize() and setting ->config is hardcoded by the deepest level. perf_event_attr init belongs to perf_event_attr_initialize() so move it entirely there. Rename the other function perf_event_initialized_read_format(). Call each init function directly from the test as they will take different parameters (especially true after the perf related global variables are moved to local variables). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests/resctrl: Consolidate naming of perf event related thingsIlpo Järvinen2024-02-131-21/+21
| | | | | | | | | | | | | Naming for perf event related functions, types, and variables is inconsistent. Make struct read_format and all functions related to perf events start with "perf_". Adjust variable names towards the same direction but use shorter names for variables where appropriate (pe prefix). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests/resctrl: Remove nested calls in perf event handlingIlpo Järvinen2024-02-131-50/+14
| | | | | | | | | | | | | | Perf event handling has functions that are the sole caller of another perf event handling related function: - reset_enable_llc_perf() calls perf_event_open_llc_miss() - perf_event_measure() calls get_llc_perf() Remove the extra layer of calls to make the code easier to follow by moving the code into the calling function. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests/resctrl: Remove unnecessary __u64 -> unsigned long conversionIlpo Järvinen2024-02-133-19/+13
| | | | | | | | | | | | Perf counters are __u64 but the code converts them to unsigned long before printing them out. Remove unnecessary type conversion and retain the perf originating value as __u64. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests/resctrl: Split show_cache_info() to test specific and generic partsIlpo Järvinen2024-02-134-44/+66
| | | | | | | | | | | | show_cache_info() calculates results and provides generic cache information. This makes it hard to alter pass/fail conditions. Separate the test specific checks into CAT and CMT test files and leave only the generic information part into show_cache_info(). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests/resctrl: Split measure_cache_vals()Ilpo Järvinen2024-02-133-27/+41
| | | | | | | | | | | | | | | | | | measure_cache_vals() does a different thing depending on the test case that called it: - For CAT, it measures LLC misses through perf. - For CMT, it measures LLC occupancy through resctrl. Split these two functionalities into own functions the CAT and CMT tests can call directly. Replace passing the struct resctrl_val_param parameter with the filename because it's more generic and all those functions need out of resctrl_val. Co-developed-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests/resctrl: Exclude shareable bits from schemata in CAT testIlpo Järvinen2024-02-133-4/+99
| | | | | | | | | | | | | | | | | | CAT test doesn't take shareable bits into account, i.e., the test might be sharing cache with some devices (e.g., graphics). Introduce get_mask_no_shareable() and use it to provision an environment for CAT test where the allocated LLC is isolated better. Excluding shareable_bits may create hole(s) into the cbm_mask, thus add a new helper count_contiguous_bits() to find the longest contiguous set of CBM bits. create_bit_mask() is needed by an upcoming CAT test rewrite so make it available in resctrl.h right away. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests/resctrl: Create cache_portion_size() helperIlpo Järvinen2024-02-133-9/+33
| | | | | | | | | | | | | | | | | | CAT and CMT tests calculate size of the cache portion for the n-bits cache allocation on their own. Add cache_portion_size() helper that calculates size of the cache portion for the given number of bits and use it to replace the existing span calculations. This also prepares for the new CAT test that will need to determine the size of the cache portion also during results processing. Rename also 'cache_size' local variables to 'cache_total_size' to prevent misinterpretations. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests/resctrl: Mark get_cache_size() cache_type constIlpo Järvinen2024-02-132-2/+2
| | | | | | | | | | | get_cache_size() does not modify cache_type so it could be const. Mark cache_type const so that const char * can be passed to it. This prevents warnings once many of the test parameters are marked const. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests/resctrl: Refactor get_cbm_mask() and rename to get_full_cbm()Ilpo Järvinen2024-02-134-24/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Callers of get_cbm_mask() are required to pass a string into which the capacity bitmask (CBM) is read. Neither CAT nor CMT tests need the bitmask as string but just convert it into an unsigned long value. Another limitation is that the bit mask reader can only read .../cbm_mask files. Generalize the bit mask reading function into get_bit_mask() such that it can be used to handle other files besides the .../cbm_mask and handles the unsigned long conversion within get_bit_mask() using fscanf(). Change get_cbm_mask() to use get_bit_mask() and rename it to get_full_cbm() to better indicate what the function does. Return error from get_full_cbm() if the bitmask is zero for some reason because it makes the code more robust as the selftests naturally assume the bitmask has some bits. Also mark cache_type const while at it and remove useless comments that are related to processing of CBM bits. Co-developed-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests/resctrl: Refactor fill_buf functionsIlpo Järvinen2024-02-132-43/+18
| | | | | | | | | | | | | | | | | | | There are unnecessary nested calls in fill_buf.c: - run_fill_buf() calls fill_cache() - alloc_buffer() calls malloc_and_init_memory() Simplify the code flow and remove those unnecessary call levels by moving the called code inside the calling function and remove the duplicated error print. Resolve the difference in run_fill_buf() and fill_cache() parameter name into 'buf_size' which is more descriptive than 'span'. Also, while moving the allocation related code, rename 'p' into 'buf' to be consistent in naming the variables. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests/resctrl: Split fill_buf to allow tests finer-grained controlIlpo Järvinen2024-02-131-6/+15
| | | | | | | | | | | | | | | | | | | | MBM, MBA and CMT test cases call run_fill_buf() that in turn calls fill_cache() to alloc and loop indefinitely around the buffer. This binds buffer allocation and running the benchmark into a single bundle so that a selftest cannot allocate a buffer once and reuse it. CAT test doesn't want to loop around the buffer continuously and after rewrite it needs the ability to allocate the buffer separately. Split buffer allocation out of fill_cache() into alloc_buffer(). This change is part of preparation for the new CAT test that allocates a buffer and does multiple passes over the same buffer (but not in an infinite loop). Co-developed-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests/resctrl: Change function comments to say < 0 on errorIlpo Järvinen2024-02-132-6/+6
| | | | | | | | | | | | | | | A number function comments state the function return non-zero on failure but in reality they can only return 0 on success and < 0 on error. Update the comments to say < 0 on error to match the behavior. While at it, improve cat_val() comment to state that 0 means the test was run (either pass or fail). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests/resctrl: Don't use ctrlc_handler() outside signal handlingIlpo Järvinen2024-02-131-1/+0
| | | | | | | | | | | | | | | | | | | | | | perf_event_open_llc_miss() calls ctrlc_handler() to cleanup if perf_event_open() returns an error. Those cleanups, however, are not the responsibility of perf_event_open_llc_miss() and it thus interferes unnecessarily with the usual cleanup pattern. Worse yet, ctrlc_handler() calls exit() in the end preventing the ordinary cleanup done in the calling function from executing. ctrlc_handler() should only be used as a signal handler, not during normal error handling. Remove call to ctrlc_handler() from perf_event_open_llc_miss(). As unmounting resctrlfs and test cleanup are already handled properly by error rollbacks in the calling functions, no other changes are necessary. Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests/resctrl: Return -1 instead of errno on errorIlpo Järvinen2024-02-137-14/+14
| | | | | | | | | | | | | | | | | | A number of functions in the resctrl selftests return errno. It is problematic because errno is positive which is often counterintuitive. Also, every site returning errno prints the error message already with ksft_perror() so there is not much added value in returning the precise error code. Simply convert all places returning errno to return -1 that is typical userspace error code in case of failures. While at it, improve resctrl_val() comment to state that 0 means the test was run (either pass or fail). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests/resctrl: Convert perror() to ksft_perror() or ksft_print_msg()Ilpo Järvinen2024-02-139-70/+77
| | | | | | | | | | | | | | | | | | | | | | | | | The resctrl selftest code contains a number of perror() calls. Some of them come with hash character and some don't. The kselftest framework provides ksft_perror() that is compatible with test output formatting so it should be used instead of adding custom hash signs. Some perror() calls are too far away from anything that sets error. For those call sites, ksft_print_msg() must be used instead. Convert perror() to ksft_perror() or ksft_print_msg(). Other related changes: - Remove hash signs - Remove trailing stops & newlines from ksft_perror() - Add terminating newlines for converted ksft_print_msg() - Use consistent capitalization - Small fixes/tweaks to typos & grammar of the messages - Extract error printing out of PARENT_EXIT() to be able to differentiate Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* selftests: livepatch: Test livepatching a heavily called syscallMarcos Paulo de Souza2024-01-225-2/+218
| | | | | | | | | | | | | | | | | | | | | | | | | | The test proves that a syscall can be livepatched. It is interesting because syscalls are called a tricky way. Also the process gets livepatched either when sleeping in the userspace or when entering or leaving the kernel space. The livepatch is a bit tricky: 1. The syscall function name is architecture specific. Also ARCH_HAS_SYSCALL_WRAPPER must be taken in account. 2. The syscall must stay working the same way for other processes on the system. It is solved by decrementing a counter only for PIDs of the test processes. It means that the test processes has to call the livepatched syscall at least once. The test creates one userspace process per online cpu. The processes are calling getpid in a busy loop. The intention is to create random locations when the livepatch gets enabled. Nothing is guarantted. The magic is in the randomness. Reviewed-by: Joe Lawrence <joe.lawrence@redhat.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* livepatch: Move tests from lib/livepatch to selftests/livepatchMarcos Paulo de Souza2024-01-2227-115/+98
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The modules are being moved from lib/livepatch to tools/testing/selftests/livepatch/test_modules. This code moving will allow writing more complex tests, like for example an userspace C code that will call a livepatched kernel function. The modules are now built as out-of-tree modules, but being part of the kernel source means they will be maintained. Another advantage of the code moving is to be able to easily change, debug and rebuild the tests by running make on the selftests/livepatch directory, which is not currently possible since the modules on lib/livepatch are build and installed using the "modules" target. The current approach also keeps the ability to execute the tests manually by executing the scripts inside selftests/livepatch directory, as it's currently supported. If the modules are modified, they needed to be rebuilt before running the scripts though. The modules are built before running the selftests when using the kselftest invocations: make kselftest TARGETS=livepatch or make -C tools/testing/selftests/livepatch run_tests Having the modules being built as out-of-modules requires changing the currently used 'modprobe' by 'insmod' and adapt the test scripts that check for the kernel message buffer. Now it is possible to only compile the modules by running: make -C tools/testing/selftests/livepatch/ This way the test modules and other test program can be built in order to be packaged if so desired. As there aren't any modules being built on lib/livepatch, remove the TEST_LIVEPATCH Kconfig and it's references. Note: "make gen_tar" packages the pre-built binaries into the tarball. It means that it will store the test modules pre-built for the kernel running on the build host. Note that these modules need not binary compatible with the kernel built from the same sources. But the same is true for other packaged selftest binaries. The entire kernel sources are needed for rebuilding the selftests on another system. Reviewed-by: Joe Lawrence <joe.lawrence@redhat.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* kselftests: lib.mk: Add TEST_GEN_MODS_DIR variableMarcos Paulo de Souza2024-01-222-5/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add TEST_GEN_MODS_DIR variable for kselftests. It can point to a directory containing kernel modules that will be used by selftest scripts. The modules are built as external modules for the running kernel. As a result they are always binary compatible and the same tests can be used for older or newer kernels. The build requires "kernel-devel" package to be installed. For example, in the upstream sources, the rpm devel package is produced by "make rpm-pkg" The modules can be built independently by make -C tools/testing/selftests/livepatch/ or they will be automatically built before running the tests via make -C tools/testing/selftests/livepatch/ run_tests Note that they are _not_ built when running the standalone tests by calling, for example, ./test-state.sh. Along with TEST_GEN_MODS_DIR, it was necessary to create a new install rule. INSTALL_MODS_RULE is needed because INSTALL_SINGLE_RULE would copy the entire TEST_GEN_MODS_DIR directory to the destination, even the files created by Kbuild to compile the modules. The new install rule copies only the .ko files, as we would expect the gen_tar to work. Reviewed-by: Joe Lawrence <joe.lawrence@redhat.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* Linux 6.8-rc1v6.8-rc1Linus Torvalds2024-01-211-2/+2
|
* Merge tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefsLinus Torvalds2024-01-2178-1426/+1629
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull more bcachefs updates from Kent Overstreet: "Some fixes, Some refactoring, some minor features: - Assorted prep work for disk space accounting rewrite - BTREE_TRIGGER_ATOMIC: after combining our trigger callbacks, this makes our trigger context more explicit - A few fixes to avoid excessive transaction restarts on multithreaded workloads: fstests (in addition to ktest tests) are now checking slowpath counters, and that's shaking out a few bugs - Assorted tracepoint improvements - Starting to break up bcachefs_format.h and move on disk types so they're with the code they belong to; this will make room to start documenting the on disk format better. - A few minor fixes" * tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs: (46 commits) bcachefs: Improve inode_to_text() bcachefs: logged_ops_format.h bcachefs: reflink_format.h bcachefs; extents_format.h bcachefs: ec_format.h bcachefs: subvolume_format.h bcachefs: snapshot_format.h bcachefs: alloc_background_format.h bcachefs: xattr_format.h bcachefs: dirent_format.h bcachefs: inode_format.h bcachefs; quota_format.h bcachefs: sb-counters_format.h bcachefs: counters.c -> sb-counters.c bcachefs: comment bch_subvolume bcachefs: bch_snapshot::btime bcachefs: add missing __GFP_NOWARN bcachefs: opts->compression can now also be applied in the background bcachefs: Prep work for variable size btree node buffers bcachefs: grab s_umount only if snapshotting ...
| * bcachefs: Improve inode_to_text()Kent Overstreet2024-01-211-7/+18
| | | | | | | | | | | | Add line breaks - inode_to_text() is now much easier to read. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: logged_ops_format.hKent Overstreet2024-01-212-27/+31
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: reflink_format.hKent Overstreet2024-01-213-47/+48
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs; extents_format.hKent Overstreet2024-01-212-279/+284
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: ec_format.hKent Overstreet2024-01-212-16/+20
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: subvolume_format.hKent Overstreet2024-01-212-32/+36
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: snapshot_format.hKent Overstreet2024-01-212-33/+37
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: alloc_background_format.hKent Overstreet2024-01-212-93/+94
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: xattr_format.hKent Overstreet2024-01-212-15/+20
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: dirent_format.hKent Overstreet2024-01-212-39/+43
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: inode_format.hKent Overstreet2024-01-212-164/+167
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs; quota_format.hKent Overstreet2024-01-212-42/+48
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: sb-counters_format.hKent Overstreet2024-01-212-95/+100
| | | | | | | | | | | | bcachefs_format.h has gotten too big; let's do some organizing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: counters.c -> sb-counters.cKent Overstreet2024-01-215-8/+7
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: comment bch_subvolumeKent Overstreet2024-01-211-0/+3
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: bch_snapshot::btimeKent Overstreet2024-01-212-0/+3
| | | | | | | | | | | | | | Add a field to bch_snapshot for creation time; this will be important when we start exposing the snapshot tree to userspace. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: add missing __GFP_NOWARNKent Overstreet2024-01-211-1/+1
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: opts->compression can now also be applied in the backgroundKent Overstreet2024-01-2111-23/+24
| | | | | | | | | | | | | | | | | | The "apply this compression method in the background" paths now use the compression option if background_compression is not set; this means that setting or changing the compression option will cause existing data to be compressed accordingly in the background. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Prep work for variable size btree node buffersKent Overstreet2024-01-2118-97/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bcachefs btree nodes are big - typically 256k - and btree roots are pinned in memory. As we're now up to 18 btrees, we now have significant memory overhead in mostly empty btree roots. And in the future we're going to start enforcing that certain btree node boundaries exist, to solve lock contention issues - analagous to XFS's AGIs. Thus, we need to start allocating smaller btree node buffers when we can. This patch changes code that refers to the filesystem constant c->opts.btree_node_size to refer to the btree node buffer size - btree_buf_bytes() - where appropriate. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: grab s_umount only if snapshottingSu Yue2024-01-211-6/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When I was testing mongodb over bcachefs with compression, there is a lockdep warning when snapshotting mongodb data volume. $ cat test.sh prog=bcachefs $prog subvolume create /mnt/data $prog subvolume create /mnt/data/snapshots while true;do $prog subvolume snapshot /mnt/data /mnt/data/snapshots/$(date +%s) sleep 1s done $ cat /etc/mongodb.conf systemLog: destination: file logAppend: true path: /mnt/data/mongod.log storage: dbPath: /mnt/data/ lockdep reports: [ 3437.452330] ====================================================== [ 3437.452750] WARNING: possible circular locking dependency detected [ 3437.453168] 6.7.0-rc7-custom+ #85 Tainted: G E [ 3437.453562] ------------------------------------------------------ [ 3437.453981] bcachefs/35533 is trying to acquire lock: [ 3437.454325] ffffa0a02b2b1418 (sb_writers#10){.+.+}-{0:0}, at: filename_create+0x62/0x190 [ 3437.454875] but task is already holding lock: [ 3437.455268] ffffa0a02b2b10e0 (&type->s_umount_key#48){.+.+}-{3:3}, at: bch2_fs_file_ioctl+0x232/0xc90 [bcachefs] [ 3437.456009] which lock already depends on the new lock. [ 3437.456553] the existing dependency chain (in reverse order) is: [ 3437.457054] -> #3 (&type->s_umount_key#48){.+.+}-{3:3}: [ 3437.457507] down_read+0x3e/0x170 [ 3437.457772] bch2_fs_file_ioctl+0x232/0xc90 [bcachefs] [ 3437.458206] __x64_sys_ioctl+0x93/0xd0 [ 3437.458498] do_syscall_64+0x42/0xf0 [ 3437.458779] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ 3437.459155] -> #2 (&c->snapshot_create_lock){++++}-{3:3}: [ 3437.459615] down_read+0x3e/0x170 [ 3437.459878] bch2_truncate+0x82/0x110 [bcachefs] [ 3437.460276] bchfs_truncate+0x254/0x3c0 [bcachefs] [ 3437.460686] notify_change+0x1f1/0x4a0 [ 3437.461283] do_truncate+0x7f/0xd0 [ 3437.461555] path_openat+0xa57/0xce0 [ 3437.461836] do_filp_open+0xb4/0x160 [ 3437.462116] do_sys_openat2+0x91/0xc0 [ 3437.462402] __x64_sys_openat+0x53/0xa0 [ 3437.462701] do_syscall_64+0x42/0xf0 [ 3437.462982] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ 3437.463359] -> #1 (&sb->s_type->i_mutex_key#15){+.+.}-{3:3}: [ 3437.463843] down_write+0x3b/0xc0 [ 3437.464223] bch2_write_iter+0x5b/0xcc0 [bcachefs] [ 3437.464493] vfs_write+0x21b/0x4c0 [ 3437.464653] ksys_write+0x69/0xf0 [ 3437.464839] do_syscall_64+0x42/0xf0 [ 3437.465009] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ 3437.465231] -> #0 (sb_writers#10){.+.+}-{0:0}: [ 3437.465471] __lock_acquire+0x1455/0x21b0 [ 3437.465656] lock_acquire+0xc6/0x2b0 [ 3437.465822] mnt_want_write+0x46/0x1a0 [ 3437.465996] filename_create+0x62/0x190 [ 3437.466175] user_path_create+0x2d/0x50 [ 3437.466352] bch2_fs_file_ioctl+0x2ec/0xc90 [bcachefs] [ 3437.466617] __x64_sys_ioctl+0x93/0xd0 [ 3437.466791] do_syscall_64+0x42/0xf0 [ 3437.466957] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ 3437.467180] other info that might help us debug this: [ 3437.469670] 2 locks held by bcachefs/35533: other info that might help us debug this: [ 3437.467507] Chain exists of: sb_writers#10 --> &c->snapshot_create_lock --> &type->s_umount_key#48 [ 3437.467979] Possible unsafe locking scenario: [ 3437.468223] CPU0 CPU1 [ 3437.468405] ---- ---- [ 3437.468585] rlock(&type->s_umount_key#48); [ 3437.468758] lock(&c->snapshot_create_lock); [ 3437.469030] lock(&type->s_umount_key#48); [ 3437.469291] rlock(sb_writers#10); [ 3437.469434] *** DEADLOCK *** [ 3437.469670] 2 locks held by bcachefs/35533: [ 3437.469838] #0: ffffa0a02ce00a88 (&c->snapshot_create_lock){++++}-{3:3}, at: bch2_fs_file_ioctl+0x1e3/0xc90 [bcachefs] [ 3437.470294] #1: ffffa0a02b2b10e0 (&type->s_umount_key#48){.+.+}-{3:3}, at: bch2_fs_file_ioctl+0x232/0xc90 [bcachefs] [ 3437.470744] stack backtrace: [ 3437.470922] CPU: 7 PID: 35533 Comm: bcachefs Kdump: loaded Tainted: G E 6.7.0-rc7-custom+ #85 [ 3437.471313] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [ 3437.471694] Call Trace: [ 3437.471795] <TASK> [ 3437.471884] dump_stack_lvl+0x57/0x90 [ 3437.472035] check_noncircular+0x132/0x150 [ 3437.472202] __lock_acquire+0x1455/0x21b0 [ 3437.472369] lock_acquire+0xc6/0x2b0 [ 3437.472518] ? filename_create+0x62/0x190 [ 3437.472683] ? lock_is_held_type+0x97/0x110 [ 3437.472856] mnt_want_write+0x46/0x1a0 [ 3437.473025] ? filename_create+0x62/0x190 [ 3437.473204] filename_create+0x62/0x190 [ 3437.473380] user_path_create+0x2d/0x50 [ 3437.473555] bch2_fs_file_ioctl+0x2ec/0xc90 [bcachefs] [ 3437.473819] ? lock_acquire+0xc6/0x2b0 [ 3437.474002] ? __fget_files+0x2a/0x190 [ 3437.474195] ? __fget_files+0xbc/0x190 [ 3437.474380] ? lock_release+0xc5/0x270 [ 3437.474567] ? __x64_sys_ioctl+0x93/0xd0 [ 3437.474764] ? __pfx_bch2_fs_file_ioctl+0x10/0x10 [bcachefs] [ 3437.475090] __x64_sys_ioctl+0x93/0xd0 [ 3437.475277] do_syscall_64+0x42/0xf0 [ 3437.475454] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ 3437.475691] RIP: 0033:0x7f2743c313af ====================================================== In __bch2_ioctl_subvolume_create(), we grab s_umount unconditionally and unlock it at the end of the function. There is a comment "why do we need this lock?" about the lock coming from commit 42d237320e98 ("bcachefs: Snapshot creation, deletion") The reason is that __bch2_ioctl_subvolume_create() calls sync_inodes_sb() which enforce locked s_umount to writeback all dirty nodes before doing snapshot works. Fix it by read locking s_umount for snapshotting only and unlocking s_umount after sync_inodes_sb(). Signed-off-by: Su Yue <glass.su@suse.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: kvfree bch_fs::snapshots in bch2_fs_snapshots_exitSu Yue2024-01-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bch_fs::snapshots is allocated by kvzalloc in __snapshot_t_mut. It should be freed by kvfree not kfree. Or umount will triger: [ 406.829178 ] BUG: unable to handle page fault for address: ffffe7b487148008 [ 406.830676 ] #PF: supervisor read access in kernel mode [ 406.831643 ] #PF: error_code(0x0000) - not-present page [ 406.832487 ] PGD 0 P4D 0 [ 406.832898 ] Oops: 0000 [#1] PREEMPT SMP PTI [ 406.833512 ] CPU: 2 PID: 1754 Comm: umount Kdump: loaded Tainted: G OE 6.7.0-rc7-custom+ #90 [ 406.834746 ] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [ 406.835796 ] RIP: 0010:kfree+0x62/0x140 [ 406.836197 ] Code: 80 48 01 d8 0f 82 e9 00 00 00 48 c7 c2 00 00 00 80 48 2b 15 78 9f 1f 01 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 03 05 56 9f 1f 01 <48> 8b 50 08 48 89 c7 f6 c2 01 0f 85 b0 00 00 00 66 90 48 8b 07 f6 [ 406.837810 ] RSP: 0018:ffffb9d641607e48 EFLAGS: 00010286 [ 406.838213 ] RAX: ffffe7b487148000 RBX: ffffb9d645200000 RCX: ffffb9d641607dc4 [ 406.838738 ] RDX: 000065bb00000000 RSI: ffffffffc0d88b84 RDI: ffffb9d645200000 [ 406.839217 ] RBP: ffff9a4625d00068 R08: 0000000000000001 R09: 0000000000000001 [ 406.839650 ] R10: 0000000000000001 R11: 000000000000001f R12: ffff9a4625d4da80 [ 406.840055 ] R13: ffff9a4625d00000 R14: ffffffffc0e2eb20 R15: 0000000000000000 [ 406.840451 ] FS: 00007f0a264ffb80(0000) GS:ffff9a4e2d500000(0000) knlGS:0000000000000000 [ 406.840851 ] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 406.841125 ] CR2: ffffe7b487148008 CR3: 000000018c4d2000 CR4: 00000000000006f0 [ 406.841464 ] Call Trace: [ 406.841583 ] <TASK> [ 406.841682 ] ? __die+0x1f/0x70 [ 406.841828 ] ? page_fault_oops+0x159/0x470 [ 406.842014 ] ? fixup_exception+0x22/0x310 [ 406.842198 ] ? exc_page_fault+0x1ed/0x200 [ 406.842382 ] ? asm_exc_page_fault+0x22/0x30 [ 406.842574 ] ? bch2_fs_release+0x54/0x280 [bcachefs] [ 406.842842 ] ? kfree+0x62/0x140 [ 406.842988 ] ? kfree+0x104/0x140 [ 406.843138 ] bch2_fs_release+0x54/0x280 [bcachefs] [ 406.843390 ] kobject_put+0xb7/0x170 [ 406.843552 ] deactivate_locked_super+0x2f/0xa0 [ 406.843756 ] cleanup_mnt+0xba/0x150 [ 406.843917 ] task_work_run+0x59/0xa0 [ 406.844083 ] exit_to_user_mode_prepare+0x197/0x1a0 [ 406.844302 ] syscall_exit_to_user_mode+0x16/0x40 [ 406.844510 ] do_syscall_64+0x4e/0xf0 [ 406.844675 ] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ 406.844907 ] RIP: 0033:0x7f0a2664e4fb Signed-off-by: Su Yue <glass.su@suse.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: bios must be 512 byte alginedKent Overstreet2024-01-211-0/+4
| | | | | | | | | | | | Fixes: 023f9ac9f70f bcachefs: Delete dio read alignment check Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: remove redundant variable tmpColin Ian King2024-01-211-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The variable tmp is being assigned a value but it isn't being read afterwards. The assignment is redundant and so tmp can be removed. Cleans up clang scan build warning: warning: Although the value stored to 'ret' is used in the enclosing expression, the value is never actually read from 'ret' [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Improve trace_trans_restart_relockKent Overstreet2024-01-215-24/+44
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Fix excess transaction restarts in __bchfs_fallocate()Kent Overstreet2024-01-214-16/+35
| | | | | | | | | | | | | | | | | | drop_locks_do() should not be used in a fastpath without first trying the do in nonblocking mode - the unlock and relock will cause excessive transaction restarts and potentially livelocking with other threads that are contending for the same locks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>