path: root/fs
Commit message | Author | Age | Files | Lines
* exec: Fix NOMMU linux_binprm::exec in transfer_args_to_stack() (Max Filippov, 2024-04-03; 1 file, -0/+1)

  commit 2aea94ac14d1e0a8ae9e34febebe208213ba72f7 upstream.

  In NOMMU kernel the value of linux_binprm::p is the offset inside the temporary program arguments array maintained in separate pages in the linux_binprm::page. linux_binprm::exec being a copy of linux_binprm::p thus must be adjusted when that array is copied to the user stack.

  Without that adjustment the value passed by the NOMMU kernel to the ELF program in the AT_EXECFN entry of the aux array doesn't make any sense and it may break programs that try to access memory pointed to by that entry.

  Adjust linux_binprm::exec before the successful return from the transfer_args_to_stack().

  Cc: <stable@vger.kernel.org>
  Fixes: b6a2fea39318 ("mm: variable length argument support")
  Fixes: 5edc2a5123a7 ("binfmt_elf_fdpic: wire up AT_EXECFD, AT_EXECFN, AT_SECURE")
  Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
  Link: https://lore.kernel.org/r/20240320182607.1472887-1-jcmvbkbc@gmail.com
  Signed-off-by: Kees Cook <keescook@chromium.org>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: zoned: use zone aware sb location for scrub (Johannes Thumshirn, 2024-04-03; 1 file, -1/+11)

  commit 74098a989b9c3370f768140b7783a7aaec2759b3 upstream.

  At the moment scrub_supers() doesn't grab the super block's location via the zoned device aware btrfs_sb_log_location() but via btrfs_sb_offset(). This leads to checksum errors on 'scrub' as we're not accessing the correct location of the super block.

  So use btrfs_sb_log_location() for getting the super blocks location on scrub.

  Reported-by: WA AM <waautomata@gmail.com>
  Link: http://lore.kernel.org/linux-btrfs/CANU2Z0EvUzfYxczLgGUiREoMndE9WdQnbaawV5Fv5gNXptPUKw@mail.gmail.com
  CC: stable@vger.kernel.org # 5.15+
  Reviewed-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
  Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: zoned: don't skip block groups with 100% zone unusable (Johannes Thumshirn, 2024-04-03; 1 file, -1/+2)

  commit a8b70c7f8600bc77d03c0b032c0662259b9e615e upstream.

  Commit f4a9f219411f ("btrfs: do not delete unused block group if it may be used soon") changed the behaviour of deleting unused block-groups on zoned filesystems. Starting with this commit, we're using btrfs_space_info_used() to calculate the number of used bytes in a space_info. But btrfs_space_info_used() also accounts btrfs_space_info::bytes_zone_unusable as used bytes.

  So if a block group is 100% zone_unusable it is skipped from the deletion step. In order not to skip fully zone_unusable block-groups, also check if the block-group has bytes left that can be used on a zoned filesystem.

  Fixes: f4a9f219411f ("btrfs: do not delete unused block group if it may be used soon")
  CC: stable@vger.kernel.org # 6.1+
  Reviewed-by: Filipe Manana <fdmanana@suse.com>
  Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: fix race in read_extent_buffer_pages() (Tavian Barnes, 2024-04-03; 1 file, -0/+13)

  commit ef1e68236b9153c27cb7cf29ead0c532870d4215 upstream.

  There are reports from tree-checker that detects corrupted nodes, without any obvious pattern so possibly an overwrite in memory. After some debugging it turns out there's a race when reading an extent buffer: the uptodate status can be missed.

  To prevent concurrent reads for the same extent buffer, read_extent_buffer_pages() performs these checks:

    /* (1) */
    if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
            return 0;

    /* (2) */
    if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
            goto done;

  At this point, it seems safe to start the actual read operation. Once that completes, end_bbio_meta_read() does

    /* (3) */
    set_extent_buffer_uptodate(eb);

    /* (4) */
    clear_bit(EXTENT_BUFFER_READING, &eb->bflags);

  Normally, this is enough to ensure only one read happens, and all other callers wait for it to finish before returning. Unfortunately, there is a racy interleaving:

    Thread A | Thread B | Thread C
    ---------+----------+---------
       (1)   |          |
             |   (1)    |
       (2)   |          |
       (3)   |          |
       (4)   |          |
             |   (2)    |
             |          |   (1)

  When this happens, thread B kicks off an unnecessary read. Worse, thread C will see UPTODATE set and return immediately, while the read from thread B is still in progress. This race could result in tree-checker errors like this as the extent buffer is concurrently modified:

    BTRFS critical (device dm-0): corrupted node, root=256 block=8550954455682405139 owner mismatch, have 11858205567642294356 expect [256, 18446744073709551360]

  Fix it by testing UPTODATE again after setting the READING bit, and if it's been set, skip the unnecessary read.

  Fixes: d7172f52e993 ("btrfs: use per-buffer locking for extent_buffer reading")
  Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/
  Link: https://lore.kernel.org/linux-btrfs/f51a6d5d7432455a6a858d51b49ecac183e0bbc9.1706312914.git.wqu@suse.com/
  Link: https://lore.kernel.org/linux-btrfs/c7241ea4-fcc6-48d2-98c8-b5ea790d6c89@gmx.com/
  CC: stable@vger.kernel.org # 6.5+
  Reviewed-by: Qu Wenruo <wqu@suse.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  [ minor update of changelog ]
  Signed-off-by: David Sterba <dsterba@suse.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
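  A minimal sketch of the fixed ordering described above (the function name and the elided bio submission/wait details are illustrative, not the literal upstream diff):

    /* Sketch: re-check UPTODATE after winning the READING bit. */
    static int read_extent_buffer_pages_sketch(struct extent_buffer *eb)
    {
    	/* (1) fast path: someone already completed the read */
    	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
    		return 0;

    	/* (2) only one thread may own the read */
    	if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
    		goto done;

    	/*
    	 * Another thread may have finished its read, i.e. steps (3)/(4),
    	 * between (1) and (2). Without this re-check we would issue a
    	 * second, unnecessary read while others already see UPTODATE.
    	 */
    	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags)) {
    		clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
    		smp_mb__after_atomic();
    		wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
    		return 0;
    	}

    	/* ... submit the actual read bio here ... */
    	return 0;

    done:
    	/* wait for the owning thread's read to finish */
    	wait_on_bit_io(&eb->bflags, EXTENT_BUFFER_READING, TASK_UNINTERRUPTIBLE);
    	return 0;
    }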
* btrfs: validate device maj:min during open (Anand Jain, 2024-04-03; 1 file, -0/+10)

  commit 9f7eb8405dcbc79c5434821e9e3e92abe187ee8e upstream.

  Boris managed to create a device capable of changing its maj:min without altering its device path. Only multi-devices can be scanned. A device that gets scanned and remains in the btrfs kernel cache might end up with an incorrect maj:min.

  Although the temp-fsid feature patch did not introduce this bug, it could lead to issues if the above multi-device is converted to a single device with a stale maj:min. Subsequently, attempting to mount the same device with the correct maj:min might mistake it for another device with the same fsid, potentially resulting in wrongly auto-enabling the temp-fsid feature.

  To address this, this patch validates the device's maj:min at the time of device open and updates it if it has changed since the last scan.

  CC: stable@vger.kernel.org # 6.7+
  Fixes: a5b8a5f9f835 ("btrfs: support cloned-device mount capability")
  Reported-by: Boris Burkov <boris@bur.io>
  Co-developed-by: Boris Burkov <boris@bur.io>
  Reviewed-by: Boris Burkov <boris@bur.io>
  Signed-off-by: Anand Jain <anand.jain@oracle.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: do not skip re-registration for the mounted device (Anand Jain, 2024-04-03; 1 file, -10/+47)

  commit d565fffa68560ac540bf3d62cc79719da50d5e7a upstream.

  There are reports that since version 6.7 update-grub fails to find the device of the root on systems without initrd and on a single device.

  This looks like the device name changed in the output of /proc/self/mountinfo:

    6.5-rc5 working
    18 1 0:16 / / rw,noatime - btrfs /dev/sda8 ...

    6.7 not working:
    17 1 0:15 / / rw,noatime - btrfs /dev/root ...

  and "update-grub" shows this error:

    /usr/sbin/grub-probe: error: cannot find a device for / (is /dev mounted?)

  This looks like it's related to the device name, but grub-probe recognizes the "/dev/root" path and tries to find the underlying device. However there's a special case for some filesystems, for btrfs in particular. The generic root device detection heuristic is not done and it all relies on reading the device infos by a btrfs specific ioctl. This ioctl returns the device name as it was saved at the time of device scan (in this case it's /dev/root).

  The change in 6.7 for temp_fsid to allow several single device filesystem to exist with the same fsid (and transparently generate a new UUID at mount time) was to skip caching/registering such devices. This also skipped mounted device. One step of scanning is to check if the device name hasn't changed, and if yes then update the cached value.

  This broke the grub-probe as it always read the device /dev/root and couldn't find it in the system. A temporary workaround is to create a symlink but this does not survive reboot.

  The right fix is to allow updating the device path of a mounted filesystem even if this is a single device one.

  In the fix, check if the device's major:minor number matches with the cached device. If they do, then we can allow the scan to happen so that device_list_add() can take care of updating the device path. The file descriptor remains unchanged.

  This does not affect the temp_fsid feature, the UUID of the mounted filesystem remains the same and the matching is based on device major:minor which is unique per mounted filesystem.

  This covers the path when the device (that exists for all mounted devices) name changes, updating /dev/root to /dev/sdx. Any other single device with filesystem and is not mounted is still skipped.

  Note that if a system is booted and initial mount is done on the /dev/root device, this will be the cached name of the device. Only after the command "btrfs device scan" it will change as it triggers the rename.

  The fix was verified by users whose systems were affected.

  Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=218353
  Link: https://lore.kernel.org/lkml/CAKLYgeJ1tUuqLcsquwuFqjDXPSJpEiokrWK2gisPKDZLs8Y2TQ@mail.gmail.com/
  Fixes: bc27d6f0aa0e ("btrfs: scan but don't register device on single device filesystem")
  CC: stable@vger.kernel.org # 6.7+
  Tested-by: Alex Romosan <aromosan@gmail.com>
  Tested-by: CHECK_1234543212345@protonmail.com
  Signed-off-by: Anand Jain <anand.jain@oracle.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: fix deadlock with fiemap and extent locking (Josef Bacik, 2024-04-03; 1 file, -17/+45)

  commit b0ad381fa7690244802aed119b478b4bdafc31dd upstream.

  While working on the patchset to remove extent locking I got a lockdep splat with fiemap and pagefaulting with my new extent lock replacement lock.

  This deadlock exists with our normal code, we just don't have lockdep annotations with the extent locking so we've never noticed it.

  Since we're copying the fiemap extent to user space on every iteration we have the chance of pagefaulting. Because we hold the extent lock for the entire range we could mkwrite into a range in the file that we have mmap'ed. This would deadlock with the following stack trace

    [<0>] lock_extent+0x28d/0x2f0
    [<0>] btrfs_page_mkwrite+0x273/0x8a0
    [<0>] do_page_mkwrite+0x50/0xb0
    [<0>] do_fault+0xc1/0x7b0
    [<0>] __handle_mm_fault+0x2fa/0x460
    [<0>] handle_mm_fault+0xa4/0x330
    [<0>] do_user_addr_fault+0x1f4/0x800
    [<0>] exc_page_fault+0x7c/0x1e0
    [<0>] asm_exc_page_fault+0x26/0x30
    [<0>] rep_movs_alternative+0x33/0x70
    [<0>] _copy_to_user+0x49/0x70
    [<0>] fiemap_fill_next_extent+0xc8/0x120
    [<0>] emit_fiemap_extent+0x4d/0xa0
    [<0>] extent_fiemap+0x7f8/0xad0
    [<0>] btrfs_fiemap+0x49/0x80
    [<0>] __x64_sys_ioctl+0x3e1/0xb50
    [<0>] do_syscall_64+0x94/0x1a0
    [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76

  I wrote an fstest to reproduce this deadlock without my replacement lock and verified that the deadlock exists with our existing locking.

  To fix this simply don't take the extent lock for the entire duration of the fiemap. This is safe in general because we keep track of where we are when we're searching the tree, so if an ordered extent updates in the middle of our fiemap call we'll still emit the correct extents because we know what offset we were on before.

  The only place we maintain the lock is searching delalloc. Since the delalloc stuff can change during writeback we want to lock the extent range so we have a consistent view of delalloc at the time we're checking to see if we need to set the delalloc flag.

  With this patch applied we no longer deadlock with my testcase.

  CC: stable@vger.kernel.org # 6.1+
  Reviewed-by: Filipe Manana <fdmanana@suse.com>
  Signed-off-by: Josef Bacik <josef@toxicpanda.com>
  Reviewed-by: David Sterba <dsterba@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* fs/aio: Check IOCB_AIO_RW before the struct aio_kiocb conversion (Bart Van Assche, 2024-04-03; 1 file, -2/+6)

  commit 961ebd120565cb60cebe21cb634fbc456022db4a upstream.

  The first kiocb_set_cancel_fn() argument may point at a struct kiocb that is not embedded inside struct aio_kiocb. With the current code, depending on the compiler, the req->ki_ctx read happens either before the IOCB_AIO_RW test or after that test. Move the req->ki_ctx read such that it is guaranteed that the IOCB_AIO_RW test happens first.

  Reported-by: Eric Biggers <ebiggers@kernel.org>
  Cc: Benjamin LaHaise <ben@communityfibre.ca>
  Cc: Eric Biggers <ebiggers@google.com>
  Cc: Christoph Hellwig <hch@lst.de>
  Cc: Avi Kivity <avi@scylladb.com>
  Cc: Sandeep Dhavale <dhavale@google.com>
  Cc: Jens Axboe <axboe@kernel.dk>
  Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Cc: Kent Overstreet <kent.overstreet@linux.dev>
  Cc: stable@vger.kernel.org
  Fixes: b820de741ae4 ("fs/aio: Restrict kiocb_set_cancel_fn() to I/O submitted via libaio")
  Signed-off-by: Bart Van Assche <bvanassche@acm.org>
  Link: https://lore.kernel.org/r/20240304235715.3790858-1-bvanassche@acm.org
  Reviewed-by: Jens Axboe <axboe@kernel.dk>
  Reviewed-by: Eric Biggers <ebiggers@google.com>
  Signed-off-by: Christian Brauner <brauner@kernel.org>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
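  A sketch of the reordered function (simplified from fs/aio.c; the list handling normally done under ctx_lock is elided, so this is illustrative rather than the upstream code):

    void kiocb_set_cancel_fn(struct kiocb *iocb, kiocb_cancel_fn *cancel)
    {
    	struct aio_kiocb *req;
    	struct kioctx *ctx;
    	unsigned long flags;

    	/*
    	 * The kiocb may not be embedded in a struct aio_kiocb at all
    	 * (I/O not submitted via libaio); bail out before container_of()
    	 * and before reading any aio_kiocb field such as ki_ctx.
    	 */
    	if (!(iocb->ki_flags & IOCB_AIO_RW))
    		return;

    	req = container_of(iocb, struct aio_kiocb, rw);
    	ctx = req->ki_ctx;	/* only safe after the flag check above */

    	spin_lock_irqsave(&ctx->ctx_lock, flags);
    	req->ki_cancel = cancel;
    	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
    }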
* ksmbd: fix potencial out-of-bounds when buffer offset is invalid (Namjae Jeon, 2024-04-03; 2 files, -29/+42)

  [ Upstream commit c6cd2e8d2d9aa7ee35b1fa6a668e32a22a9753da ]

  I found a potential out-of-bounds access when the buffer offset fields of a few requests are invalid. This patch sets the minimum value of the buffer offset field to ->Buffer offset to validate buffer length.

  Cc: stable@vger.kernel.org
  Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
  Signed-off-by: Steve French <stfrench@microsoft.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* NFSD: Fix nfsd_clid_class use of __string_len() macro (Steven Rostedt (Google), 2024-04-03; 1 file, -1/+1)

  [ Upstream commit 9388a2aa453321bcf1ad2603959debea9e6ab6d4 ]

  I'm working on restructuring the __string* macros so that it doesn't need to recalculate the string twice. That is, it will save it off when processing __string() and the __assign_str() will not need to do the work again as it currently does.

  Currently __string_len(item, src, len) doesn't actually use "src", but my changes will require src to be correct as that is where the __assign_str() will get its value from.

  The event class nfsd_clid_class has:

    __string_len(name, name, clp->cl_name.len)

  But the second "name" does not exist and causes my changes to fail to build. That second parameter should be: clp->cl_name.data.

  Link: https://lore.kernel.org/linux-trace-kernel/20240222122828.3d8d213c@gandalf.local.home
  Cc: Neil Brown <neilb@suse.de>
  Cc: Olga Kornievskaia <kolga@netapp.com>
  Cc: Dai Ngo <Dai.Ngo@oracle.com>
  Cc: Tom Talpey <tom@talpey.com>
  Cc: stable@vger.kernel.org
  Fixes: d27b74a8675ca ("NFSD: Use new __string_len C macros for nfsd_clid_class")
  Acked-by: Chuck Lever <chuck.lever@oracle.com>
  Acked-by: Jeff Layton <jlayton@kernel.org>
  Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
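  The one-line change the report describes, spelled out (shown for clarity; not a verbatim diff of the trace event definition):

    /* Before: the second argument names a variable that does not exist. */
    __string_len(name, name, clp->cl_name.len)

    /* After: src points at the buffer actually backing the string. */
    __string_len(name, clp->cl_name.data, clp->cl_name.len)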
* ksmbd: fix slab-out-of-bounds in smb_strndup_from_utf16() (Namjae Jeon, 2024-04-03; 1 file, -1/+4)

  [ Upstream commit a80a486d72e20bd12c335bcd38b6e6f19356b0aa ]

  If ->NameOffset of smb2_create_req is smaller than the Buffer offset of smb2_create_req, a slab-out-of-bounds read can happen from smb2_open. This patch sets the minimum value of the name offset to the buffer offset to validate the name length of smb2_create_req().

  Cc: stable@vger.kernel.org
  Reported-by: Xuanzhe Yu <yuxuanzhe@outlook.com>
  Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
  Signed-off-by: Steve French <stfrench@microsoft.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* cifs: open_cached_dir(): add FILE_READ_EA to desired access (Eugene Korenevsky, 2024-04-03; 1 file, -1/+2)

  [ Upstream commit f1b8224b4e6ed59e7e6f5c548673c67410098d8d ]

  Since smb2_query_eas() reads EA and uses cached directory, open_cached_dir() should request FILE_READ_EA access.

  Otherwise listxattr() and getxattr() will fail with EACCES (0xc0000022 STATUS_ACCESS_DENIED SMB status).

  Link: https://bugzilla.kernel.org/show_bug.cgi?id=218543
  Cc: stable@vger.kernel.org
  Signed-off-by: Eugene Korenevsky <ekorenevsky@astralinux.ru>
  Signed-off-by: Steve French <stfrench@microsoft.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* cifs: reduce warning log level for server not advertising interfaces (Shyam Prasad N, 2024-04-03; 1 file, -2/+2)

  [ Upstream commit 16a57d7681110b25708c7042688412238e6f73a9 ]

  Several users have reported this log getting dumped too regularly to kernel log. The likely root cause has been identified, and it suggests that this situation is expected for some configurations (for example SMB2.1).

  Since the function returns appropriately even for such cases, it is fairly harmless to make this a debug log. When needed, the verbosity can be increased to capture this log.

  Cc: stable@vger.kernel.org
  Reported-by: Jan Čermák <sairon@sairon.cz>
  Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
  Signed-off-by: Steve French <stfrench@microsoft.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* cifs: make cifs_chan_update_iface() a void function (Dan Carpenter, 2024-04-03; 2 files, -11/+8)

  [ Upstream commit 8d606c311b75e81063b4ea650b301cbe0c4ed5e1 ]

  The return values for cifs_chan_update_iface() didn't match what the documentation said and nothing was checking them anyway. Just make it a void function.

  Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
  Signed-off-by: Steve French <stfrench@microsoft.com>
  Stable-dep-of: 16a57d768111 ("cifs: reduce warning log level for server not advertising interfaces")
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* cifs: delete unnecessary NULL checks in cifs_chan_update_iface() (Dan Carpenter, 2024-04-03; 1 file, -15/+11)

  [ Upstream commit c3a11c0ec66c1e0652e3a2bb4f5cc74eea0ba486 ]

  We return early if "iface" is NULL so there is no need to check here. Delete those checks.

  Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
  Signed-off-by: Steve French <stfrench@microsoft.com>
  Stable-dep-of: 16a57d768111 ("cifs: reduce warning log level for server not advertising interfaces")
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* cifs: make sure server interfaces are requested only for SMB3+ (Shyam Prasad N, 2024-04-03; 4 files, -3/+13)

  [ Upstream commit 13c0a74747cb7fdadf58c5d3a7d52cfca2d51736 ]

  Some code paths for querying server interfaces make a false assumption that they will only get called for SMB3+. Since this function can now get called from generic code paths, the correct thing to do is to have a specific handler for this functionality per SMB dialect, and call this handler.

  This change adds such a handler and implements this handler only for SMB 3.0 and 3.1.1.

  Cc: stable@vger.kernel.org
  Cc: Jan Čermák <sairon@sairon.cz>
  Reported-by: Paulo Alcantara <pc@manguebit.com>
  Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
  Signed-off-by: Steve French <stfrench@microsoft.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* nilfs2: prevent kernel bug at submit_bh_wbc() (Ryusuke Konishi, 2024-04-03; 1 file, -1/+1)

  [ Upstream commit 269cdf353b5bdd15f1a079671b0f889113865f20 ]

  Fix a bug where nilfs_get_block() returns a successful status when searching and inserting the specified block both fail inconsistently. If this inconsistent behavior is not due to a previously fixed bug, then an unexpected race is occurring, so return a temporary error -EAGAIN instead.

  This prevents callers such as __block_write_begin_int() from requesting a read into a buffer that is not mapped, which would cause the BUG_ON check for the BH_Mapped flag in submit_bh_wbc() to fail.

  Link: https://lkml.kernel.org/r/20240313105827.5296-3-konishi.ryusuke@gmail.com
  Fixes: 1f5abe7e7dbc ("nilfs2: replace BUG_ON and BUG calls triggerable from ioctl")
  Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
  Cc: <stable@vger.kernel.org>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* nilfs2: fix failure to detect DAT corruption in btree and direct mappings (Ryusuke Konishi, 2024-04-03; 2 files, -4/+14)

  [ Upstream commit f2f26b4a84a0ef41791bd2d70861c8eac748f4ba ]

  Patch series "nilfs2: fix kernel bug at submit_bh_wbc()".

  This resolves a kernel BUG reported by syzbot. Since there are two flaws involved, I've made each one a separate patch. The first patch alone resolves the syzbot-reported bug, but I think both fixes should be sent to stable, so I've tagged them as such.

  This patch (of 2):

  Syzbot has reported a kernel bug in submit_bh_wbc() when writing file data to a nilfs2 file system whose metadata is corrupted. There are two flaws involved in this issue.

  The first flaw is that when nilfs_get_block() locates a data block using btree or direct mapping, if the disk address translation routine nilfs_dat_translate() fails with internal code -ENOENT due to DAT metadata corruption, it can be passed back to nilfs_get_block(). This causes nilfs_get_block() to misidentify an existing block as non-existent, causing both data block lookup and insertion to fail inconsistently.

  The second flaw is that nilfs_get_block() returns a successful status in this inconsistent state. This causes the caller __block_write_begin_int() or others to request a read even though the buffer is not mapped, resulting in a BUG_ON check for the BH_Mapped flag in submit_bh_wbc() failing.

  This fixes the first issue by changing the return value to code -EINVAL when a conversion using DAT fails with code -ENOENT, avoiding the conflicting condition that leads to the kernel bug described above. Here, code -EINVAL indicates that metadata corruption was detected during the block lookup, which will be properly handled as a file system error and converted to -EIO when passing through the nilfs2 bmap layer.

  Link: https://lkml.kernel.org/r/20240313105827.5296-1-konishi.ryusuke@gmail.com
  Link: https://lkml.kernel.org/r/20240313105827.5296-2-konishi.ryusuke@gmail.com
  Fixes: c3a7abf06ce7 ("nilfs2: support contiguous lookup of blocks")
  Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
  Reported-by: syzbot+cfed5b56649bddf80d6e@syzkaller.appspotmail.com
  Closes: https://syzkaller.appspot.com/bug?extid=cfed5b56649bddf80d6e
  Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
  Cc: <stable@vger.kernel.org>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
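  A sketch of the error-code mapping described above, at the point where the btree/direct mapping consults the DAT (simplified fragment; not the literal diff):

    ret = nilfs_dat_translate(dat, ptr, &blocknr);
    if (ret < 0) {
    	/*
    	 * -ENOENT from DAT translation here means the DAT metadata is
    	 * corrupted, not that the block legitimately does not exist.
    	 * Passing -ENOENT up would make nilfs_get_block() treat an
    	 * existing block as absent; report metadata corruption instead,
    	 * which the bmap layer turns into a filesystem error (-EIO).
    	 */
    	if (ret == -ENOENT)
    		ret = -EINVAL;
    	return ret;
    }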
* f2fs: truncate page cache before clearing flags when aborting atomic write (Sunmin Jeong, 2024-04-03; 1 file, -1/+3)

  [ Upstream commit 74b0ebcbdde4c7fe23c979e4cfc2fdbf349c39a3 ]

  In f2fs_do_write_data_page, the FI_ATOMIC_FILE flag selects the target inode between the original inode and COW inode. When aborting atomic write and writeback occur simultaneously, invalid data can be written to the original inode if the FI_ATOMIC_FILE flag is cleared meanwhile.

  To prevent the problem, let's truncate all pages before clearing the flag.

    Atomic write thread                       Writeback thread
    f2fs_abort_atomic_write
      clear_inode_flag(inode, FI_ATOMIC_FILE)
                                              __writeback_single_inode
                                                do_writepages
                                                  f2fs_do_write_data_page
                                                    - use dn of original inode
      truncate_inode_pages_final

  Fixes: 3db1de0e582c ("f2fs: change the current atomic write way")
  Cc: stable@vger.kernel.org #v5.19+
  Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
  Reviewed-by: Yeongjin Gil <youngjin.gil@samsung.com>
  Signed-off-by: Sunmin Jeong <s_min.jeong@samsung.com>
  Reviewed-by: Daeho Jeong <daehojeong@google.com>
  Reviewed-by: Chao Yu <chao@kernel.org>
  Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* f2fs: mark inode dirty for FI_ATOMIC_COMMITTED flag (Sunmin Jeong, 2024-04-03; 1 file, -0/+1)

  [ Upstream commit 4bf78322346f6320313683dc9464e5423423ad5c ]

  In f2fs_update_inode, i_size of the atomic file isn't updated until the FI_ATOMIC_COMMITTED flag is set. When committing atomic write right after the writeback of the inode, i_size of the raw inode will not be updated. It can cause atomicity corruption due to a mismatch between old file size and new data.

  To prevent the problem, let's mark the inode dirty for FI_ATOMIC_COMMITTED.

    Atomic write thread                       Writeback thread
                                              __writeback_single_inode
                                                write_inode
                                                  f2fs_update_inode
                                                    - skip i_size update
    f2fs_ioc_commit_atomic_write
      f2fs_commit_atomic_write
        set_inode_flag(inode, FI_ATOMIC_COMMITTED)
      f2fs_do_sync_file
        f2fs_fsync_node_pages
          - skip f2fs_update_inode since the inode is clean

  Fixes: 3db1de0e582c ("f2fs: change the current atomic write way")
  Cc: stable@vger.kernel.org #v5.19+
  Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
  Reviewed-by: Yeongjin Gil <youngjin.gil@samsung.com>
  Signed-off-by: Sunmin Jeong <s_min.jeong@samsung.com>
  Reviewed-by: Daeho Jeong <daehojeong@google.com>
  Reviewed-by: Chao Yu <chao@kernel.org>
  Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
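  A sketch of the intent (not the literal one-line diff; placement in the commit path is illustrative, the helpers are the standard f2fs ones):

    /* in the atomic-commit path, once the data has been committed: */
    set_inode_flag(inode, FI_ATOMIC_COMMITTED);
    /*
     * Mark the inode dirty so a following fsync/writeback cannot see a
     * "clean" inode, skip f2fs_update_inode(), and leave the on-disk
     * i_size stale relative to the newly committed data.
     */
    f2fs_mark_inode_dirty_sync(inode, true);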
* dlm: fix user space lkb refcounting (Alexander Aring, 2024-04-03; 1 file, -5/+5)

  [ Upstream commit 2ab3d705ca5d4f7ea345a21c3da41a447a549649 ]

  This patch fixes the code to check the right return value when deciding whether it was the last callback. The rv variable got overwritten by the return of copy_result_to_user(). Fix it by introducing a second variable for the return value so that rv is not overwritten.

  Cc: stable@vger.kernel.org
  Fixes: 61bed0baa4db ("fs: dlm: use a non-static queue for callbacks")
  Reported-by: Valentin Vidić <vvidic@valentin-vidic.from.hr>
  Closes: https://lore.kernel.org/gfs2/Ze4qSvzGJDt5yxC3@valentin-vidic.from.hr
  Signed-off-by: Alexander Aring <aahringo@redhat.com>
  Signed-off-by: David Teigland <teigland@redhat.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* ksmbd: retrieve number of blocks using vfs_getattr in set_file_allocation_info (Marios Makassikis, 2024-04-03; 1 file, -2/+8)

  [ Upstream commit 34cd86b6632718b7df3999d96f51e63de41c5e4f ]

  Use vfs_getattr() to retrieve stat information, rather than make assumptions about how a filesystem fills inode structs.

  Cc: stable@vger.kernel.org
  Signed-off-by: Marios Makassikis <mmakassikis@freebox.fr>
  Acked-by: Namjae Jeon <linkinjeon@kernel.org>
  Signed-off-by: Steve French <stfrench@microsoft.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* ksmbd: replace generic_fillattr with vfs_getattr (Marios Makassikis, 2024-04-03; 3 files, -66/+127)

  [ Upstream commit 5614c8c487f6af627614dd2efca038e4afe0c6d7 ]

  generic_fillattr should not be used outside of ->getattr implementations.

  Use vfs_getattr instead, and adapt functions to return an error code to the caller.

  Cc: stable@vger.kernel.org
  Signed-off-by: Marios Makassikis <mmakassikis@freebox.fr>
  Acked-by: Namjae Jeon <linkinjeon@kernel.org>
  Signed-off-by: Steve French <stfrench@microsoft.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* eventfd: simplify eventfd_signal() (Christian Brauner, 2024-04-03; 2 files, -8/+5)

  [ Upstream commit 3652117f854819a148ff0fbe4492587d3520b5e5 ]

  Ever since the eventfd type was introduced back in 2007 in commit e1ad7468c77d ("signal/timer/event: eventfd core") the eventfd_signal() function only ever passed 1 as a value for @n. There's no point in keeping that additional argument.

  Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-2-bd549b14ce0c@kernel.org
  Acked-by: Xu Yilun <yilun.xu@intel.com>
  Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl
  Acked-by: Eric Farman <farman@linux.ibm.com> # s390
  Reviewed-by: Jan Kara <jack@suse.cz>
  Reviewed-by: Jens Axboe <axboe@kernel.dk>
  Signed-off-by: Christian Brauner <brauner@kernel.org>
  Stable-dep-of: 675daf435e9f ("vfio/platform: Create persistent IRQ handlers")
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* cifs: allow changing password during remount (Steve French, 2024-04-03; 4 files, -5/+30)

  [ Upstream commit c1eb537bf4560b3ad4df606c266c665624f3b502 ]

  There are cases where a session is disconnected and password has changed on the server (or expired) for this user and this currently can not be fixed without unmount and mounting again. This patch allows remount to change the password (for the non Kerberos case, Kerberos ticket refresh is handled differently) when the session is disconnected and the user can not reconnect due to still using old password.

  Future patches should also allow us to setup the keyring (cifscreds) to have an "alternate password" so we would be able to change the password before the session drops (without the risk of races between when the password changes and the disconnect occurs - ie cases where the old password is still needed because the new password has not fully rolled out to all servers yet).

  Cc: stable@vger.kernel.org
  Signed-off-by: Steve French <stfrench@microsoft.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* cifs: prevent updating file size from server if we have a read/write lease (Bharath SM, 2024-04-03; 4 files, -12/+17)

  [ Upstream commit e4b61f3b1c67f5068590965f64ea6e8d5d5bd961 ]

  In cases of large directories, the readdir operation may span multiple round trips to retrieve contents. This introduces a potential race condition in case of concurrent write and readdir operations. If the readdir operation initiates before a write has been processed by the server, it may update the file size attribute to an older value. Address this issue by avoiding file size updates from readdir when we have a read/write lease.

  Scenario:
    1) process1: open dir xyz
    2) process1: readdir instance 1 on xyz
    3) process2: create file.txt for write
    4) process2: write x bytes to file.txt
    5) process2: close file.txt
    6) process2: open file.txt for read
    7) process1: readdir 2 - overwrites file.txt inode size to 0
    8) process2: read contents of file.txt - bug, short read with 0 bytes

  Cc: stable@vger.kernel.org
  Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
  Signed-off-by: Bharath SM <bharathsm@microsoft.com>
  Signed-off-by: Steve French <stfrench@microsoft.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* smb: client: stop revalidating reparse points unnecessarily (Paulo Alcantara, 2024-04-03; 3 files, -81/+57)

  [ Upstream commit 6d039984c15d1ea1ca080176df6dfab443e44585 ]

  Query dir responses don't provide enough information on reparse points such as major/minor numbers and symlink targets other than reparse tags, however we don't need to unconditionally revalidate them only because they are reparse points. Instead, revalidate them only when their ctime or reparse tag has changed.

  For instance, Windows Server updates ctime of reparse points when their data have changed.

  Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
  Signed-off-by: Steve French <stfrench@microsoft.com>
  Stable-dep-of: e4b61f3b1c67 ("cifs: prevent updating file size from server if we have a read/write lease")
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* NFS: Read unlock folio on nfs_page_create_from_folio() error (Benjamin Coddington, 2024-04-03; 1 file, -0/+2)

  [ Upstream commit 11974eec839c167362af685aae5f5e1baaf979eb ]

  The netfs conversion lost a folio_unlock() for the case where nfs_page_create_from_folio() returns an error (usually -ENOMEM). Restore it.

  Reported-by: David Jeffery <djeffery@redhat.com>
  Cc: <stable@vger.kernel.org> # 6.4+
  Fixes: 000dbe0bec05 ("NFS: Convert buffered read paths to use netfs when fscache is enabled")
  Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
  Acked-by: Dave Wysochanski <dwysocha@redhat.com>
  Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* nfs: fix UAF in direct writes (Josef Bacik, 2024-04-03; 2 files, -3/+10)

  [ Upstream commit 17f46b803d4f23c66cacce81db35fef3adb8f2af ]

  In production we have been hitting the following warning consistently

    ------------[ cut here ]------------
    refcount_t: underflow; use-after-free.
    WARNING: CPU: 17 PID: 1800359 at lib/refcount.c:28 refcount_warn_saturate+0x9c/0xe0
    Workqueue: nfsiod nfs_direct_write_schedule_work [nfs]
    RIP: 0010:refcount_warn_saturate+0x9c/0xe0
    PKRU: 55555554
    Call Trace:
     <TASK>
     ? __warn+0x9f/0x130
     ? refcount_warn_saturate+0x9c/0xe0
     ? report_bug+0xcc/0x150
     ? handle_bug+0x3d/0x70
     ? exc_invalid_op+0x16/0x40
     ? asm_exc_invalid_op+0x16/0x20
     ? refcount_warn_saturate+0x9c/0xe0
     nfs_direct_write_schedule_work+0x237/0x250 [nfs]
     process_one_work+0x12f/0x4a0
     worker_thread+0x14e/0x3b0
     ? ZSTD_getCParams_internal+0x220/0x220
     kthread+0xdc/0x120
     ? __btf_name_valid+0xa0/0xa0
     ret_from_fork+0x1f/0x30

  This is because we're completing the nfs_direct_request twice in a row.

  The source of this is when we have our commit requests to submit, we process them and send them off, and then in the completion path for the commit requests we have

    if (nfs_commit_end(cinfo.mds))
            nfs_direct_write_complete(dreq);

  However since we're submitting asynchronous requests we sometimes have one that completes before we submit the next one, so we end up calling complete on the nfs_direct_request twice.

  The only other place we use nfs_generic_commit_list() is in __nfs_commit_inode, which wraps this call in a

    nfs_commit_begin();
    nfs_commit_end();

  Which is a common pattern for this style of completion handling, one that is also repeated in the direct code with get_dreq()/put_dreq() calls around where we process events as well as in the completion paths.

  Fix this by using the same pattern for the commit requests.

  Before with my 200 node rocksdb stress running this warning would pop every 10ish minutes. With my patch the stress test has been running for several hours without popping.

  Signed-off-by: Josef Bacik <josef@toxicpanda.com>
  Cc: stable@vger.kernel.org
  Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
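  A sketch of the begin/end pattern the fix applies around commit submission (simplified; the request submission loop is elided and placement is illustrative):

    nfs_commit_begin(cinfo.mds);	/* hold the batch open while submitting */

    /* ... submit the asynchronous commit requests here ... */

    /*
     * Drop our hold last: even if every submitted commit already finished,
     * the dreq is completed exactly once, by whoever drops the count to zero.
     */
    if (nfs_commit_end(cinfo.mds))
    	nfs_direct_write_complete(dreq);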
* debugfs: fix wait/cancellation handling during remove (Johannes Berg, 2024-04-03; 1 file, -5/+20)

  [ Upstream commit 952c3fce297f12c7ff59380adb66b564e2bc9b64 ]

  Ben Greear further reports deadlocks during concurrent debugfs remove while files are being accessed, even though the code in question now uses debugfs cancellations. Turns out that despite all the review on the locking, we missed completely that the logic is wrong: if the refcount hits zero we can finish (and need not wait for the completion), but if it doesn't we have to trigger all the cancellations. As written, we can _never_ get into the loop triggering the cancellations. Fix this, and explain it better while at it.

  Cc: stable@vger.kernel.org
  Fixes: 8c88a474357e ("debugfs: add API to allow debugfs operations cancellation")
  Reported-by: Ben Greear <greearb@candelatech.com>
  Closes: https://lore.kernel.org/r/1c9fa9e5-09f1-0522-fdbc-dbcef4d255ca@candelatech.com
  Tested-by: Madhan Sai <madhan.singaraju@candelatech.com>
  Signed-off-by: Johannes Berg <johannes.berg@intel.com>
  Link: https://lore.kernel.org/r/20240229153635.6bfab7eb34d3.I6c7aeff8c9d6628a8bc1ddcf332205a49d801f17@changeid
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* ext4: fix corruption during on-line resize (Maximilian Heyne, 2024-04-03; 1 file, -1/+2)

  [ Upstream commit a6b3bfe176e8a5b05ec4447404e412c2a3fc92cc ]

  We observed a corruption during on-line resize of a file system that is larger than 16 TiB with 4k block size. With more than 2^32 blocks resize_inode is turned off by default by mke2fs. The issue can be reproduced on a smaller file system for convenience by explicitly turning off resize_inode. An on-line resize across an 8 GiB boundary (the size of a meta block group in this setup) then leads to a corruption:

    dev=/dev/<some_dev> # should be >= 16 GiB
    mkdir -p /corruption
    /sbin/mke2fs -t ext4 -b 4096 -O ^resize_inode $dev $((2 * 2**21 - 2**15))
    mount -t ext4 $dev /corruption

    dd if=/dev/zero bs=4096 of=/corruption/test count=$((2*2**21 - 4*2**15))
    sha1sum /corruption/test
    # 79d2658b39dcfd77274e435b0934028adafaab11  /corruption/test

    /sbin/resize2fs $dev $((2*2**21))
    # drop page cache to force reload the block from disk
    echo 1 > /proc/sys/vm/drop_caches

    sha1sum /corruption/test
    # 3c2abc63cbf1a94c9e6977e0fbd72cd832c4d5c3  /corruption/test

  2^21 = 2^15*2^6 equals 8 GiB whereof 2^15 is the number of blocks per block group and 2^6 are the number of block groups that make a meta block group.

  The last checksum might be different depending on how the file is laid out across the physical blocks. The actual corruption occurs at physical block 63*2^15 = 2064384 which would be the location of the backup of the meta block group's block descriptor. During the on-line resize the file system will be converted to meta_bg starting at s_first_meta_bg which is 2 in the example - meaning all block groups after 16 GiB. However, in ext4_flex_group_add we might add block groups that are not part of the first meta block group yet. In the reproducer we achieved this by subtracting the size of a whole block group from the point where the meta block group would start. This must be considered when updating the backup block group descriptors to follow the non-meta_bg layout. The fix is to add a test whether the group to add is already part of the meta block group or not.

  Fixes: 01f795f9e0d67 ("ext4: add online resizing support for meta_bg and 64-bit file systems")
  Cc: <stable@vger.kernel.org>
  Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
  Tested-by: Srivathsa Dara <srivathsa.d.dara@oracle.com>
  Reviewed-by: Srivathsa Dara <srivathsa.d.dara@oracle.com>
  Link: https://lore.kernel.org/r/20240215155009.94493-1-mheyne@amazon.de
  Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* btrfs: fix off-by-one chunk length calculation at contains_pending_extent() (Filipe Manana, 2024-04-03; 1 file, -1/+1)

  [ Upstream commit ae6bd7f9b46a29af52ebfac25d395757e2031d0d ]

  At contains_pending_extent() the value of the end offset of a chunk we found in the device's allocation state io tree is inclusive, so when we calculate the length we pass to the in_range() macro, we must sum 1 to the expression "physical_end - physical_offset".

  In practice the wrong calculation should be harmless as chunks sizes are never 1 byte and we should never have 1 byte ranges of unallocated space. Nevertheless fix the wrong calculation.

  Reported-by: Alex Lyakas <alex.lyakas@zadara.com>
  Link: https://lore.kernel.org/linux-btrfs/CAOcd+r30e-f4R-5x-S7sV22RJPe7+pgwherA6xqN2_qe7o4XTg@mail.gmail.com/
  Fixes: 1c11b63eff2a ("btrfs: replace pending/pinned chunks lists with io tree")
  CC: stable@vger.kernel.org # 6.1+
  Reviewed-by: Josef Bacik <josef@toxicpanda.com>
  Reviewed-by: Qu Wenruo <wqu@suse.com>
  Signed-off-by: Filipe Manana <fdmanana@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* btrfs: qgroup: always free reserved space for extent records (Qu Wenruo, 2024-04-03; 1 file, -5/+5)

  [ Upstream commit d139ded8b9cdb897bb9539eb33311daf9a177fd2 ]

  [BUG]
  If qgroup is marked inconsistent (e.g. caused by operations needing full subtree rescan, like creating a snapshot and assign to a higher level qgroup), btrfs would immediately start leaking its data reserved space.

  The following script can easily reproduce it:

    mkfs.btrfs -O quota -f $dev
    mount $dev $mnt
    btrfs subvolume create $mnt/subv1
    btrfs qgroup create 1/0 $mnt

    # This snapshot creation would mark qgroup inconsistent,
    # as the ownership involves different higher level qgroup, thus
    # we have to rescan both source and snapshot, which can be very
    # time consuming, thus here btrfs just choose to mark qgroup
    # inconsistent, and let users to determine when to do the rescan.
    btrfs subv snapshot -i 1/0 $mnt/subv1 $mnt/snap1

    # Now this write would lead to qgroup rsv leak.
    xfs_io -f -c "pwrite 0 64k" $mnt/file1

    # And at unmount time, btrfs would report 64K DATA rsv space leaked.
    umount $mnt

  And we would have the following dmesg output for the unmount:

    BTRFS info (device dm-1): last unmount of filesystem 14a3d84e-f47b-4f72-b053-a8a36eef74d3
    BTRFS warning (device dm-1): qgroup 0/5 has unreleased space, type 0 rsv 65536

  [CAUSE]
  Since commit e15e9f43c7ca ("btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting"), we introduce a mode for btrfs qgroup to skip the time-consuming backref walk, if the qgroup is already inconsistent.

  But this skip also covered the data reserved freeing, thus the qgroup reserved space for each newly created data extent would not be freed, thus cause the leakage.

  [FIX]
  Make the data extent reserved space freeing mandatory.

  The qgroup reserved space handling is way cheaper compared to the backref walking part, and we always have the super sensitive leak detector, thus it's definitely worth to always free the qgroup reserved data space.

  Reported-by: Fabian Vogt <fvogt@suse.com>
  Fixes: e15e9f43c7ca ("btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting")
  CC: stable@vger.kernel.org # 6.1+
  Link: https://bugzilla.suse.com/show_bug.cgi?id=1216196
  Reviewed-by: Filipe Manana <fdmanana@suse.com>
  Signed-off-by: Qu Wenruo <wqu@suse.com>
  Signed-off-by: David Sterba <dsterba@suse.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* fuse: don't unhash root (Miklos Szeredi, 2024-04-03; 2 files, -3/+5)

  [ Upstream commit b1fe686a765e6c0d71811d825b5a1585a202b777 ]

  The root inode is assumed to be always hashed. Do not unhash the root inode even if it is marked BAD.

  Fixes: 5d069dbe8aaf ("fuse: fix bad inode")
  Cc: <stable@vger.kernel.org> # v5.11
  Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* fuse: fix root lookup with nonzero generation (Miklos Szeredi, 2024-04-03; 1 file, -0/+4)

  [ Upstream commit 68ca1b49e430f6534d0774a94147a823e3b8b26e ]

  The root inode has a fixed nodeid and generation (1, 0).

  Prior to the commit 15db16837a35 ("fuse: fix illegal access to inode with reused nodeid") the generation number on lookup was ignored. After this commit a lookup with the wrong generation number resulted in the inode being unhashed. This is correct for non-root inodes, but replacing the root inode is wrong and results in weird behavior.

  Fix by reverting to the old behavior of ignoring the generation for the root inode, but issuing a warning in dmesg.

  Reported-by: Antonio SJ Musumeci <trapexit@spawn.link>
  Closes: https://lore.kernel.org/all/CAOQ4uxhek5ytdN8Yz2tNEOg5ea4NkBb4nk0FGPjPk_9nz-VG3g@mail.gmail.com/
  Fixes: 15db16837a35 ("fuse: fix illegal access to inode with reused nodeid")
  Cc: <stable@vger.kernel.org> # v5.14
  Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* fuse: replace remaining make_bad_inode() with fuse_make_bad() (Miklos Szeredi, 2024-04-03; 1 file, -1/+1)

  [ Upstream commit 82e081aebe4d9c26e196c8260005cc4762b57a5d ]

  fuse_do_statx() was added with the wrong helper.

  Fixes: d3045530bdd2 ("fuse: implement statx")
  Cc: <stable@vger.kernel.org> # v6.6
  Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* ubifs: Set page uptodate in the correct place (Matthew Wilcox (Oracle), 2024-04-03; 1 file, -9/+4)

  [ Upstream commit 723012cab779eee8228376754e22c6594229bf8f ]

  Page cache reads are lockless, so setting the freshly allocated page uptodate before we've overwritten it with the data it's supposed to have in it will allow a simultaneous reader to see old data. Move the call to SetPageUptodate into ubifs_write_end(), which is after we copied the new data into the page.

  Fixes: 1e51764a3c2a ("UBIFS: add new flash file system")
  Cc: stable@vger.kernel.org
  Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
  Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
  Signed-off-by: Richard Weinberger <richard@nod.at>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* fuse: fix VM_MAYSHARE and direct_io_allow_mmap (Bernd Schubert, 2024-04-03; 1 file, -2/+6)

  [ Upstream commit 9511176bbaee0ac60ecc84e7b01cf5972a59ea17 ]

  There were multiple issues with direct_io_allow_mmap:

    - fuse_link_write_file() was missing, resulting in warnings in fuse_write_file_get() and EIO from msync()
    - "vma->vm_ops = &fuse_file_vm_ops" was not set, but especially fuse_page_mkwrite is needed.

  The semantics of invalidate_inode_pages2() is so far not clearly defined in fuse_file_mmap. It dates back to commit 3121bfe76311 ("fuse: fix "direct_io" private mmap") Though, as direct_io_allow_mmap is a new feature, that was for MAP_PRIVATE only. As invalidate_inode_pages2() is calling into fuse_launder_folio() and writes out dirty pages, it should be safe to call invalidate_inode_pages2 for MAP_PRIVATE and MAP_SHARED as well.

  Cc: Hao Xu <howeyxu@tencent.com>
  Cc: stable@vger.kernel.org
  Fixes: e78662e818f9 ("fuse: add a new fuse init flag to relax restrictions in no cache mode")
  Signed-off-by: Bernd Schubert <bschubert@ddn.com>
  Reviewed-by: Amir Goldstein <amir73il@gmail.com>
  Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* fat: fix uninitialized field in nostale filehandles (Jan Kara, 2024-04-03; 1 file, -0/+6)

  [ Upstream commit fde2497d2bc3a063d8af88b258dbadc86bd7b57c ]

  When fat_encode_fh_nostale() encodes a file handle without a parent it stores only the first 10 bytes of the file handle. However the length of the file handle must be a multiple of 4, so the file handle is actually 12 bytes long and the last two bytes remain uninitialized. This is not great as we potentially leak uninitialized information with the handle to userspace. Properly initialize the full handle length.

  Link: https://lkml.kernel.org/r/20240205122626.13701-1-jack@suse.cz
  Reported-by: syzbot+3ce5dea5b1539ff36769@syzkaller.appspotmail.com
  Fixes: ea3983ace6b7 ("fat: restructure export_operations")
  Signed-off-by: Jan Kara <jack@suse.cz>
  Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
  Cc: Amir Goldstein <amir73il@gmail.com>
  Cc: <stable@vger.kernel.org>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* ext4: correct best extent lstart adjustment logic (Baokun Li, 2024-04-03; 1 file, -6/+11)

  [ Upstream commit 4fbf8bc733d14bceb16dda46a3f5e19c6a9621c5 ]

  When yangerkun reviewed commit 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()"), it was found that the best extent did not completely cover the original request after adjusting the best extent lstart in ext4_mb_new_inode_pa() as follows:

    original request: 2/10(8)
    normalized request: 0/64(64)
    best extent: 0/9(9)

  When we check if best ex can be kept at start of goal, ac_o_ex.fe_logical is 2, less than the adjusted best extent logical end 9, so we think the adjustment is done. But obviously 0/9(9) doesn't cover 2/10(8), so we should determine here if the original request logical end is less than or equal to the adjusted best extent logical end.

  In addition, add a comment stating when adjusted best_ex will not cover the original request, and remove the duplicate assertion because adjusting lstart makes no change to b_ex.fe_len.

  Link: https://lore.kernel.org/r/3630fa7f-b432-7afd-5f79-781bc3b2c5ea@huawei.com
  Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()")
  Cc: <stable@kernel.org>
  Signed-off-by: yangerkun <yangerkun@huawei.com>
  Signed-off-by: Baokun Li <libaokun1@huawei.com>
  Reviewed-by: Jan Kara <jack@suse.cz>
  Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
  Link: https://lore.kernel.org/r/20240201141845.1879253-1-libaokun1@huawei.com
  Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
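  A sketch of the corrected coverage test (simplified; field names follow the allocation context mentioned in the message and the surrounding ext4_mb_new_inode_pa() logic is elided):

    ext4_lblk_t o_end = ac->ac_o_ex.fe_logical + ac->ac_o_ex.fe_len; /* 2 + 8 = 10 */
    ext4_lblk_t b_end = ac->ac_b_ex.fe_logical + ac->ac_b_ex.fe_len; /* 0 + 9 = 9  */

    if (o_end <= b_end) {
    	/* adjusted best extent still covers the original request: keep goal start */
    } else {
    	/* 10 > 9 here: 0/9(9) does not cover 2/10(8), so anchor the extent at the original end instead */
    }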
* ceph: stop copying to iter at EOF on sync reads (Xiubo Li, 2024-03-26; 1 file, -10/+13)

  [ Upstream commit 1065da21e5df9d843d2c5165d5d576be000142a6 ]

  If EOF is encountered, ceph_sync_read() return value is adjusted down according to i_size, but the "to" iter is advanced by the actual number of bytes read. Then, when retrying, the remainder of the range may be skipped incorrectly.

  Ensure that the "to" iter is advanced only until EOF.

  [ idryomov: changelog ]

  Fixes: c3d8e0b5de48 ("ceph: return the real size read when it hits EOF")
  Reported-by: Frank Hsiao <frankhsiao@qnap.com>
  Signed-off-by: Xiubo Li <xiubli@redhat.com>
  Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
  Tested-by: Frank Hsiao <frankhsiao@qnap.com>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* bcachefs: Fix BTREE_ITER_FILTER_SNAPSHOTS on inodes btree (Kent Overstreet, 2024-03-26; 1 file, -1/+3)

  commit 204f45140faa0772d2ca1b3de96d1c0fb3db8e77 upstream.

  If we're in FILTER_SNAPSHOTS mode and we start scanning a range of the keyspace where no keys are visible in the current snapshot, we have a problem - we'll scan for a very long time before scanning terminates.

  Awhile back, this was fixed for most cases with peek_upto() (and assertions that enforce that it's being used).

  But the fix missed the fact that the inodes btree is different - every key offset is in a different snapshot tree, not just the inode field.

  Fixes:
  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bcachefs: fix simulateously upgrading & downgrading (Kent Overstreet, 2024-03-26; 1 file, -3/+12)

  commit e7999235e6c437b99fad03d8301a4341be0d2bb5 upstream.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bcachefs: check for failure to downgrade (Kent Overstreet, 2024-03-26; 2 files, -0/+13)

  commit 447c1c01051251853e851bd77f26546488cbc7b7 upstream.

  With the upcoming member seq patch, it's now critical that we don't ever write to a superblock that hasn't been version downgraded - failure to update member seq fields will cause split brain detection to fire erroneously.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bcachefs: install fd later to avoid race with close (Mathias Krause, 2024-03-26; 1 file, -2/+1)

  commit dd839f31d7cd5e04f4111a219024268c6f6973f0 upstream.

  Calling fd_install() makes a file reachable for userland, including the possibility to close the file descriptor, which leads to calling its 'release' hook. If that happens before the code had a chance to bump the reference of the newly created task struct, the release callback will call put_task_struct() too early, leading to the premature destruction of the kernel thread.

  Avoid that race by calling fd_install() later, after all the setup is done.

  Fixes: 1c6fdbd8f246 ("bcachefs: Initial commit")
  Signed-off-by: Mathias Krause <minipli@grsecurity.net>
  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bcachefs: Fix build on parisc by avoiding __multi3() (Helge Deller, 2024-03-26; 1 file, -1/+1)

  commit eba38cc7578bef94865341c73608bdf49193a51d upstream.

  The gcc compiler on parisc does support the __int128 type, although the architecture does not have native 128-bit support.

  The effect is, that the bcachefs u128_square() function will pull in the libgcc __multi3() helper, which breaks the kernel build when bcachefs is built as module since this function isn't currently exported in arch/parisc/kernel/parisc_ksyms.c.

  The build failure can be seen in the latest debian kernel build at:
  https://buildd.debian.org/status/fetch.php?pkg=linux&arch=hppa&ver=6.7.1-1%7Eexp1&stamp=1706132569&raw=0

  We prefer to not export that symbol, so fall back to the optional 64-bit implementation provided by bcachefs and thus avoid usage of __multi3().

  Signed-off-by: Helge Deller <deller@gmx.de>
  Cc: Kent Overstreet <kent.overstreet@linux.dev>
  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* nfs: fix panic when nfs4_ff_layout_prepare_ds() fails (Josef Bacik, 2024-03-26; 1 file, -1/+1)

  [ Upstream commit 719fcafe07c12646691bd62d7f8d94d657fa0766 ]

  We've been seeing the following panic in production

    BUG: kernel NULL pointer dereference, address: 0000000000000065
    PGD 2f485f067 P4D 2f485f067 PUD 2cc5d8067 PMD 0
    RIP: 0010:ff_layout_cancel_io+0x3a/0x90 [nfs_layout_flexfiles]
    Call Trace:
     <TASK>
     ? __die+0x78/0xc0
     ? page_fault_oops+0x286/0x380
     ? __rpc_execute+0x2c3/0x470 [sunrpc]
     ? rpc_new_task+0x42/0x1c0 [sunrpc]
     ? exc_page_fault+0x5d/0x110
     ? asm_exc_page_fault+0x22/0x30
     ? ff_layout_free_layoutreturn+0x110/0x110 [nfs_layout_flexfiles]
     ? ff_layout_cancel_io+0x3a/0x90 [nfs_layout_flexfiles]
     ? ff_layout_cancel_io+0x6f/0x90 [nfs_layout_flexfiles]
     pnfs_mark_matching_lsegs_return+0x1b0/0x360 [nfsv4]
     pnfs_error_mark_layout_for_return+0x9e/0x110 [nfsv4]
     ? ff_layout_send_layouterror+0x50/0x160 [nfs_layout_flexfiles]
     nfs4_ff_layout_prepare_ds+0x11f/0x290 [nfs_layout_flexfiles]
     ff_layout_pg_init_write+0xf0/0x1f0 [nfs_layout_flexfiles]
     __nfs_pageio_add_request+0x154/0x6c0 [nfs]
     nfs_pageio_add_request+0x26b/0x380 [nfs]
     nfs_do_writepage+0x111/0x1e0 [nfs]
     nfs_writepages_callback+0xf/0x30 [nfs]
     write_cache_pages+0x17f/0x380
     ? nfs_pageio_init_write+0x50/0x50 [nfs]
     ? nfs_writepages+0x6d/0x210 [nfs]
     ? nfs_writepages+0x6d/0x210 [nfs]
     nfs_writepages+0x125/0x210 [nfs]
     do_writepages+0x67/0x220
     ? generic_perform_write+0x14b/0x210
     filemap_fdatawrite_wbc+0x5b/0x80
     file_write_and_wait_range+0x6d/0xc0
     nfs_file_fsync+0x81/0x170 [nfs]
     ? nfs_file_mmap+0x60/0x60 [nfs]
     __x64_sys_fsync+0x53/0x90
     do_syscall_64+0x3d/0x90
     entry_SYSCALL_64_after_hwframe+0x46/0xb0

  Inspecting the core with drgn I was able to pull this

    >>> prog.crashed_thread().stack_trace()[0]
    #0 at 0xffffffffa079657a (ff_layout_cancel_io+0x3a/0x84) in ff_layout_cancel_io at fs/nfs/flexfilelayout/flexfilelayout.c:2021:27
    >>> prog.crashed_thread().stack_trace()[0]['idx']
    (u32)1
    >>> prog.crashed_thread().stack_trace()[0]['flseg'].mirror_array[1].mirror_ds
    (struct nfs4_ff_layout_ds *)0xffffffffffffffed

  This is clear from the stack trace, we call nfs4_ff_layout_prepare_ds() which could error out initializing the mirror_ds, and then we go to clean it all up and our check is only for if (!mirror->mirror_ds). This is inconsistent with the rest of the users of mirror_ds, which have if (IS_ERR_OR_NULL(mirror_ds)) to keep from tripping over this exact scenario.

  Fix this up in ff_layout_cancel_io() to make sure we don't panic when we get an error. I also spot checked all the other instances of checking mirror_ds and we appear to be doing the correct checks everywhere, only unconditionally dereferencing mirror_ds when we know it would be valid.

  Signed-off-by: Josef Bacik <josef@toxicpanda.com>
  Fixes: b739a5bd9d9f ("NFSv4/flexfiles: Cancel I/O if the layout is recalled or revoked")
  Reviewed-by: Jeff Layton <jlayton@kernel.org>
  Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
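  A sketch of the hardened check in ff_layout_cancel_io() (simplified fragment; the loop shape and the cancellation step are illustrative):

    for (idx = 0; idx < flseg->mirror_array_cnt; idx++) {
    	mirror = flseg->mirror_array[idx];
    	mirror_ds = mirror->mirror_ds;
    	/* was: if (!mirror_ds), which misses the ERR_PTR() left by a failed prepare_ds */
    	if (IS_ERR_OR_NULL(mirror_ds))
    		continue;
    	/* ... cancel in-flight I/O against this mirror's data server ... */
    }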
* afs: Revert "afs: Hide silly-rename files from userspace" (David Howells, 2024-03-26; 1 file, -10/+0)

  [ Upstream commit 0aec3847d044273733285dcff90afda89ad461d2 ]

  This reverts commit 57e9d49c54528c49b8bffe6d99d782ea051ea534.

  This undoes the hiding of .__afsXXXX silly-rename files. The problem with hiding them is that rm can't then manually delete them.

  This also reverts commit 5f7a07646655fb4108da527565dcdc80124b14c4 ("afs: Fix endless loop in directory parsing") as that's a bugfix for the above.

  Fixes: 57e9d49c5452 ("afs: Hide silly-rename files from userspace")
  Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
  Link: https://lists.infradead.org/pipermail/linux-afs/2024-February/008102.html
  Signed-off-by: David Howells <dhowells@redhat.com>
  Link: https://lore.kernel.org/r/3085695.1710328121@warthog.procyon.org.uk
  Reviewed-by: Jeffrey E Altman <jaltman@auristor.com>
  cc: Marc Dionne <marc.dionne@auristor.com>
  cc: linux-afs@lists.infradead.org
  Signed-off-by: Christian Brauner <brauner@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* f2fs: zone: fix to remove pow2 check condition for zoned block device (Chao Yu, 2024-03-26; 1 file, -5/+0)

  [ Upstream commit 11bec96afbfbc4679863db55258de440d786821e ]

  Commit 2e2c6e9b72ce ("f2fs: remove power-of-two limitation of zoned device") missed to remove the pow2 check condition in init_blkz_info(), fix it.

  Fixes: 2e2c6e9b72ce ("f2fs: remove power-of-two limitation of zoned device")
  Signed-off-by: Feng Song <songfeng@oppo.com>
  Signed-off-by: Yongpeng Yang <yangyongpeng1@oppo.com>
  Signed-off-by: Chao Yu <chao@kernel.org>
  Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
* f2fs: fix to truncate meta inode pages forcely (Chao Yu, 2024-03-26; 3 files, -6/+32)

  [ Upstream commit 9f0c4a46be1fe9b97dbe66d49204c1371e3ece65 ]

  Below race case can cause data corruption:

    Thread A                                GC thread
                                            - gc_data_segment
                                             - ra_data_block
                                              - locked meta_inode page
    - f2fs_inplace_write_data
     - invalidate_mapping_pages
     : fail to invalidate meta_inode page
       due to lock failure or dirty|writeback status
     - f2fs_submit_page_bio
     : write last dirty data to old blkaddr
                                             - move_data_block
                                              - load old data from meta_inode page
                                              - f2fs_submit_page_write
                                              : write old data to new blkaddr

  Because invalidate_mapping_pages() will skip invalidating page which has unclear status including locked, dirty, writeback and so on, so we need to use truncate_inode_pages_range() instead of invalidate_mapping_pages() to make sure meta_inode page will be dropped.

  Fixes: 6aa58d8ad20a ("f2fs: readahead encrypted block during GC")
  Fixes: e3b49ea36802 ("f2fs: invalidate META_MAPPING before IPU/DIO write")
  Signed-off-by: Chao Yu <chao@kernel.org>
  Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  Signed-off-by: Sasha Levin <sashal@kernel.org>
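  A sketch of the stronger invalidation (the helper name is used for illustration only; the actual patch spans three files):

    static void f2fs_truncate_meta_inode_pages_sketch(struct f2fs_sb_info *sbi,
    						   block_t blkaddr, unsigned int cnt)
    {
    	struct address_space *mapping = META_MAPPING(sbi);
    	loff_t from = (loff_t)blkaddr << PAGE_SHIFT;
    	loff_t to = (loff_t)(blkaddr + cnt) << PAGE_SHIFT;

    	/*
    	 * Unlike invalidate_mapping_pages(), truncate_inode_pages_range()
    	 * waits on locked/writeback pages and drops dirty ones, so no stale
    	 * copy of the old block content can survive in the meta mapping for
    	 * the GC path to read back.
    	 */
    	truncate_inode_pages_range(mapping, from, to - 1);
    }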