summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* cifs: fix confusing debug messagePaulo Alcantara2023-01-071-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | commit a85ceafd41927e41a4103d228a993df7edd8823b upstream. Since rc was initialised to -ENOMEM in cifs_get_smb_ses(), when an existing smb session was found, free_xid() would be called and then print CIFS: fs/cifs/connect.c: Existing tcp session with server found CIFS: fs/cifs/connect.c: VFS: in cifs_get_smb_ses as Xid: 44 with uid: 0 CIFS: fs/cifs/connect.c: Existing smb sess found (status=1) CIFS: fs/cifs/connect.c: VFS: leaving cifs_get_smb_ses (xid = 44) rc = -12 Fix this by initialising rc to 0 and then let free_xid() print this instead CIFS: fs/cifs/connect.c: Existing tcp session with server found CIFS: fs/cifs/connect.c: VFS: in cifs_get_smb_ses as Xid: 14 with uid: 0 CIFS: fs/cifs/connect.c: Existing smb sess found (status=1) CIFS: fs/cifs/connect.c: VFS: leaving cifs_get_smb_ses (xid = 14) rc = 0 Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* pnode: terminate at peers of sourceChristian Brauner2023-01-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 11933cf1d91d57da9e5c53822a540bbdc2656c16 upstream. The propagate_mnt() function handles mount propagation when creating mounts and propagates the source mount tree @source_mnt to all applicable nodes of the destination propagation mount tree headed by @dest_mnt. Unfortunately it contains a bug where it fails to terminate at peers of @source_mnt when looking up copies of the source mount that become masters for copies of the source mount tree mounted on top of slaves in the destination propagation tree causing a NULL dereference. Once the mechanics of the bug are understood it's easy to trigger. Because of unprivileged user namespaces it is available to unprivileged users. While fixing this bug we've gotten confused multiple times due to unclear terminology or missing concepts. So let's start this with some clarifications: * The terms "master" or "peer" denote a shared mount. A shared mount belongs to a peer group. * A peer group is a set of shared mounts that propagate to each other. They are identified by a peer group id. The peer group id is available in @shared_mnt->mnt_group_id. Shared mounts within the same peer group have the same peer group id. The peers in a peer group can be reached via @shared_mnt->mnt_share. * The terms "slave mount" or "dependent mount" denote a mount that receives propagation from a peer in a peer group. IOW, shared mounts may have slave mounts and slave mounts have shared mounts as their master. Slave mounts of a given peer in a peer group are listed on that peers slave list available at @shared_mnt->mnt_slave_list. * The term "master mount" denotes a mount in a peer group. IOW, it denotes a shared mount or a peer mount in a peer group. The term "master mount" - or "master" for short - is mostly used when talking in the context of slave mounts that receive propagation from a master mount. A master mount of a slave identifies the closest peer group a slave mount receives propagation from. The master mount of a slave can be identified via @slave_mount->mnt_master. Different slaves may point to different masters in the same peer group. * Multiple peers in a peer group can have non-empty ->mnt_slave_lists. Non-empty ->mnt_slave_lists of peers don't intersect. Consequently, to ensure all slave mounts of a peer group are visited the ->mnt_slave_lists of all peers in a peer group have to be walked. * Slave mounts point to a peer in the closest peer group they receive propagation from via @slave_mnt->mnt_master (see above). Together with these peers they form a propagation group (see below). The closest peer group can thus be identified through the peer group id @slave_mnt->mnt_master->mnt_group_id of the peer/master that a slave mount receives propagation from. * A shared-slave mount is a slave mount to a peer group pg1 while also a peer in another peer group pg2. IOW, a peer group may receive propagation from another peer group. If a peer group pg1 is a slave to another peer group pg2 then all peers in peer group pg1 point to the same peer in peer group pg2 via ->mnt_master. IOW, all peers in peer group pg1 appear on the same ->mnt_slave_list. IOW, they cannot be slaves to different peer groups. * A pure slave mount is a slave mount that is a slave to a peer group but is not a peer in another peer group. * A propagation group denotes the set of mounts consisting of a single peer group pg1 and all slave mounts and shared-slave mounts that point to a peer in that peer group via ->mnt_master. IOW, all slave mounts such that @slave_mnt->mnt_master->mnt_group_id is equal to @shared_mnt->mnt_group_id. The concept of a propagation group makes it easier to talk about a single propagation level in a propagation tree. For example, in propagate_mnt() the immediate peers of @dest_mnt and all slaves of @dest_mnt's peer group form a propagation group propg1. So a shared-slave mount that is a slave in propg1 and that is a peer in another peer group pg2 forms another propagation group propg2 together with all slaves that point to that shared-slave mount in their ->mnt_master. * A propagation tree refers to all mounts that receive propagation starting from a specific shared mount. For example, for propagate_mnt() @dest_mnt is the start of a propagation tree. The propagation tree ecompasses all mounts that receive propagation from @dest_mnt's peer group down to the leafs. With that out of the way let's get to the actual algorithm. We know that @dest_mnt is guaranteed to be a pure shared mount or a shared-slave mount. This is guaranteed by a check in attach_recursive_mnt(). So propagate_mnt() will first propagate the source mount tree to all peers in @dest_mnt's peer group: for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) { ret = propagate_one(n); if (ret) goto out; } Notice, that the peer propagation loop of propagate_mnt() doesn't propagate @dest_mnt itself. @dest_mnt is mounted directly in attach_recursive_mnt() after we propagated to the destination propagation tree. The mount that will be mounted on top of @dest_mnt is @source_mnt. This copy was created earlier even before we entered attach_recursive_mnt() and doesn't concern us a lot here. It's just important to notice that when propagate_mnt() is called @source_mnt will not yet have been mounted on top of @dest_mnt. Thus, @source_mnt->mnt_parent will either still point to @source_mnt or - in the case @source_mnt is moved and thus already attached - still to its former parent. For each peer @m in @dest_mnt's peer group propagate_one() will create a new copy of the source mount tree and mount that copy @child on @m such that @child->mnt_parent points to @m after propagate_one() returns. propagate_one() will stash the last destination propagation node @m in @last_dest and the last copy it created for the source mount tree in @last_source. Hence, if we call into propagate_one() again for the next destination propagation node @m, @last_dest will point to the previous destination propagation node and @last_source will point to the previous copy of the source mount tree and mounted on @last_dest. Each new copy of the source mount tree is created from the previous copy of the source mount tree. This will become important later. The peer loop in propagate_mnt() is straightforward. We iterate through the peers copying and updating @last_source and @last_dest as we go through them and mount each copy of the source mount tree @child on a peer @m in @dest_mnt's peer group. After propagate_mnt() handled the peers in @dest_mnt's peer group propagate_mnt() will propagate the source mount tree down the propagation tree that @dest_mnt's peer group propagates to: for (m = next_group(dest_mnt, dest_mnt); m; m = next_group(m, dest_mnt)) { /* everything in that slave group */ n = m; do { ret = propagate_one(n); if (ret) goto out; n = next_peer(n); } while (n != m); } The next_group() helper will recursively walk the destination propagation tree, descending into each propagation group of the propagation tree. The important part is that it takes care to propagate the source mount tree to all peers in the peer group of a propagation group before it propagates to the slaves to those peers in the propagation group. IOW, it creates and mounts copies of the source mount tree that become masters before it creates and mounts copies of the source mount tree that become slaves to these masters. It is important to remember that propagating the source mount tree to each mount @m in the destination propagation tree simply means that we create and mount new copies @child of the source mount tree on @m such that @child->mnt_parent points to @m. Since we know that each node @m in the destination propagation tree headed by @dest_mnt's peer group will be overmounted with a copy of the source mount tree and since we know that the propagation properties of each copy of the source mount tree we create and mount at @m will mostly mirror the propagation properties of @m. We can use that information to create and mount the copies of the source mount tree that become masters before their slaves. The easy case is always when @m and @last_dest are peers in a peer group of a given propagation group. In that case we know that we can simply copy @last_source without having to figure out what the master for the new copy @child of the source mount tree needs to be as we've done that in a previous call to propagate_one(). The hard case is when we're dealing with a slave mount or a shared-slave mount @m in a destination propagation group that we need to create and mount a copy of the source mount tree on. For each propagation group in the destination propagation tree we propagate the source mount tree to we want to make sure that the copies @child of the source mount tree we create and mount on slaves @m pick an ealier copy of the source mount tree that we mounted on a master @m of the destination propagation group as their master. This is a mouthful but as far as we can tell that's the core of it all. But, if we keep track of the masters in the destination propagation tree @m we can use the information to find the correct master for each copy of the source mount tree we create and mount at the slaves in the destination propagation tree @m. Let's walk through the base case as that's still fairly easy to grasp. If we're dealing with the first slave in the propagation group that @dest_mnt is in then we don't yet have marked any masters in the destination propagation tree. We know the master for the first slave to @dest_mnt's peer group is simple @dest_mnt. So we expect this algorithm to yield a copy of the source mount tree that was mounted on a peer in @dest_mnt's peer group as the master for the copy of the source mount tree we want to mount at the first slave @m: for (n = m; ; n = p) { p = n->mnt_master; if (p == dest_master || IS_MNT_MARKED(p)) break; } For the first slave we walk the destination propagation tree all the way up to a peer in @dest_mnt's peer group. IOW, the propagation hierarchy can be walked by walking up the @mnt->mnt_master hierarchy of the destination propagation tree @m. We will ultimately find a peer in @dest_mnt's peer group and thus ultimately @dest_mnt->mnt_master. Btw, here the assumption we listed at the beginning becomes important. Namely, that peers in a peer group pg1 that are slaves in another peer group pg2 appear on the same ->mnt_slave_list. IOW, all slaves who are peers in peer group pg1 point to the same peer in peer group pg2 via their ->mnt_master. Otherwise the termination condition in the code above would be wrong and next_group() would be broken too. So the first iteration sets: n = m; p = n->mnt_master; such that @p now points to a peer or @dest_mnt itself. We walk up one more level since we don't have any marked mounts. So we end up with: n = dest_mnt; p = dest_mnt->mnt_master; If @dest_mnt's peer group is not slave to another peer group then @p is now NULL. If @dest_mnt's peer group is a slave to another peer group then @p now points to @dest_mnt->mnt_master points which is a master outside the propagation tree we're dealing with. Now we need to figure out the master for the copy of the source mount tree we're about to create and mount on the first slave of @dest_mnt's peer group: do { struct mount *parent = last_source->mnt_parent; if (last_source == first_source) break; done = parent->mnt_master == p; if (done && peers(n, parent)) break; last_source = last_source->mnt_master; } while (!done); We know that @last_source->mnt_parent points to @last_dest and @last_dest is the last peer in @dest_mnt's peer group we propagated to in the peer loop in propagate_mnt(). Consequently, @last_source is the last copy we created and mount on that last peer in @dest_mnt's peer group. So @last_source is the master we want to pick. We know that @last_source->mnt_parent->mnt_master points to @last_dest->mnt_master. We also know that @last_dest->mnt_master is either NULL or points to a master outside of the destination propagation tree and so does @p. Hence: done = parent->mnt_master == p; is trivially true in the base condition. We also know that for the first slave mount of @dest_mnt's peer group that @last_dest either points @dest_mnt itself because it was initialized to: last_dest = dest_mnt; at the beginning of propagate_mnt() or it will point to a peer of @dest_mnt in its peer group. In both cases it is guaranteed that on the first iteration @n and @parent are peers (Please note the check for peers here as that's important.): if (done && peers(n, parent)) break; So, as we expected, we select @last_source, which referes to the last copy of the source mount tree we mounted on the last peer in @dest_mnt's peer group, as the master of the first slave in @dest_mnt's peer group. The rest is taken care of by clone_mnt(last_source, ...). We'll skip over that part otherwise this becomes a blogpost. At the end of propagate_mnt() we now mark @m->mnt_master as the first master in the destination propagation tree that is distinct from @dest_mnt->mnt_master. IOW, we mark @dest_mnt itself as a master. By marking @dest_mnt or one of it's peers we are able to easily find it again when we later lookup masters for other copies of the source mount tree we mount copies of the source mount tree on slaves @m to @dest_mnt's peer group. This, in turn allows us to find the master we selected for the copies of the source mount tree we mounted on master in the destination propagation tree again. The important part is to realize that the code makes use of the fact that the last copy of the source mount tree stashed in @last_source was mounted on top of the previous destination propagation node @last_dest. What this means is that @last_source allows us to walk the destination propagation hierarchy the same way each destination propagation node @m does. If we take @last_source, which is the copy of @source_mnt we have mounted on @last_dest in the previous iteration of propagate_one(), then we know @last_source->mnt_parent points to @last_dest but we also know that as we walk through the destination propagation tree that @last_source->mnt_master will point to an earlier copy of the source mount tree we mounted one an earlier destination propagation node @m. IOW, @last_source->mnt_parent will be our hook into the destination propagation tree and each consecutive @last_source->mnt_master will lead us to an earlier propagation node @m via @last_source->mnt_master->mnt_parent. Hence, by walking up @last_source->mnt_master, each of which is mounted on a node that is a master @m in the destination propagation tree we can also walk up the destination propagation hierarchy. So, for each new destination propagation node @m we use the previous copy of @last_source and the fact it's mounted on the previous propagation node @last_dest via @last_source->mnt_master->mnt_parent to determine what the master of the new copy of @last_source needs to be. The goal is to find the _closest_ master that the new copy of the source mount tree we are about to create and mount on a slave @m in the destination propagation tree needs to pick. IOW, we want to find a suitable master in the propagation group. As the propagation structure of the source mount propagation tree we create mirrors the propagation structure of the destination propagation tree we can find @m's closest master - i.e., a marked master - which is a peer in the closest peer group that @m receives propagation from. We store that closest master of @m in @p as before and record the slave to that master in @n We then search for this master @p via @last_source by walking up the master hierarchy starting from the last copy of the source mount tree stored in @last_source that we created and mounted on the previous destination propagation node @m. We will try to find the master by walking @last_source->mnt_master and by comparing @last_source->mnt_master->mnt_parent->mnt_master to @p. If we find @p then we can figure out what earlier copy of the source mount tree needs to be the master for the new copy of the source mount tree we're about to create and mount at the current destination propagation node @m. If @last_source->mnt_master->mnt_parent and @n are peers then we know that the closest master they receive propagation from is @last_source->mnt_master->mnt_parent->mnt_master. If not then the closest immediate peer group that they receive propagation from must be one level higher up. This builds on the earlier clarification at the beginning that all peers in a peer group which are slaves of other peer groups all point to the same ->mnt_master, i.e., appear on the same ->mnt_slave_list, of the closest peer group that they receive propagation from. However, terminating the walk has corner cases. If the closest marked master for a given destination node @m cannot be found by walking up the master hierarchy via @last_source->mnt_master then we need to terminate the walk when we encounter @source_mnt again. This isn't an arbitrary termination. It simply means that the new copy of the source mount tree we're about to create has a copy of the source mount tree we created and mounted on a peer in @dest_mnt's peer group as its master. IOW, @source_mnt is the peer in the closest peer group that the new copy of the source mount tree receives propagation from. We absolutely have to stop @source_mnt because @last_source->mnt_master either points outside the propagation hierarchy we're dealing with or it is NULL because @source_mnt isn't a shared-slave. So continuing the walk past @source_mnt would cause a NULL dereference via @last_source->mnt_master->mnt_parent. And so we have to stop the walk when we encounter @source_mnt again. One scenario where this can happen is when we first handled a series of slaves of @dest_mnt's peer group and then encounter peers in a new peer group that is a slave to @dest_mnt's peer group. We handle them and then we encounter another slave mount to @dest_mnt that is a pure slave to @dest_mnt's peer group. That pure slave will have a peer in @dest_mnt's peer group as its master. Consequently, the new copy of the source mount tree will need to have @source_mnt as it's master. So we walk the propagation hierarchy all the way up to @source_mnt based on @last_source->mnt_master. So terminate on @source_mnt, easy peasy. Except, that the check misses something that the rest of the algorithm already handles. If @dest_mnt has peers in it's peer group the peer loop in propagate_mnt(): for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) { ret = propagate_one(n); if (ret) goto out; } will consecutively update @last_source with each previous copy of the source mount tree we created and mounted at the previous peer in @dest_mnt's peer group. So after that loop terminates @last_source will point to whatever copy of the source mount tree was created and mounted on the last peer in @dest_mnt's peer group. Furthermore, if there is even a single additional peer in @dest_mnt's peer group then @last_source will __not__ point to @source_mnt anymore. Because, as we mentioned above, @dest_mnt isn't even handled in this loop but directly in attach_recursive_mnt(). So it can't even accidently come last in that peer loop. So the first time we handle a slave mount @m of @dest_mnt's peer group the copy of the source mount tree we create will make the __last copy of the source mount tree we created and mounted on the last peer in @dest_mnt's peer group the master of the new copy of the source mount tree we create and mount on the first slave of @dest_mnt's peer group__. But this means that the termination condition that checks for @source_mnt is wrong. The @source_mnt cannot be found anymore by propagate_one(). Instead it will find the last copy of the source mount tree we created and mounted for the last peer of @dest_mnt's peer group again. And that is a peer of @source_mnt not @source_mnt itself. IOW, we fail to terminate the loop correctly and ultimately dereference @last_source->mnt_master->mnt_parent. When @source_mnt's peer group isn't slave to another peer group then @last_source->mnt_master is NULL causing the splat below. For example, assume @dest_mnt is a pure shared mount and has three peers in its peer group: =================================================================================== mount-id mount-parent-id peer-group-id =================================================================================== (@dest_mnt) mnt_master[216] 309 297 shared:216 \ (@source_mnt) mnt_master[218]: 609 609 shared:218 (1) mnt_master[216]: 607 605 shared:216 \ (P1) mnt_master[218]: 624 607 shared:218 (2) mnt_master[216]: 576 574 shared:216 \ (P2) mnt_master[218]: 625 576 shared:218 (3) mnt_master[216]: 545 543 shared:216 \ (P3) mnt_master[218]: 626 545 shared:218 After this sequence has been processed @last_source will point to (P3), the copy generated for the third peer in @dest_mnt's peer group we handled. So the copy of the source mount tree (P4) we create and mount on the first slave of @dest_mnt's peer group: =================================================================================== mount-id mount-parent-id peer-group-id =================================================================================== mnt_master[216] 309 297 shared:216 / / (S0) mnt_slave 483 481 master:216 \ \ (P3) mnt_master[218] 626 545 shared:218 \ / \/ (P4) mnt_slave 627 483 master:218 will pick the last copy of the source mount tree (P3) as master, not (S0). When walking the propagation hierarchy via @last_source's master hierarchy we encounter (P3) but not (S0), i.e., @source_mnt. We can fix this in multiple ways: (1) By setting @last_source to @source_mnt after we processed the peers in @dest_mnt's peer group right after the peer loop in propagate_mnt(). (2) By changing the termination condition that relies on finding exactly @source_mnt to finding a peer of @source_mnt. (3) By only moving @last_source when we actually venture into a new peer group or some clever variant thereof. The first two options are minimally invasive and what we want as a fix. The third option is more intrusive but something we'd like to explore in the near future. This passes all LTP tests and specifically the mount propagation testsuite part of it. It also holds up against all known reproducers of this issues. Final words. First, this is a clever but __worringly__ underdocumented algorithm. There isn't a single detailed comment to be found in next_group(), propagate_one() or anywhere else in that file for that matter. This has been a giant pain to understand and work through and a bug like this is insanely difficult to fix without a detailed understanding of what's happening. Let's not talk about the amount of time that was sunk into fixing this. Second, all the cool kids with access to unshare --mount --user --map-root --propagation=unchanged are going to have a lot of fun. IOW, triggerable by unprivileged users while namespace_lock() lock is held. [ 115.848393] BUG: kernel NULL pointer dereference, address: 0000000000000010 [ 115.848967] #PF: supervisor read access in kernel mode [ 115.849386] #PF: error_code(0x0000) - not-present page [ 115.849803] PGD 0 P4D 0 [ 115.850012] Oops: 0000 [#1] PREEMPT SMP PTI [ 115.850354] CPU: 0 PID: 15591 Comm: mount Not tainted 6.1.0-rc7 #3 [ 115.850851] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 115.851510] RIP: 0010:propagate_one.part.0+0x7f/0x1a0 [ 115.851924] Code: 75 eb 4c 8b 05 c2 25 37 02 4c 89 ca 48 8b 4a 10 49 39 d0 74 1e 48 3b 81 e0 00 00 00 74 26 48 8b 92 e0 00 00 00 be 01 00 00 00 <48> 8b 4a 10 49 39 d0 75 e2 40 84 f6 74 38 4c 89 05 84 25 37 02 4d [ 115.853441] RSP: 0018:ffffb8d5443d7d50 EFLAGS: 00010282 [ 115.853865] RAX: ffff8e4d87c41c80 RBX: ffff8e4d88ded780 RCX: ffff8e4da4333a00 [ 115.854458] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8e4d88ded780 [ 115.855044] RBP: ffff8e4d88ded780 R08: ffff8e4da4338000 R09: ffff8e4da43388c0 [ 115.855693] R10: 0000000000000002 R11: ffffb8d540158000 R12: ffffb8d5443d7da8 [ 115.856304] R13: ffff8e4d88ded780 R14: 0000000000000000 R15: 0000000000000000 [ 115.856859] FS: 00007f92c90c9800(0000) GS:ffff8e4dfdc00000(0000) knlGS:0000000000000000 [ 115.857531] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 115.858006] CR2: 0000000000000010 CR3: 0000000022f4c002 CR4: 00000000000706f0 [ 115.858598] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 115.859393] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 115.860099] Call Trace: [ 115.860358] <TASK> [ 115.860535] propagate_mnt+0x14d/0x190 [ 115.860848] attach_recursive_mnt+0x274/0x3e0 [ 115.861212] path_mount+0x8c8/0xa60 [ 115.861503] __x64_sys_mount+0xf6/0x140 [ 115.861819] do_syscall_64+0x5b/0x80 [ 115.862117] ? do_faccessat+0x123/0x250 [ 115.862435] ? syscall_exit_to_user_mode+0x17/0x40 [ 115.862826] ? do_syscall_64+0x67/0x80 [ 115.863133] ? syscall_exit_to_user_mode+0x17/0x40 [ 115.863527] ? do_syscall_64+0x67/0x80 [ 115.863835] ? do_syscall_64+0x67/0x80 [ 115.864144] ? do_syscall_64+0x67/0x80 [ 115.864452] ? exc_page_fault+0x70/0x170 [ 115.864775] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 115.865187] RIP: 0033:0x7f92c92b0ebe [ 115.865480] Code: 48 8b 0d 75 4f 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 42 4f 0c 00 f7 d8 64 89 01 48 [ 115.866984] RSP: 002b:00007fff000aa728 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 [ 115.867607] RAX: ffffffffffffffda RBX: 000055a77888d6b0 RCX: 00007f92c92b0ebe [ 115.868240] RDX: 000055a77888d8e0 RSI: 000055a77888e6e0 RDI: 000055a77888e620 [ 115.868823] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 [ 115.869403] R10: 0000000000001000 R11: 0000000000000246 R12: 000055a77888e620 [ 115.869994] R13: 000055a77888d8e0 R14: 00000000ffffffff R15: 00007f92c93e4076 [ 115.870581] </TASK> [ 115.870763] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink qrtr snd_intel8x0 sunrpc snd_ac97_codec ac97_bus snd_pcm snd_timer intel_rapl_msr intel_rapl_common snd vboxguest intel_powerclamp video rapl joydev soundcore i2c_piix4 wmi fuse zram xfs vmwgfx crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic drm_ttm_helper ttm e1000 ghash_clmulni_intel serio_raw ata_generic pata_acpi scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath [ 115.875288] CR2: 0000000000000010 [ 115.875641] ---[ end trace 0000000000000000 ]--- [ 115.876135] RIP: 0010:propagate_one.part.0+0x7f/0x1a0 [ 115.876551] Code: 75 eb 4c 8b 05 c2 25 37 02 4c 89 ca 48 8b 4a 10 49 39 d0 74 1e 48 3b 81 e0 00 00 00 74 26 48 8b 92 e0 00 00 00 be 01 00 00 00 <48> 8b 4a 10 49 39 d0 75 e2 40 84 f6 74 38 4c 89 05 84 25 37 02 4d [ 115.878086] RSP: 0018:ffffb8d5443d7d50 EFLAGS: 00010282 [ 115.878511] RAX: ffff8e4d87c41c80 RBX: ffff8e4d88ded780 RCX: ffff8e4da4333a00 [ 115.879128] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8e4d88ded780 [ 115.879715] RBP: ffff8e4d88ded780 R08: ffff8e4da4338000 R09: ffff8e4da43388c0 [ 115.880359] R10: 0000000000000002 R11: ffffb8d540158000 R12: ffffb8d5443d7da8 [ 115.880962] R13: ffff8e4d88ded780 R14: 0000000000000000 R15: 0000000000000000 [ 115.881548] FS: 00007f92c90c9800(0000) GS:ffff8e4dfdc00000(0000) knlGS:0000000000000000 [ 115.882234] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 115.882713] CR2: 0000000000000010 CR3: 0000000022f4c002 CR4: 00000000000706f0 [ 115.883314] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 115.883966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fixes: f2ebb3a921c1 ("smarter propagate_mnt()") Fixes: 5ec0811d3037 ("propogate_mnt: Handle the first propogated copy being a slave") Cc: <stable@vger.kernel.org> Reported-by: Ditang Chen <ditang.c@gmail.com> Signed-off-by: Seth Forshee (Digital Ocean) <sforshee@kernel.org> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* hfsplus: fix bug causing custom uid and gid being unable to be assigned with ↵Aditya Garg2023-01-073-2/+8
| | | | | | | | | | | | | | | | mount commit 9f2b5debc07073e6dfdd774e3594d0224b991927 upstream. Despite specifying UID and GID in mount command, the specified UID and GID were not being assigned. This patch fixes this issue. Link: https://lkml.kernel.org/r/C0264BF5-059C-45CF-B8DA-3A3BD2C803A2@live.com Signed-off-by: Aditya Garg <gargaditya08@live.com> Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* reiserfs: Add missing calls to reiserfs_security_free()Roberto Sassu2023-01-072-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 572302af1258459e124437b8f3369357447afac7 upstream. Commit 57fe60df6241 ("reiserfs: add atomic addition of selinux attributes during inode creation") defined reiserfs_security_free() to free the name and value of a security xattr allocated by the active LSM through security_old_inode_init_security(). However, this function is not called in the reiserfs code. Thus, add a call to reiserfs_security_free() whenever reiserfs_security_init() is called, and initialize value to NULL, to avoid to call kfree() on an uninitialized pointer. Finally, remove the kfree() for the xattr name, as it is not allocated anymore. Fixes: 57fe60df6241 ("reiserfs: add atomic addition of selinux attributes during inode creation") Cc: stable@vger.kernel.org Cc: Jeff Mahoney <jeffm@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Mimi Zohar <zohar@linux.ibm.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Reviewed-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* orangefs: Fix kmemleak in orangefs_prepare_debugfs_help_string()Zhang Xiaoxu2023-01-071-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit d23417a5bf3a3afc55de5442eb46e1e60458b0a1 ] When insert and remove the orangefs module, then debug_help_string will be leaked: unreferenced object 0xffff8881652ba000 (size 4096): comm "insmod", pid 1701, jiffies 4294893639 (age 13218.530s) hex dump (first 32 bytes): 43 6c 69 65 6e 74 20 44 65 62 75 67 20 4b 65 79 Client Debug Key 77 6f 72 64 73 20 61 72 65 20 75 6e 6b 6e 6f 77 words are unknow backtrace: [<0000000004e6f8e3>] kmalloc_trace+0x27/0xa0 [<0000000006f75d85>] orangefs_prepare_debugfs_help_string+0x5e/0x480 [orangefs] [<0000000091270a2a>] _sub_I_65535_1+0x57/0xf70 [crc_itu_t] [<000000004b1ee1a3>] do_one_initcall+0x87/0x2a0 [<000000001d0614ae>] do_init_module+0xdf/0x320 [<00000000efef068c>] load_module+0x2f98/0x3330 [<000000006533b44d>] __do_sys_finit_module+0x113/0x1b0 [<00000000a0da6f99>] do_syscall_64+0x35/0x80 [<000000007790b19b>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 When remove the module, should always free debug_help_string. Should always free the allocated buffer when change the free_debug_help_string. Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* hfs: fix OOB Read in __hfs_brec_findZhangPeng2023-01-071-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 8d824e69d9f3fa3121b2dda25053bae71e2460d2 ] Syzbot reported a OOB read bug: ================================================================== BUG: KASAN: slab-out-of-bounds in hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 Read of size 1 at addr ffff88807eb62c4e by task kworker/u4:1/11 CPU: 1 PID: 11 Comm: kworker/u4:1 Not tainted 6.1.0-rc6-syzkaller-00308-g644e9524388a #0 Workqueue: writeback wb_workfn (flush-7:0) Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:284 print_report+0x107/0x1f0 mm/kasan/report.c:395 kasan_report+0xcd/0x100 mm/kasan/report.c:495 hfs_strcmp+0x117/0x190 fs/hfs/string.c:84 __hfs_brec_find+0x213/0x5c0 fs/hfs/bfind.c:75 hfs_brec_find+0x276/0x520 fs/hfs/bfind.c:138 hfs_write_inode+0x34c/0xb40 fs/hfs/inode.c:462 write_inode fs/fs-writeback.c:1440 [inline] If the input inode of hfs_write_inode() is incorrect: struct inode struct hfs_inode_info struct hfs_cat_key struct hfs_name u8 len # len is greater than HFS_NAMELEN(31) which is the maximum length of an HFS filename OOB read occurred: hfs_write_inode() hfs_brec_find() __hfs_brec_find() hfs_cat_keycmp() hfs_strcmp() # OOB read occurred due to len is too large Fix this by adding a Check on len in hfs_write_inode() before calling hfs_brec_find(). Link: https://lkml.kernel.org/r/20221130065959.2168236-1-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reported-by: <syzbot+e836ff7133ac02be825f@syzkaller.appspotmail.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* nilfs2: fix shift-out-of-bounds/overflow in nilfs_sb2_bad_offset()Ryusuke Konishi2023-01-071-4/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 610a2a3d7d8be3537458a378ec69396a76c385b6 ] Patch series "nilfs2: fix UBSAN shift-out-of-bounds warnings on mount time". The first patch fixes a bug reported by syzbot, and the second one fixes the remaining bug of the same kind. Although they are triggered by the same super block data anomaly, I divided it into the above two because the details of the issues and how to fix it are different. Both are required to eliminate the shift-out-of-bounds issues at mount time. This patch (of 2): If the block size exponent information written in an on-disk superblock is corrupted, nilfs_sb2_bad_offset helper function can trigger shift-out-of-bounds warning followed by a kernel panic (if panic_on_warn is set): shift exponent 38983 is too large for 64-bit type 'unsigned long long' Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106 ubsan_epilogue lib/ubsan.c:151 [inline] __ubsan_handle_shift_out_of_bounds+0x33d/0x3b0 lib/ubsan.c:322 nilfs_sb2_bad_offset fs/nilfs2/the_nilfs.c:449 [inline] nilfs_load_super_block+0xdf5/0xe00 fs/nilfs2/the_nilfs.c:523 init_nilfs+0xb7/0x7d0 fs/nilfs2/the_nilfs.c:577 nilfs_fill_super+0xb1/0x5d0 fs/nilfs2/super.c:1047 nilfs_mount+0x613/0x9b0 fs/nilfs2/super.c:1317 ... In addition, since nilfs_sb2_bad_offset() performs multiplication without considering the upper bound, the computation may overflow if the disk layout parameters are not normal. This fixes these issues by inserting preliminary sanity checks for those parameters and by converting the comparison from one involving multiplication and left bit-shifting to one using division and right bit-shifting. Link: https://lkml.kernel.org/r/20221027044306.42774-1-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/20221027044306.42774-2-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+e91619dd4c11c4960706@syzkaller.appspotmail.com Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* fs: jfs: fix shift-out-of-bounds in dbDiscardAGHoi Pok Wu2023-01-071-0/+5
| | | | | | | | | | | | [ Upstream commit 25e70c6162f207828dd405b432d8f2a98dbf7082 ] This should be applied to most URSAN bugs found recently by syzbot, by guarding the dbMount. As syzbot feeding rubbish into the bmap descriptor. Signed-off-by: Hoi Pok Wu <wuhoipok@gmail.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* udf: Avoid double brelse() in udf_rename()Shigeru Yoshida2023-01-071-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c791730f2554a9ebb8f18df9368dc27d4ebc38c2 ] syzbot reported a warning like below [1]: VFS: brelse: Trying to free free buffer WARNING: CPU: 2 PID: 7301 at fs/buffer.c:1145 __brelse+0x67/0xa0 ... Call Trace: <TASK> invalidate_bh_lru+0x99/0x150 smp_call_function_many_cond+0xe2a/0x10c0 ? generic_remap_file_range_prep+0x50/0x50 ? __brelse+0xa0/0xa0 ? __mutex_lock+0x21c/0x12d0 ? smp_call_on_cpu+0x250/0x250 ? rcu_read_lock_sched_held+0xb/0x60 ? lock_release+0x587/0x810 ? __brelse+0xa0/0xa0 ? generic_remap_file_range_prep+0x50/0x50 on_each_cpu_cond_mask+0x3c/0x80 blkdev_flush_mapping+0x13a/0x2f0 blkdev_put_whole+0xd3/0xf0 blkdev_put+0x222/0x760 deactivate_locked_super+0x96/0x160 deactivate_super+0xda/0x100 cleanup_mnt+0x222/0x3d0 task_work_run+0x149/0x240 ? task_work_cancel+0x30/0x30 do_exit+0xb29/0x2a40 ? reacquire_held_locks+0x4a0/0x4a0 ? do_raw_spin_lock+0x12a/0x2b0 ? mm_update_next_owner+0x7c0/0x7c0 ? rwlock_bug.part.0+0x90/0x90 ? zap_other_threads+0x234/0x2d0 do_group_exit+0xd0/0x2a0 __x64_sys_exit_group+0x3a/0x50 do_syscall_64+0x34/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd The cause of the issue is that brelse() is called on both ofibh.sbh and ofibh.ebh by udf_find_entry() when it returns NULL. However, brelse() is called by udf_rename(), too. So, b_count on buffer_head becomes unbalanced. This patch fixes the issue by not calling brelse() by udf_rename() when udf_find_entry() returns NULL. Link: https://syzkaller.appspot.com/bug?id=8297f45698159c6bca8a1f87dc983667c1a1c851 [1] Reported-by: syzbot+7902cd7684bc35306224@syzkaller.appspotmail.com Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221023095741.271430-1-syoshida@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* fs: jfs: fix shift-out-of-bounds in dbAllocAGDongliang Mu2023-01-071-6/+16
| | | | | | | | | | | | | | | | | | [ Upstream commit 898f706695682b9954f280d95e49fa86ffa55d08 ] Syzbot found a crash : UBSAN: shift-out-of-bounds in dbAllocAG. The underlying bug is the missing check of bmp->db_agl2size. The field can be greater than 64 and trigger the shift-out-of-bounds. Fix this bug by adding a check of bmp->db_agl2size in dbMount since this field is used in many following functions. The upper bound for this field is L2MAXL2SIZE - L2MAXAG, thanks for the help of Dave Kleikamp. Note that, for maintenance, I reorganized error handling code of dbMount. Reported-by: syzbot+15342c1aa6a00fb7a438@syzkaller.appspotmail.com Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* binfmt_misc: fix shift-out-of-bounds in check_special_flagsLiu Shixin2023-01-071-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 6a46bf558803dd2b959ca7435a5c143efe837217 ] UBSAN reported a shift-out-of-bounds warning: left shift of 1 by 31 places cannot be represented in type 'int' Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x8d/0xcf lib/dump_stack.c:106 ubsan_epilogue+0xa/0x44 lib/ubsan.c:151 __ubsan_handle_shift_out_of_bounds+0x1e7/0x208 lib/ubsan.c:322 check_special_flags fs/binfmt_misc.c:241 [inline] create_entry fs/binfmt_misc.c:456 [inline] bm_register_write+0x9d3/0xa20 fs/binfmt_misc.c:654 vfs_write+0x11e/0x580 fs/read_write.c:582 ksys_write+0xcf/0x120 fs/read_write.c:637 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x34/0x80 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x4194e1 Since the type of Node's flags is unsigned long, we should define these macros with same type too. Signed-off-by: Liu Shixin <liushixin2@huawei.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221102025123.1117184-1-liushixin2@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
* nfsd: under NFSv4.1, fix double svc_xprt_put on rpc_create failureDan Aloni2023-01-071-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 3bc8edc98bd43540dbe648e4ef91f443d6d20a24 ] On error situation `clp->cl_cb_conn.cb_xprt` should not be given a reference to the xprt otherwise both client cleanup and the error handling path of the caller call to put it. Better to delay handing over the reference to a later branch. [ 72.530665] refcount_t: underflow; use-after-free. [ 72.531933] WARNING: CPU: 0 PID: 173 at lib/refcount.c:28 refcount_warn_saturate+0xcf/0x120 [ 72.533075] Modules linked in: nfsd(OE) nfsv4(OE) nfsv3(OE) nfs(OE) lockd(OE) compat_nfs_ssc(OE) nfs_acl(OE) rpcsec_gss_krb5(OE) auth_rpcgss(OE) rpcrdma(OE) dns_resolver fscache netfs grace rdma_cm iw_cm ib_cm sunrpc(OE) mlx5_ib mlx5_core mlxfw pci_hyperv_intf ib_uverbs ib_core xt_MASQUERADE nf_conntrack_netlink nft_counter xt_addrtype nft_compat br_netfilter bridge stp llc nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set overlay nf_tables nfnetlink crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xfs serio_raw virtio_net virtio_blk net_failover failover fuse [last unloaded: sunrpc] [ 72.540389] CPU: 0 PID: 173 Comm: kworker/u16:5 Tainted: G OE 5.15.82-dan #1 [ 72.541511] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-3.module+el8.7.0+1084+97b81f61 04/01/2014 [ 72.542717] Workqueue: nfsd4_callbacks nfsd4_run_cb_work [nfsd] [ 72.543575] RIP: 0010:refcount_warn_saturate+0xcf/0x120 [ 72.544299] Code: 55 00 0f 0b 5d e9 01 50 98 00 80 3d 75 9e 39 08 00 0f 85 74 ff ff ff 48 c7 c7 e8 d1 60 8e c6 05 61 9e 39 08 01 e8 f6 51 55 00 <0f> 0b 5d e9 d9 4f 98 00 80 3d 4b 9e 39 08 00 0f 85 4c ff ff ff 48 [ 72.546666] RSP: 0018:ffffb3f841157cf0 EFLAGS: 00010286 [ 72.547393] RAX: 0000000000000026 RBX: ffff89ac6231d478 RCX: 0000000000000000 [ 72.548324] RDX: ffff89adb7c2c2c0 RSI: ffff89adb7c205c0 RDI: ffff89adb7c205c0 [ 72.549271] RBP: ffffb3f841157cf0 R08: 0000000000000000 R09: c0000000ffefffff [ 72.550209] R10: 0000000000000001 R11: ffffb3f841157ad0 R12: ffff89ac6231d180 [ 72.551142] R13: ffff89ac6231d478 R14: ffff89ac40c06180 R15: ffff89ac6231d4b0 [ 72.552089] FS: 0000000000000000(0000) GS:ffff89adb7c00000(0000) knlGS:0000000000000000 [ 72.553175] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 72.553934] CR2: 0000563a310506a8 CR3: 0000000109a66000 CR4: 0000000000350ef0 [ 72.554874] Call Trace: [ 72.555278] <TASK> [ 72.555614] svc_xprt_put+0xaf/0xe0 [sunrpc] [ 72.556276] nfsd4_process_cb_update.isra.11+0xb7/0x410 [nfsd] [ 72.557087] ? update_load_avg+0x82/0x610 [ 72.557652] ? cpuacct_charge+0x60/0x70 [ 72.558212] ? dequeue_entity+0xdb/0x3e0 [ 72.558765] ? queued_spin_unlock+0x9/0x20 [ 72.559358] nfsd4_run_cb_work+0xfc/0x270 [nfsd] [ 72.560031] process_one_work+0x1df/0x390 [ 72.560600] worker_thread+0x37/0x3b0 [ 72.561644] ? process_one_work+0x390/0x390 [ 72.562247] kthread+0x12f/0x150 [ 72.562710] ? set_kthread_struct+0x50/0x50 [ 72.563309] ret_from_fork+0x22/0x30 [ 72.563818] </TASK> [ 72.564189] ---[ end trace 031117b1c72ec616 ]--- [ 72.566019] list_add corruption. next->prev should be prev (ffff89ac4977e538), but was ffff89ac4763e018. (next=ffff89ac4763e018). [ 72.567647] ------------[ cut here ]------------ Fixes: a4abc6b12eb1 ("nfsd: Fix svc_xprt refcnt leak when setup callback client failed") Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn> Cc: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Dan Aloni <dan.aloni@vastdata.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* chardev: fix error handling in cdev_device_add()Yang Yingliang2023-01-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 11fa7fefe3d8fac7da56bc9aa3dd5fb3081ca797 ] While doing fault injection test, I got the following report: ------------[ cut here ]------------ kobject: '(null)' (0000000039956980): is not initialized, yet kobject_put() is being called. WARNING: CPU: 3 PID: 6306 at kobject_put+0x23d/0x4e0 CPU: 3 PID: 6306 Comm: 283 Tainted: G W 6.1.0-rc2-00005-g307c1086d7c9 #1253 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:kobject_put+0x23d/0x4e0 Call Trace: <TASK> cdev_device_add+0x15e/0x1b0 __iio_device_register+0x13b4/0x1af0 [industrialio] __devm_iio_device_register+0x22/0x90 [industrialio] max517_probe+0x3d8/0x6b4 [max517] i2c_device_probe+0xa81/0xc00 When device_add() is injected fault and returns error, if dev->devt is not set, cdev_add() is not called, cdev_del() is not needed. Fix this by checking dev->devt in error path. Fixes: 233ed09d7fda ("chardev: add helper function to register char devs with a struct device") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Link: https://lore.kernel.org/r/20221202030237.520280-1-yangyingliang@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* orangefs: Fix sysfs not cleanup when dev init failedZhang Xiaoxu2023-01-071-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit ea60a4ad0cf88b411cde6888b8c890935686ecd7 ] When the dev init failed, should cleanup the sysfs, otherwise, the module will never be loaded since can not create duplicate sysfs directory: sysfs: cannot create duplicate filename '/fs/orangefs' CPU: 1 PID: 6549 Comm: insmod Tainted: G W 6.0.0+ #44 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x34/0x44 sysfs_warn_dup.cold+0x17/0x24 sysfs_create_dir_ns+0x16d/0x180 kobject_add_internal+0x156/0x3a0 kobject_init_and_add+0xcf/0x120 orangefs_sysfs_init+0x7e/0x3a0 [orangefs] orangefs_init+0xfe/0x1000 [orangefs] do_one_initcall+0x87/0x2a0 do_init_module+0xdf/0x320 load_module+0x2f98/0x3330 __do_sys_finit_module+0x113/0x1b0 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 kobject_add_internal failed for orangefs with -EEXIST, don't try to register things with the same name in the same directory. Fixes: 2f83ace37181 ("orangefs: put register_chrdev immediately before register_filesystem") Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* NFSv4: Fix a deadlock between nfs4_open_recover_helper() and delegreturnTrond Myklebust2023-01-071-7/+12
| | | | | | | | | | | | | | | [ Upstream commit 51069e4aef6257b0454057359faed0ab0c9af083 ] If we're asked to recover open state while a delegation return is outstanding, then the state manager thread cannot use a cached open, so if the server returns a delegation, we can end up deadlocked behind the pending delegreturn. To avoid this problem, let's just ask the server not to give us a delegation unless we're explicitly reclaiming one. Fixes: be36e185bd26 ("NFSv4: nfs4_open_recover_helper() must set share access") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* NFSv4.2: Fix a memory stomp in decode_attr_security_labelTrond Myklebust2023-01-071-6/+4
| | | | | | | | | | | [ Upstream commit 43c1031f7110967c240cb6e922adcfc4b8899183 ] We must not change the value of label->len if it is zero, since that indicates we stored a label. Fixes: b4487b935452 ("nfs: Fix getxattr kernel panic and memory overflow") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* hfs: Fix OOB Write in hfs_asc2macZhangPeng2023-01-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit c53ed55cb275344086e32a7080a6b19cb183650b ] Syzbot reported a OOB Write bug: loop0: detected capacity change from 0 to 64 ================================================================== BUG: KASAN: slab-out-of-bounds in hfs_asc2mac+0x467/0x9a0 fs/hfs/trans.c:133 Write of size 1 at addr ffff88801848314e by task syz-executor391/3632 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:284 print_report+0x107/0x1f0 mm/kasan/report.c:395 kasan_report+0xcd/0x100 mm/kasan/report.c:495 hfs_asc2mac+0x467/0x9a0 fs/hfs/trans.c:133 hfs_cat_build_key+0x92/0x170 fs/hfs/catalog.c:28 hfs_lookup+0x1ab/0x2c0 fs/hfs/dir.c:31 lookup_open fs/namei.c:3391 [inline] open_last_lookups fs/namei.c:3481 [inline] path_openat+0x10e6/0x2df0 fs/namei.c:3710 do_filp_open+0x264/0x4f0 fs/namei.c:3740 If in->len is much larger than HFS_NAMELEN(31) which is the maximum length of an HFS filename, a OOB write could occur in hfs_asc2mac(). In that case, when the dst reaches the boundary, the srclen is still greater than 0, which causes a OOB write. Fix this by adding a check on dstlen in while() before writing to dst address. Link: https://lkml.kernel.org/r/20221202030038.1391945-1-zhangpeng362@huawei.com Fixes: 328b92278650 ("[PATCH] hfs: NLS support") Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com> Reported-by: <syzbot+dc3b1cf9111ab5fe98e7@syzkaller.appspotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* fs: sysv: Fix sysv_nblocks() returns wrong valueChen Zhongjin2023-01-071-1/+1
| | | | | | | | | | | | | | | | | | [ Upstream commit e0c49bd2b4d3cd1751491eb2d940bce968ac65e9 ] sysv_nblocks() returns 'blocks' rather than 'res', which only counting the number of triple-indirect blocks and causing sysv_getattr() gets a wrong result. [AV: this is actually a sysv counterpart of minixfs fix - 0fcd426de9d0 "[PATCH] minix block usage counting fix" in historical tree; mea culpa, should've thought to check fs/sysv back then...] Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
* libfs: add DEFINE_SIMPLE_ATTRIBUTE_SIGNED for signed valueAkinobu Mita2023-01-071-3/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 2e41f274f9aa71cdcc69dc1f26a3f9304a651804 ] Patch series "fix error when writing negative value to simple attribute files". The simple attribute files do not accept a negative value since the commit 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()"), but some attribute files want to accept a negative value. This patch (of 3): The simple attribute files do not accept a negative value since the commit 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()"), so we have to use a 64-bit value to write a negative value. This adds DEFINE_SIMPLE_ATTRIBUTE_SIGNED for a signed value. Link: https://lkml.kernel.org/r/20220919172418.45257-1-akinobu.mita@gmail.com Link: https://lkml.kernel.org/r/20220919172418.45257-2-akinobu.mita@gmail.com Fixes: 488dac0c9237 ("libfs: fix error cast of negative value in simple_attr_write()") Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Reported-by: Zhao Gongyi <zhaogongyi@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Wei Yongjun <weiyongjun1@huawei.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ocfs2: fix memory leak in ocfs2_stack_glue_init()Shang XiaoJing2023-01-071-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 13b6269dd022aaa69ca8d1df374ab327504121cf ] ocfs2_table_header should be free in ocfs2_stack_glue_init() if ocfs2_sysfs_init() failed, otherwise kmemleak will report memleak. BUG: memory leak unreferenced object 0xffff88810eeb5800 (size 128): comm "modprobe", pid 4507, jiffies 4296182506 (age 55.888s) hex dump (first 32 bytes): c0 40 14 a0 ff ff ff ff 00 00 00 00 01 00 00 00 .@.............. 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<000000001e59e1cd>] __register_sysctl_table+0xca/0xef0 [<00000000c04f70f7>] 0xffffffffa0050037 [<000000001bd12912>] do_one_initcall+0xdb/0x480 [<0000000064f766c9>] do_init_module+0x1cf/0x680 [<000000002ba52db0>] load_module+0x6441/0x6f20 [<000000009772580d>] __do_sys_finit_module+0x12f/0x1c0 [<00000000380c1f22>] do_syscall_64+0x3f/0x90 [<000000004cf473bc>] entry_SYSCALL_64_after_hwframe+0x63/0xcd Link: https://lkml.kernel.org/r/41651ca1-432a-db34-eb97-d35744559de1@linux.alibaba.com Fixes: 3878f110f71a ("ocfs2: Move the hb_ctl_path sysctl into the stack glue.") Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* fs: don't audit the capability check in simple_xattr_list()Ondrej Mosnacek2023-01-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit e7eda157c4071cd1e69f4b1687b0fbe1ae5e6f46 ] The check being unconditional may lead to unwanted denials reported by LSMs when a process has the capability granted by DAC, but denied by an LSM. In the case of SELinux such denials are a problem, since they can't be effectively filtered out via the policy and when not silenced, they produce noise that may hide a true problem or an attack. Checking for the capability only if any trusted xattr is actually present wouldn't really address the issue, since calling listxattr(2) on such node on its own doesn't indicate an explicit attempt to see the trusted xattrs. Additionally, it could potentially leak the presence of trusted xattrs to an unprivileged user if they can check for the denials (e.g. through dmesg). Therefore, it's best (and simplest) to keep the check unconditional and instead use ns_capable_noaudit() that will silence any associated LSM denials. Fixes: 38f38657444d ("xattr: extract simple_xattr code from tmpfs") Reported-by: Martin Pitt <mpitt@redhat.com> Suggested-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
* pstore: Avoid kcore oops by vmap()ing with VM_IOREMAPStephen Boyd2023-01-071-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit e6b842741b4f39007215fd7e545cb55aa3d358a2 ] An oops can be induced by running 'cat /proc/kcore > /dev/null' on devices using pstore with the ram backend because kmap_atomic() assumes lowmem pages are accessible with __va(). Unable to handle kernel paging request at virtual address ffffff807ff2b000 Mem abort info: ESR = 0x96000006 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x06: level 2 translation fault Data abort info: ISV = 0, ISS = 0x00000006 CM = 0, WnR = 0 swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000081d87000 [ffffff807ff2b000] pgd=180000017fe18003, p4d=180000017fe18003, pud=180000017fe18003, pmd=0000000000000000 Internal error: Oops: 96000006 [#1] PREEMPT SMP Modules linked in: dm_integrity CPU: 7 PID: 21179 Comm: perf Not tainted 5.15.67-10882-ge4eb2eb988cd #1 baa443fb8e8477896a370b31a821eb2009f9bfba Hardware name: Google Lazor (rev3 - 8) (DT) pstate: a0400009 (NzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : __memcpy+0x110/0x260 lr : vread+0x194/0x294 sp : ffffffc013ee39d0 x29: ffffffc013ee39f0 x28: 0000000000001000 x27: ffffff807ff2b000 x26: 0000000000001000 x25: ffffffc0085a2000 x24: ffffff802d4b3000 x23: ffffff80f8a60000 x22: ffffff802d4b3000 x21: ffffffc0085a2000 x20: ffffff8080b7bc68 x19: 0000000000001000 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: ffffffd3073f2e60 x14: ffffffffad588000 x13: 0000000000000000 x12: 0000000000000001 x11: 00000000000001a2 x10: 00680000fff2bf0b x9 : 03fffffff807ff2b x8 : 0000000000000001 x7 : 0000000000000000 x6 : 0000000000000000 x5 : ffffff802d4b4000 x4 : ffffff807ff2c000 x3 : ffffffc013ee3a78 x2 : 0000000000001000 x1 : ffffff807ff2b000 x0 : ffffff802d4b3000 Call trace: __memcpy+0x110/0x260 read_kcore+0x584/0x778 proc_reg_read+0xb4/0xe4 During early boot, memblock reserves the pages for the ramoops reserved memory node in DT that would otherwise be part of the direct lowmem mapping. Pstore's ram backend reuses those reserved pages to change the memory type (writeback or non-cached) by passing the pages to vmap() (see pfn_to_page() usage in persistent_ram_vmap() for more details) with specific flags. When read_kcore() starts iterating over the vmalloc region, it runs over the virtual address that vmap() returned for ramoops. In aligned_vread() the virtual address is passed to vmalloc_to_page() which returns the page struct for the reserved lowmem area. That lowmem page is passed to kmap_atomic(), which effectively calls page_to_virt() that assumes a lowmem page struct must be directly accessible with __va() and friends. These pages are mapped via vmap() though, and the lowmem mapping was never made, so accessing them via the lowmem virtual address oopses like above. Let's side-step this problem by passing VM_IOREMAP to vmap(). This will tell vread() to not include the ramoops region in the kcore. Instead the area will look like a bunch of zeros. The alternative is to teach kmap() about vmalloc areas that intersect with lowmem. Presumably such a change isn't a one-liner, and there isn't much interest in inspecting the ramoops region in kcore files anyway, so the most expedient route is taken for now. Cc: Brian Geffon <bgeffon@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 404a6043385d ("staging: android: persistent_ram: handle reserving and mapping memory") Signed-off-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221205233136.3420802-1-swboyd@chromium.org Signed-off-by: Sasha Levin <sashal@kernel.org>
* udf: Fix extending file within last blockJan Kara2023-01-071-15/+17
| | | | | | | | | | | | | | | commit 1f3868f06855c97a4954c99b36f3fc9eb8f60326 upstream. When extending file within last block it can happen that the extent is already rounded to the blocksize and thus contains the offset we want to grow up to. In such case we would mistakenly expand the last extent and make it one block longer than it should be, exposing unallocated block in a file and causing data corruption. Fix the problem by properly detecting this case and bailing out. CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* udf: Do not bother looking for prealloc extents if i_lenExtents matches i_sizeJan Kara2023-01-071-1/+2
| | | | | | | | | | | | commit 6ad53f0f71c52871202a7bf096feb2c59db33fc5 upstream. If rounded block-rounded i_lenExtents matches block rounded i_size, there are no preallocation extents. Do not bother walking extent linked list. CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* udf: Fix preallocation discarding at indirect extent boundaryJan Kara2023-01-071-32/+13
| | | | | | | | | | | | commit cfe4c1b25dd6d2f056afc00b7c98bcb3dd0b1fc3 upstream. When preallocation extent is the first one in the extent block, the code would corrupt extent tree header instead. Fix the problem and use udf_delete_aext() for deleting extent to avoid some code duplication. CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* udf: Drop unused arguments of udf_delete_aext()Jan Kara2023-01-073-9/+7
| | | | | | | | | | commit 6c1e4d06a3808dc67dbce2d631f4c12574567dd5 upstream. udf_delete_aext() uses its last two arguments only as local variables. Drop them. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* udf: Discard preallocation before extending file with a holeJan Kara2023-01-071-28/+18
| | | | | | | | | | | | | | | | | | | | | commit 16d0556568148bdcaa45d077cac9f8f7077cf70a upstream. When extending file with a hole, we tried to preserve existing preallocation for the file. However that is not very useful and complicates code because the previous extent may need to be rounded to block boundary as well (which we forgot to do thus causing data corruption for sequence like: xfs_io -f -c "pwrite 0x75e63 11008" -c "truncate 0x7b24b" \ -c "truncate 0xabaa3" -c "pwrite 0xac70b 22954" \ -c "pwrite 0x93a43 11358" -c "pwrite 0xb8e65 52211" file with 512-byte block size. Just discard preallocation before extending file to simplify things and also fix this data corruption. CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* nilfs2: fix NULL pointer dereference in nilfs_palloc_commit_free_entry()ZhangPeng2022-12-081-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f0a0ccda18d6fd826d7c7e7ad48a6ed61c20f8b4 upstream. Syzbot reported a null-ptr-deref bug: NILFS (loop0): segctord starting. Construction interval = 5 seconds, CP frequency < 30 seconds general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] CPU: 1 PID: 3603 Comm: segctord Not tainted 6.1.0-rc2-syzkaller-00105-gb229b6ca5abb #0 Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 RIP: 0010:nilfs_palloc_commit_free_entry+0xe5/0x6b0 fs/nilfs2/alloc.c:608 Code: 00 00 00 00 fc ff df 80 3c 02 00 0f 85 cd 05 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b 73 08 49 8d 7e 10 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 26 05 00 00 49 8b 46 10 be a6 00 00 00 48 c7 c7 RSP: 0018:ffffc90003dff830 EFLAGS: 00010212 RAX: dffffc0000000000 RBX: ffff88802594e218 RCX: 000000000000000d RDX: 0000000000000002 RSI: 0000000000002000 RDI: 0000000000000010 RBP: ffff888071880222 R08: 0000000000000005 R09: 000000000000003f R10: 000000000000000d R11: 0000000000000000 R12: ffff888071880158 R13: ffff88802594e220 R14: 0000000000000000 R15: 0000000000000004 FS: 0000000000000000(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb1c08316a8 CR3: 0000000018560000 CR4: 0000000000350ee0 Call Trace: <TASK> nilfs_dat_commit_free fs/nilfs2/dat.c:114 [inline] nilfs_dat_commit_end+0x464/0x5f0 fs/nilfs2/dat.c:193 nilfs_dat_commit_update+0x26/0x40 fs/nilfs2/dat.c:236 nilfs_btree_commit_update_v+0x87/0x4a0 fs/nilfs2/btree.c:1940 nilfs_btree_commit_propagate_v fs/nilfs2/btree.c:2016 [inline] nilfs_btree_propagate_v fs/nilfs2/btree.c:2046 [inline] nilfs_btree_propagate+0xa00/0xd60 fs/nilfs2/btree.c:2088 nilfs_bmap_propagate+0x73/0x170 fs/nilfs2/bmap.c:337 nilfs_collect_file_data+0x45/0xd0 fs/nilfs2/segment.c:568 nilfs_segctor_apply_buffers+0x14a/0x470 fs/nilfs2/segment.c:1018 nilfs_segctor_scan_file+0x3f4/0x6f0 fs/nilfs2/segment.c:1067 nilfs_segctor_collect_blocks fs/nilfs2/segment.c:1197 [inline] nilfs_segctor_collect fs/nilfs2/segment.c:1503 [inline] nilfs_segctor_do_construct+0x12fc/0x6af0 fs/nilfs2/segment.c:2045 nilfs_segctor_construct+0x8e3/0xb30 fs/nilfs2/segment.c:2379 nilfs_segctor_thread_construct fs/nilfs2/segment.c:2487 [inline] nilfs_segctor_thread+0x3c3/0xf30 fs/nilfs2/segment.c:2570 kthread+0x2e4/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306 </TASK> ... If DAT metadata file is corrupted on disk, there is a case where req->pr_desc_bh is NULL and blocknr is 0 at nilfs_dat_commit_end() during a b-tree operation that cascadingly updates ancestor nodes of the b-tree, because nilfs_dat_commit_alloc() for a lower level block can initialize the blocknr on the same DAT entry between nilfs_dat_prepare_end() and nilfs_dat_commit_end(). If this happens, nilfs_dat_commit_end() calls nilfs_dat_commit_free() without valid buffer heads in req->pr_desc_bh and req->pr_bitmap_bh, and causes the NULL pointer dereference above in nilfs_palloc_commit_free_entry() function, which leads to a crash. Fix this by adding a NULL check on req->pr_desc_bh and req->pr_bitmap_bh before nilfs_palloc_commit_free_entry() in nilfs_dat_commit_free(). This also calls nilfs_error() in that case to notify that there is a fatal flaw in the filesystem metadata and prevent further operations. Link: https://lkml.kernel.org/r/00000000000097c20205ebaea3d6@google.com Link: https://lkml.kernel.org/r/20221114040441.1649940-1-zhangpeng362@huawei.com Link: https://lkml.kernel.org/r/20221119120542.17204-1-konishi.ryusuke@gmail.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+ebe05ee8e98f755f61d0@syzkaller.appspotmail.com Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: qgroup: fix sleep from invalid context bug in btrfs_qgroup_inherit()ChenXiaoSong2022-12-081-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit f7e942b5bb35d8e3af54053d19a6bf04143a3955 ] Syzkaller reported BUG as follows: BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274 Call Trace: <TASK> dump_stack_lvl+0xcd/0x134 __might_resched.cold+0x222/0x26b kmem_cache_alloc+0x2e7/0x3c0 update_qgroup_limit_item+0xe1/0x390 btrfs_qgroup_inherit+0x147b/0x1ee0 create_subvol+0x4eb/0x1710 btrfs_mksubvol+0xfe5/0x13f0 __btrfs_ioctl_snap_create+0x2b0/0x430 btrfs_ioctl_snap_create_v2+0x25a/0x520 btrfs_ioctl+0x2a1c/0x5ce0 __x64_sys_ioctl+0x193/0x200 do_syscall_64+0x35/0x80 Fix this by calling qgroup_dirty() on @dstqgroup, and update limit item in btrfs_run_qgroups() later outside of the spinlock context. CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* nilfs2: fix nilfs_sufile_mark_dirty() not set segment usage as dirtyChen Zhongjin2022-12-081-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 512c5ca01a3610ab14ff6309db363de51f1c13a6 upstream. When extending segments, nilfs_sufile_alloc() is called to get an unassigned segment, then mark it as dirty to avoid accidentally allocating the same segment in the future. But for some special cases such as a corrupted image it can be unreliable. If such corruption of the dirty state of the segment occurs, nilfs2 may reallocate a segment that is in use and pick the same segment for writing twice at the same time. This will cause the problem reported by syzkaller: https://syzkaller.appspot.com/bug?id=c7c4748e11ffcc367cef04f76e02e931833cbd24 This case started with segbuf1.segnum = 3, nextnum = 4 when constructed. It supposed segment 4 has already been allocated and marked as dirty. However the dirty state was corrupted and segment 4 usage was not dirty. For the first time nilfs_segctor_extend_segments() segment 4 was allocated again, which made segbuf2 and next segbuf3 had same segment 4. sb_getblk() will get same bh for segbuf2 and segbuf3, and this bh is added to both buffer lists of two segbuf. It makes the lists broken which causes NULL pointer dereference. Fix the problem by setting usage as dirty every time in nilfs_sufile_mark_dirty(), which is called during constructing current segment to be written out and before allocating next segment. [chenzhongjin@huawei.com: add lock protection per Ryusuke] Link: https://lkml.kernel.org/r/20221121091141.214703-1-chenzhongjin@huawei.com Link: https://lkml.kernel.org/r/20221118063304.140187-1-chenzhongjin@huawei.com Fixes: 9ff05123e3bf ("nilfs2: segment constructor") Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com> Reported-by: <syzbot+77e4f0...@syzkaller.appspotmail.com> Reported-by: Liu Shixin <liushixin2@huawei.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ntfs: check overflow when iterating ATTR_RECORDsHawkins Jiawei2022-11-251-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit 63095f4f3af59322bea984a6ae44337439348fe0 upstream. Kernel iterates over ATTR_RECORDs in mft record in ntfs_attr_find(). Because the ATTR_RECORDs are next to each other, kernel can get the next ATTR_RECORD from end address of current ATTR_RECORD, through current ATTR_RECORD length field. The problem is that during iteration, when kernel calculates the end address of current ATTR_RECORD, kernel may trigger an integer overflow bug in executing `a = (ATTR_RECORD*)((u8*)a + le32_to_cpu(a->length))`. This may wrap, leading to a forever iteration on 32bit systems. This patch solves it by adding some checks on calculating end address of current ATTR_RECORD during iteration. Link: https://lkml.kernel.org/r/20220831160935.3409-4-yin31149@gmail.com Link: https://lore.kernel.org/all/20220827105842.GM2030@kadam/ Signed-off-by: Hawkins Jiawei <yin31149@gmail.com> Suggested-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Anton Altaparmakov <anton@tuxera.com> Cc: chenxiaosong (A) <chenxiaosong2@huawei.com> Cc: syzkaller-bugs <syzkaller-bugs@googlegroups.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ntfs: fix out-of-bounds read in ntfs_attr_find()Hawkins Jiawei2022-11-251-4/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 36a4d82dddbbd421d2b8e79e1cab68c8126d5075 upstream. Kernel iterates over ATTR_RECORDs in mft record in ntfs_attr_find(). To ensure access on these ATTR_RECORDs are within bounds, kernel will do some checking during iteration. The problem is that during checking whether ATTR_RECORD's name is within bounds, kernel will dereferences the ATTR_RECORD name_offset field, before checking this ATTR_RECORD strcture is within bounds. This problem may result out-of-bounds read in ntfs_attr_find(), reported by Syzkaller: ================================================================== BUG: KASAN: use-after-free in ntfs_attr_find+0xc02/0xce0 fs/ntfs/attrib.c:597 Read of size 2 at addr ffff88807e352009 by task syz-executor153/3607 [...] Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:317 [inline] print_report.cold+0x2ba/0x719 mm/kasan/report.c:433 kasan_report+0xb1/0x1e0 mm/kasan/report.c:495 ntfs_attr_find+0xc02/0xce0 fs/ntfs/attrib.c:597 ntfs_attr_lookup+0x1056/0x2070 fs/ntfs/attrib.c:1193 ntfs_read_inode_mount+0x89a/0x2580 fs/ntfs/inode.c:1845 ntfs_fill_super+0x1799/0x9320 fs/ntfs/super.c:2854 mount_bdev+0x34d/0x410 fs/super.c:1400 legacy_get_tree+0x105/0x220 fs/fs_context.c:610 vfs_get_tree+0x89/0x2f0 fs/super.c:1530 do_new_mount fs/namespace.c:3040 [inline] path_mount+0x1326/0x1e20 fs/namespace.c:3370 do_mount fs/namespace.c:3383 [inline] __do_sys_mount fs/namespace.c:3591 [inline] __se_sys_mount fs/namespace.c:3568 [inline] __x64_sys_mount+0x27f/0x300 fs/namespace.c:3568 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd [...] </TASK> The buggy address belongs to the physical page: page:ffffea0001f8d400 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7e350 head:ffffea0001f8d400 order:3 compound_mapcount:0 compound_pincount:0 flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff) raw: 00fff00000010200 0000000000000000 dead000000000122 ffff888011842140 raw: 0000000000000000 0000000000040004 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88807e351f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88807e351f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88807e352000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88807e352080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88807e352100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== This patch solves it by moving the ATTR_RECORD strcture's bounds checking earlier, then checking whether ATTR_RECORD's name is within bounds. What's more, this patch also add some comments to improve its maintainability. Link: https://lkml.kernel.org/r/20220831160935.3409-3-yin31149@gmail.com Link: https://lore.kernel.org/all/1636796c-c85e-7f47-e96f-e074fee3c7d3@huawei.com/ Link: https://groups.google.com/g/syzkaller-bugs/c/t_XdeKPGTR4/m/LECAuIGcBgAJ Signed-off-by: chenxiaosong (A) <chenxiaosong2@huawei.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Hawkins Jiawei <yin31149@gmail.com> Reported-by: syzbot+5f8dcabe4a3b2c51c607@syzkaller.appspotmail.com Tested-by: syzbot+5f8dcabe4a3b2c51c607@syzkaller.appspotmail.com Cc: Anton Altaparmakov <anton@tuxera.com> Cc: syzkaller-bugs <syzkaller-bugs@googlegroups.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ntfs: fix use-after-free in ntfs_attr_find()Hawkins Jiawei2022-11-251-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit d85a1bec8e8d552ab13163ca1874dcd82f3d1550 upstream. Patch series "ntfs: fix bugs about Attribute", v2. This patchset fixes three bugs relative to Attribute in record: Patch 1 adds a sanity check to ensure that, attrs_offset field in first mft record loading from disk is within bounds. Patch 2 moves the ATTR_RECORD's bounds checking earlier, to avoid dereferencing ATTR_RECORD before checking this ATTR_RECORD is within bounds. Patch 3 adds an overflow checking to avoid possible forever loop in ntfs_attr_find(). Without patch 1 and patch 2, the kernel triggersa KASAN use-after-free detection as reported by Syzkaller. Although one of patch 1 or patch 2 can fix this, we still need both of them. Because patch 1 fixes the root cause, and patch 2 not only fixes the direct cause, but also fixes the potential out-of-bounds bug. This patch (of 3): Syzkaller reported use-after-free read as follows: ================================================================== BUG: KASAN: use-after-free in ntfs_attr_find+0xc02/0xce0 fs/ntfs/attrib.c:597 Read of size 2 at addr ffff88807e352009 by task syz-executor153/3607 [...] Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:317 [inline] print_report.cold+0x2ba/0x719 mm/kasan/report.c:433 kasan_report+0xb1/0x1e0 mm/kasan/report.c:495 ntfs_attr_find+0xc02/0xce0 fs/ntfs/attrib.c:597 ntfs_attr_lookup+0x1056/0x2070 fs/ntfs/attrib.c:1193 ntfs_read_inode_mount+0x89a/0x2580 fs/ntfs/inode.c:1845 ntfs_fill_super+0x1799/0x9320 fs/ntfs/super.c:2854 mount_bdev+0x34d/0x410 fs/super.c:1400 legacy_get_tree+0x105/0x220 fs/fs_context.c:610 vfs_get_tree+0x89/0x2f0 fs/super.c:1530 do_new_mount fs/namespace.c:3040 [inline] path_mount+0x1326/0x1e20 fs/namespace.c:3370 do_mount fs/namespace.c:3383 [inline] __do_sys_mount fs/namespace.c:3591 [inline] __se_sys_mount fs/namespace.c:3568 [inline] __x64_sys_mount+0x27f/0x300 fs/namespace.c:3568 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd [...] </TASK> The buggy address belongs to the physical page: page:ffffea0001f8d400 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7e350 head:ffffea0001f8d400 order:3 compound_mapcount:0 compound_pincount:0 flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff) raw: 00fff00000010200 0000000000000000 dead000000000122 ffff888011842140 raw: 0000000000000000 0000000000040004 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88807e351f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88807e351f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88807e352000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88807e352080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88807e352100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Kernel will loads $MFT/$DATA's first mft record in ntfs_read_inode_mount(). Yet the problem is that after loading, kernel doesn't check whether attrs_offset field is a valid value. To be more specific, if attrs_offset field is larger than bytes_allocated field, then it may trigger the out-of-bounds read bug(reported as use-after-free bug) in ntfs_attr_find(), when kernel tries to access the corresponding mft record's attribute. This patch solves it by adding the sanity check between attrs_offset field and bytes_allocated field, after loading the first mft record. Link: https://lkml.kernel.org/r/20220831160935.3409-1-yin31149@gmail.com Link: https://lkml.kernel.org/r/20220831160935.3409-2-yin31149@gmail.com Signed-off-by: Hawkins Jiawei <yin31149@gmail.com> Cc: Anton Altaparmakov <anton@tuxera.com> Cc: ChenXiaoSong <chenxiaosong2@huawei.com> Cc: syzkaller-bugs <syzkaller-bugs@googlegroups.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mm: fs: initialize fsdata passed to write_begin/write_end interfaceAlexander Potapenko2022-11-252-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1468c6f4558b1bcd92aa0400f2920f9dc7588402 upstream. Functions implementing the a_ops->write_end() interface accept the `void *fsdata` parameter that is supposed to be initialized by the corresponding a_ops->write_begin() (which accepts `void **fsdata`). However not all a_ops->write_begin() implementations initialize `fsdata` unconditionally, so it may get passed uninitialized to a_ops->write_end(), resulting in undefined behavior. Fix this by initializing fsdata with NULL before the call to write_begin(), rather than doing so in all possible a_ops implementations. This patch covers only the following cases found by running x86 KMSAN under syzkaller: - generic_perform_write() - cont_expand_zero() and generic_cont_expand_simple() - page_symlink() Other cases of passing uninitialized fsdata may persist in the codebase. Link: https://lkml.kernel.org/r/20220915150417.722975-43-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Ilya Leoshkevich <iii@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* gfs2: Switch from strlcpy to strscpyAndreas Gruenbacher2022-11-251-2/+4
| | | | | | | | | | | | | | | commit 204c0300c4e99707e9fb6e57840aa1127060e63f upstream. Switch from strlcpy to strscpy and make sure that @count is the size of the smaller of the source and destination buffers. This prevents reading beyond the end of the source buffer when the source string isn't null terminated. Found by a modified version of syzkaller. Suggested-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* gfs2: Check sb_bsize_shift after reading superblockAndrew Price2022-11-251-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 670f8ce56dd0632dc29a0322e188cc73ce3c6b92 upstream. Fuzzers like to scribble over sb_bsize_shift but in reality it's very unlikely that this field would be corrupted on its own. Nevertheless it should be checked to avoid the possibility of messy mount errors due to bad calculations. It's always a fixed value based on the block size so we can just check that it's the expected value. Tested with: mkfs.gfs2 -O -p lock_nolock /dev/vdb for i in 0 -1 64 65 32 33; do gfs2_edit -p sb field sb_bsize_shift $i /dev/vdb mount /dev/vdb /mnt/test && umount /mnt/test done Before this patch we get a withdraw after [ 76.413681] gfs2: fsid=loop0.0: fatal: invalid metadata block [ 76.413681] bh = 19 (type: exp=5, found=4) [ 76.413681] function = gfs2_meta_buffer, file = fs/gfs2/meta_io.c, line = 492 and with UBSAN configured we also get complaints like [ 76.373395] UBSAN: shift-out-of-bounds in fs/gfs2/ops_fstype.c:295:19 [ 76.373815] shift exponent 4294967287 is too large for 64-bit type 'long unsigned int' After the patch, these complaints don't appear, mount fails immediately and we get an explanation in dmesg. Reported-by: syzbot+dcf33a7aae997956fe06@syzkaller.appspotmail.com Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* nilfs2: fix use-after-free bug of ns_writer on remountRyusuke Konishi2022-11-252-9/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8cccf05fe857a18ee26e20d11a8455a73ffd4efd upstream. If a nilfs2 filesystem is downgraded to read-only due to metadata corruption on disk and is remounted read/write, or if emergency read-only remount is performed, detaching a log writer and synchronizing the filesystem can be done at the same time. In these cases, use-after-free of the log writer (hereinafter nilfs->ns_writer) can happen as shown in the scenario below: Task1 Task2 -------------------------------- ------------------------------ nilfs_construct_segment nilfs_segctor_sync init_wait init_waitqueue_entry add_wait_queue schedule nilfs_remount (R/W remount case) nilfs_attach_log_writer nilfs_detach_log_writer nilfs_segctor_destroy kfree finish_wait _raw_spin_lock_irqsave __raw_spin_lock_irqsave do_raw_spin_lock debug_spin_lock_before <-- use-after-free While Task1 is sleeping, nilfs->ns_writer is freed by Task2. After Task1 waked up, Task1 accesses nilfs->ns_writer which is already freed. This scenario diagram is based on the Shigeru Yoshida's post [1]. This patch fixes the issue by not detaching nilfs->ns_writer on remount so that this UAF race doesn't happen. Along with this change, this patch also inserts a few necessary read-only checks with superblock instance where only the ns_writer pointer was used to check if the filesystem is read-only. Link: https://syzkaller.appspot.com/bug?id=79a4c002e960419ca173d55e863bd09e8112df8b Link: https://lkml.kernel.org/r/20221103141759.1836312-1-syoshida@redhat.com [1] Link: https://lkml.kernel.org/r/20221104142959.28296-1-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+f816fa82f8783f7a02bb@syzkaller.appspotmail.com Reported-by: Shigeru Yoshida <syoshida@redhat.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* cifs: Fix wrong return value checking when GETFLAGSZhang Xiaoxu2022-11-251-2/+2
| | | | | | | | | | | | [ Upstream commit 92bbd67a55fee50743b42825d1c016e7fd5c79f9 ] The return value of CIFSGetExtAttr is negative, should be checked with -EOPNOTSUPP rather than EOPNOTSUPP. Fixes: 64a5cfa6db94 ("Allow setting per-file compression via SMB2/3") Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* udf: Fix a slab-out-of-bounds write bug in udf_find_entry()ZhangPeng2022-11-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c8af247de385ce49afabc3bf1cf4fd455c94bfe8 upstream. Syzbot reported a slab-out-of-bounds Write bug: loop0: detected capacity change from 0 to 2048 ================================================================== BUG: KASAN: slab-out-of-bounds in udf_find_entry+0x8a5/0x14f0 fs/udf/namei.c:253 Write of size 105 at addr ffff8880123ff896 by task syz-executor323/3610 CPU: 0 PID: 3610 Comm: syz-executor323 Not tainted 6.1.0-rc2-syzkaller-00105-gb229b6ca5abb #0 Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1b1/0x28e lib/dump_stack.c:106 print_address_description+0x74/0x340 mm/kasan/report.c:284 print_report+0x107/0x1f0 mm/kasan/report.c:395 kasan_report+0xcd/0x100 mm/kasan/report.c:495 kasan_check_range+0x2a7/0x2e0 mm/kasan/generic.c:189 memcpy+0x3c/0x60 mm/kasan/shadow.c:66 udf_find_entry+0x8a5/0x14f0 fs/udf/namei.c:253 udf_lookup+0xef/0x340 fs/udf/namei.c:309 lookup_open fs/namei.c:3391 [inline] open_last_lookups fs/namei.c:3481 [inline] path_openat+0x10e6/0x2df0 fs/namei.c:3710 do_filp_open+0x264/0x4f0 fs/namei.c:3740 do_sys_openat2+0x124/0x4e0 fs/open.c:1310 do_sys_open fs/open.c:1326 [inline] __do_sys_creat fs/open.c:1402 [inline] __se_sys_creat fs/open.c:1396 [inline] __x64_sys_creat+0x11f/0x160 fs/open.c:1396 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7ffab0d164d9 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffe1a7e6bb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000055 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ffab0d164d9 RDX: 00007ffab0d164d9 RSI: 0000000000000000 RDI: 0000000020000180 RBP: 00007ffab0cd5a10 R08: 0000000000000000 R09: 0000000000000000 R10: 00005555573552c0 R11: 0000000000000246 R12: 00007ffab0cd5aa0 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> Allocated by task 3610: kasan_save_stack mm/kasan/common.c:45 [inline] kasan_set_track+0x3d/0x60 mm/kasan/common.c:52 ____kasan_kmalloc mm/kasan/common.c:371 [inline] __kasan_kmalloc+0x97/0xb0 mm/kasan/common.c:380 kmalloc include/linux/slab.h:576 [inline] udf_find_entry+0x7b6/0x14f0 fs/udf/namei.c:243 udf_lookup+0xef/0x340 fs/udf/namei.c:309 lookup_open fs/namei.c:3391 [inline] open_last_lookups fs/namei.c:3481 [inline] path_openat+0x10e6/0x2df0 fs/namei.c:3710 do_filp_open+0x264/0x4f0 fs/namei.c:3740 do_sys_openat2+0x124/0x4e0 fs/open.c:1310 do_sys_open fs/open.c:1326 [inline] __do_sys_creat fs/open.c:1402 [inline] __se_sys_creat fs/open.c:1396 [inline] __x64_sys_creat+0x11f/0x160 fs/open.c:1396 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd The buggy address belongs to the object at ffff8880123ff800 which belongs to the cache kmalloc-256 of size 256 The buggy address is located 150 bytes inside of 256-byte region [ffff8880123ff800, ffff8880123ff900) The buggy address belongs to the physical page: page:ffffea000048ff80 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x123fe head:ffffea000048ff80 order:1 compound_mapcount:0 compound_pincount:0 flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff) raw: 00fff00000010200 ffffea00004b8500 dead000000000003 ffff888012041b40 raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x0(), pid 1, tgid 1 (swapper/0), ts 1841222404, free_ts 0 create_dummy_stack mm/page_owner.c:67 [inline] register_early_stack+0x77/0xd0 mm/page_owner.c:83 init_page_owner+0x3a/0x731 mm/page_owner.c:93 kernel_init_freeable+0x41c/0x5d5 init/main.c:1629 kernel_init+0x19/0x2b0 init/main.c:1519 page_owner free stack trace missing Memory state around the buggy address: ffff8880123ff780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff8880123ff800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff8880123ff880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 06 ^ ffff8880123ff900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff8880123ff980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ================================================================== Fix this by changing the memory size allocated for copy_name from UDF_NAME_LEN(254) to UDF_NAME_LEN_CS0(255), because the total length (lfi) of subsequent memcpy can be up to 255. CC: stable@vger.kernel.org Reported-by: syzbot+69c9fdccc6dd08961d34@syzkaller.appspotmail.com Fixes: 066b9cded00b ("udf: Use separate buffer for copying split names") Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221109013542.442790-1-zhangpeng362@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: selftests: fix wrong error check in btrfs_free_dummy_root()Zhang Xiaoxu2022-11-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 9b2f20344d450137d015b380ff0c2e2a6a170135 upstream. The btrfs_alloc_dummy_root() uses ERR_PTR as the error return value rather than NULL, if error happened, there will be a NULL pointer dereference: BUG: KASAN: null-ptr-deref in btrfs_free_dummy_root+0x21/0x50 [btrfs] Read of size 8 at addr 000000000000002c by task insmod/258926 CPU: 2 PID: 258926 Comm: insmod Tainted: G W 6.1.0-rc2+ #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x34/0x44 kasan_report+0xb7/0x140 kasan_check_range+0x145/0x1a0 btrfs_free_dummy_root+0x21/0x50 [btrfs] btrfs_test_free_space_cache+0x1a8c/0x1add [btrfs] btrfs_run_sanity_tests+0x65/0x80 [btrfs] init_btrfs_fs+0xec/0x154 [btrfs] do_one_initcall+0x87/0x2a0 do_init_module+0xdf/0x320 load_module+0x3006/0x3390 __do_sys_finit_module+0x113/0x1b0 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Fixes: aaedb55bc08f ("Btrfs: add tests for btrfs_get_extent") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* nilfs2: fix deadlock in nilfs_count_free_blocks()Ryusuke Konishi2022-11-251-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8ac932a4921a96ca52f61935dbba64ea87bbd5dc upstream. A semaphore deadlock can occur if nilfs_get_block() detects metadata corruption while locating data blocks and a superblock writeback occurs at the same time: task 1 task 2 ------ ------ * A file operation * nilfs_truncate() nilfs_get_block() down_read(rwsem A) <-- nilfs_bmap_lookup_contig() ... generic_shutdown_super() nilfs_put_super() * Prepare to write superblock * down_write(rwsem B) <-- nilfs_cleanup_super() * Detect b-tree corruption * nilfs_set_log_cursor() nilfs_bmap_convert_error() nilfs_count_free_blocks() __nilfs_error() down_read(rwsem A) <-- nilfs_set_error() down_write(rwsem B) <-- *** DEADLOCK *** Here, nilfs_get_block() readlocks rwsem A (= NILFS_MDT(dat_inode)->mi_sem) and then calls nilfs_bmap_lookup_contig(), but if it fails due to metadata corruption, __nilfs_error() is called from nilfs_bmap_convert_error() inside the lock section. Since __nilfs_error() calls nilfs_set_error() unless the filesystem is read-only and nilfs_set_error() attempts to writelock rwsem B (= nilfs->ns_sem) to write back superblock exclusively, hierarchical lock acquisition occurs in the order rwsem A -> rwsem B. Now, if another task starts updating the superblock, it may writelock rwsem B during the lock sequence above, and can deadlock trying to readlock rwsem A in nilfs_count_free_blocks(). However, there is actually no need to take rwsem A in nilfs_count_free_blocks() because it, within the lock section, only reads a single integer data on a shared struct with nilfs_sufile_get_ncleansegs(). This has been the case after commit aa474a220180 ("nilfs2: add local variable to cache the number of clean segments"), that is, even before this bug was introduced. So, this resolves the deadlock problem by just not taking the semaphore in nilfs_count_free_blocks(). Link: https://lkml.kernel.org/r/20221029044912.9139-1-konishi.ryusuke@gmail.com Fixes: e828949e5b42 ("nilfs2: call nilfs_error inside bmap routines") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+45d6ce7b7ad7ef455d03@syzkaller.appspotmail.com Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> [2.6.38+ Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: fix warning in 'ext4_da_release_space'Ye Bin2022-11-101-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1b8f787ef547230a3249bcf897221ef0cc78481b upstream. Syzkaller report issue as follows: EXT4-fs (loop0): Free/Dirty block details EXT4-fs (loop0): free_blocks=0 EXT4-fs (loop0): dirty_blocks=0 EXT4-fs (loop0): Block reservation details EXT4-fs (loop0): i_reserved_data_blocks=0 EXT4-fs warning (device loop0): ext4_da_release_space:1527: ext4_da_release_space: ino 18, to_free 1 with only 0 reserved data blocks ------------[ cut here ]------------ WARNING: CPU: 0 PID: 92 at fs/ext4/inode.c:1528 ext4_da_release_space+0x25e/0x370 fs/ext4/inode.c:1524 Modules linked in: CPU: 0 PID: 92 Comm: kworker/u4:4 Not tainted 6.0.0-syzkaller-09423-g493ffd6605b2 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022 Workqueue: writeback wb_workfn (flush-7:0) RIP: 0010:ext4_da_release_space+0x25e/0x370 fs/ext4/inode.c:1528 RSP: 0018:ffffc900015f6c90 EFLAGS: 00010296 RAX: 42215896cd52ea00 RBX: 0000000000000000 RCX: 42215896cd52ea00 RDX: 0000000000000000 RSI: 0000000080000001 RDI: 0000000000000000 RBP: 1ffff1100e907d96 R08: ffffffff816aa79d R09: fffff520002bece5 R10: fffff520002bece5 R11: 1ffff920002bece4 R12: ffff888021fd2000 R13: ffff88807483ecb0 R14: 0000000000000001 R15: ffff88807483e740 FS: 0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005555569ba628 CR3: 000000000c88e000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ext4_es_remove_extent+0x1ab/0x260 fs/ext4/extents_status.c:1461 mpage_release_unused_pages+0x24d/0xef0 fs/ext4/inode.c:1589 ext4_writepages+0x12eb/0x3be0 fs/ext4/inode.c:2852 do_writepages+0x3c3/0x680 mm/page-writeback.c:2469 __writeback_single_inode+0xd1/0x670 fs/fs-writeback.c:1587 writeback_sb_inodes+0xb3b/0x18f0 fs/fs-writeback.c:1870 wb_writeback+0x41f/0x7b0 fs/fs-writeback.c:2044 wb_do_writeback fs/fs-writeback.c:2187 [inline] wb_workfn+0x3cb/0xef0 fs/fs-writeback.c:2227 process_one_work+0x877/0xdb0 kernel/workqueue.c:2289 worker_thread+0xb14/0x1330 kernel/workqueue.c:2436 kthread+0x266/0x300 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306 </TASK> Above issue may happens as follows: ext4_da_write_begin ext4_create_inline_data ext4_clear_inode_flag(inode, EXT4_INODE_EXTENTS); ext4_set_inode_flag(inode, EXT4_INODE_INLINE_DATA); __ext4_ioctl ext4_ext_migrate -> will lead to eh->eh_entries not zero, and set extent flag ext4_da_write_begin ext4_da_convert_inline_data_to_extent ext4_da_write_inline_data_begin ext4_da_map_blocks ext4_insert_delayed_block if (!ext4_es_scan_clu(inode, &ext4_es_is_delonly, lblk)) if (!ext4_es_scan_clu(inode, &ext4_es_is_mapped, lblk)) ext4_clu_mapped(inode, EXT4_B2C(sbi, lblk)); -> will return 1 allocated = true; ext4_es_insert_delayed_block(inode, lblk, allocated); ext4_writepages mpage_map_and_submit_extent(handle, &mpd, &give_up_on_write); -> return -ENOSPC mpage_release_unused_pages(&mpd, give_up_on_write); -> give_up_on_write == 1 ext4_es_remove_extent ext4_da_release_space(inode, reserved); if (unlikely(to_free > ei->i_reserved_data_blocks)) -> to_free == 1 but ei->i_reserved_data_blocks == 0 -> then trigger warning as above To solve above issue, forbid inode do migrate which has inline data. Cc: stable@kernel.org Reported-by: syzbot+c740bb18df70ad00952e@syzkaller.appspotmail.com Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221018022701.683489-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: fix type of parameter generation in btrfs_get_dentryDavid Sterba2022-11-102-2/+2
| | | | | | | | | | | | | commit 2398091f9c2c8e0040f4f9928666787a3e8108a7 upstream. The type of parameter generation has been u32 since the beginning, however all callers pass a u64 generation, so unify the types to prevent potential loss. CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* btrfs: fix ulist leaks in error paths of qgroup self testsFilipe Manana2022-11-101-5/+15
| | | | | | | | | | | | | | | [ Upstream commit d37de92b38932d40e4a251e876cc388f9aee5f42 ] In the test_no_shared_qgroup() and test_multiple_refs() qgroup self tests, if we fail to add the tree ref, remove the extent item or remove the extent ref, we are returning from the test function without freeing the "old_roots" ulist that was allocated by the previous calls to btrfs_find_all_roots(). Fix that by calling ulist_free() before returning. Fixes: 442244c96332 ("btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism.") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* nfs4: Fix kmemleak when allocate slot failedZhang Xiaoxu2022-11-101-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 7e8436728e22181c3f12a5dbabd35ed3a8b8c593 ] If one of the slot allocate failed, should cleanup all the other allocated slots, otherwise, the allocated slots will leak: unreferenced object 0xffff8881115aa100 (size 64): comm ""mount.nfs"", pid 679, jiffies 4294744957 (age 115.037s) hex dump (first 32 bytes): 00 cc 19 73 81 88 ff ff 00 a0 5a 11 81 88 ff ff ...s......Z..... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<000000007a4c434a>] nfs4_find_or_create_slot+0x8e/0x130 [<000000005472a39c>] nfs4_realloc_slot_table+0x23f/0x270 [<00000000cd8ca0eb>] nfs40_init_client+0x4a/0x90 [<00000000128486db>] nfs4_init_client+0xce/0x270 [<000000008d2cacad>] nfs4_set_client+0x1a2/0x2b0 [<000000000e593b52>] nfs4_create_server+0x300/0x5f0 [<00000000e4425dd2>] nfs4_try_get_tree+0x65/0x110 [<00000000d3a6176f>] vfs_get_tree+0x41/0xf0 [<0000000016b5ad4c>] path_mount+0x9b3/0xdd0 [<00000000494cae71>] __x64_sys_mount+0x190/0x1d0 [<000000005d56bdec>] do_syscall_64+0x35/0x80 [<00000000687c9ae4>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 Fixes: abf79bb341bf ("NFS: Add a slot table to struct nfs_client for NFSv4.0 transport blocking") Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* NFSv4.1: We must always send RECLAIM_COMPLETE after a rebootTrond Myklebust2022-11-101-0/+1
| | | | | | | | | | | | | [ Upstream commit e59679f2b7e522ecad99974e5636291ffd47c184 ] Currently, we are only guaranteed to send RECLAIM_COMPLETE if we have open state to recover. Fix the client to always send RECLAIM_COMPLETE after setting up the lease. Fixes: fce5c838e133 ("nfs41: RECLAIM_COMPLETE functionality") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* NFSv4.1: Handle RECLAIM_COMPLETE trunking errorsTrond Myklebust2022-11-101-0/+1
| | | | | | | | | | | | [ Upstream commit 5d917cba3201e5c25059df96c29252fd99c4f6a7 ] If RECLAIM_COMPLETE sets the NFS4CLNT_BIND_CONN_TO_SESSION flag, then we need to loop back in order to handle it. Fixes: 0048fdd06614 ("NFSv4.1: RECLAIM_COMPLETE must handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* kernfs: fix use-after-free in __kernfs_removeChristian A. Ehrhardt2022-11-031-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 4abc99652812a2ddf932f137515d5c5a04723538 upstream. Syzkaller managed to trigger concurrent calls to kernfs_remove_by_name_ns() for the same file resulting in a KASAN detected use-after-free. The race occurs when the root node is freed during kernfs_drain(). To prevent this acquire an additional reference for the root of the tree that is removed before calling __kernfs_remove(). Found by syzkaller with the following reproducer (slab_nomerge is required): syz_mount_image$ext4(0x0, &(0x7f0000000100)='./file0\x00', 0x100000, 0x0, 0x0, 0x0, 0x0) r0 = openat(0xffffffffffffff9c, &(0x7f0000000080)='/proc/self/exe\x00', 0x0, 0x0) close(r0) pipe2(&(0x7f0000000140)={0xffffffffffffffff, <r1=>0xffffffffffffffff}, 0x800) mount$9p_fd(0x0, &(0x7f0000000040)='./file0\x00', &(0x7f00000000c0), 0x408, &(0x7f0000000280)={'trans=fd,', {'rfdno', 0x3d, r0}, 0x2c, {'wfdno', 0x3d, r1}, 0x2c, {[{@cache_loose}, {@mmap}, {@loose}, {@loose}, {@mmap}], [{@mask={'mask', 0x3d, '^MAY_EXEC'}}, {@fsmagic={'fsmagic', 0x3d, 0x10001}}, {@dont_hash}]}}) Sample report: ================================================================== BUG: KASAN: use-after-free in kernfs_type include/linux/kernfs.h:335 [inline] BUG: KASAN: use-after-free in kernfs_leftmost_descendant fs/kernfs/dir.c:1261 [inline] BUG: KASAN: use-after-free in __kernfs_remove.part.0+0x843/0x960 fs/kernfs/dir.c:1369 Read of size 2 at addr ffff8880088807f0 by task syz-executor.2/857 CPU: 0 PID: 857 Comm: syz-executor.2 Not tainted 6.0.0-rc3-00363-g7726d4c3e60b #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x6e/0x91 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:317 [inline] print_report.cold+0x5e/0x5e5 mm/kasan/report.c:433 kasan_report+0xa3/0x130 mm/kasan/report.c:495 kernfs_type include/linux/kernfs.h:335 [inline] kernfs_leftmost_descendant fs/kernfs/dir.c:1261 [inline] __kernfs_remove.part.0+0x843/0x960 fs/kernfs/dir.c:1369 __kernfs_remove fs/kernfs/dir.c:1356 [inline] kernfs_remove_by_name_ns+0x108/0x190 fs/kernfs/dir.c:1589 sysfs_slab_add+0x133/0x1e0 mm/slub.c:5943 __kmem_cache_create+0x3e0/0x550 mm/slub.c:4899 create_cache mm/slab_common.c:229 [inline] kmem_cache_create_usercopy+0x167/0x2a0 mm/slab_common.c:335 p9_client_create+0xd4d/0x1190 net/9p/client.c:993 v9fs_session_init+0x1e6/0x13c0 fs/9p/v9fs.c:408 v9fs_mount+0xb9/0xbd0 fs/9p/vfs_super.c:126 legacy_get_tree+0xf1/0x200 fs/fs_context.c:610 vfs_get_tree+0x85/0x2e0 fs/super.c:1530 do_new_mount fs/namespace.c:3040 [inline] path_mount+0x675/0x1d00 fs/namespace.c:3370 do_mount fs/namespace.c:3383 [inline] __do_sys_mount fs/namespace.c:3591 [inline] __se_sys_mount fs/namespace.c:3568 [inline] __x64_sys_mount+0x282/0x300 fs/namespace.c:3568 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f725f983aed Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f725f0f7028 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00007f725faa3f80 RCX: 00007f725f983aed RDX: 00000000200000c0 RSI: 0000000020000040 RDI: 0000000000000000 RBP: 00007f725f9f419c R08: 0000000020000280 R09: 0000000000000000 R10: 0000000000000408 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000006 R14: 00007f725faa3f80 R15: 00007f725f0d7000 </TASK> Allocated by task 855: kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:45 [inline] set_alloc_info mm/kasan/common.c:437 [inline] __kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:470 kasan_slab_alloc include/linux/kasan.h:224 [inline] slab_post_alloc_hook mm/slab.h:727 [inline] slab_alloc_node mm/slub.c:3243 [inline] slab_alloc mm/slub.c:3251 [inline] __kmem_cache_alloc_lru mm/slub.c:3258 [inline] kmem_cache_alloc+0xbf/0x200 mm/slub.c:3268 kmem_cache_zalloc include/linux/slab.h:723 [inline] __kernfs_new_node+0xd4/0x680 fs/kernfs/dir.c:593 kernfs_new_node fs/kernfs/dir.c:655 [inline] kernfs_create_dir_ns+0x9c/0x220 fs/kernfs/dir.c:1010 sysfs_create_dir_ns+0x127/0x290 fs/sysfs/dir.c:59 create_dir lib/kobject.c:63 [inline] kobject_add_internal+0x24a/0x8d0 lib/kobject.c:223 kobject_add_varg lib/kobject.c:358 [inline] kobject_init_and_add+0x101/0x160 lib/kobject.c:441 sysfs_slab_add+0x156/0x1e0 mm/slub.c:5954 __kmem_cache_create+0x3e0/0x550 mm/slub.c:4899 create_cache mm/slab_common.c:229 [inline] kmem_cache_create_usercopy+0x167/0x2a0 mm/slab_common.c:335 p9_client_create+0xd4d/0x1190 net/9p/client.c:993 v9fs_session_init+0x1e6/0x13c0 fs/9p/v9fs.c:408 v9fs_mount+0xb9/0xbd0 fs/9p/vfs_super.c:126 legacy_get_tree+0xf1/0x200 fs/fs_context.c:610 vfs_get_tree+0x85/0x2e0 fs/super.c:1530 do_new_mount fs/namespace.c:3040 [inline] path_mount+0x675/0x1d00 fs/namespace.c:3370 do_mount fs/namespace.c:3383 [inline] __do_sys_mount fs/namespace.c:3591 [inline] __se_sys_mount fs/namespace.c:3568 [inline] __x64_sys_mount+0x282/0x300 fs/namespace.c:3568 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Freed by task 857: kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38 kasan_set_track+0x21/0x30 mm/kasan/common.c:45 kasan_set_free_info+0x20/0x40 mm/kasan/generic.c:370 ____kasan_slab_free mm/kasan/common.c:367 [inline] ____kasan_slab_free mm/kasan/common.c:329 [inline] __kasan_slab_free+0x108/0x190 mm/kasan/common.c:375 kasan_slab_free include/linux/kasan.h:200 [inline] slab_free_hook mm/slub.c:1754 [inline] slab_free_freelist_hook mm/slub.c:1780 [inline] slab_free mm/slub.c:3534 [inline] kmem_cache_free+0x9c/0x340 mm/slub.c:3551 kernfs_put.part.0+0x2b2/0x520 fs/kernfs/dir.c:547 kernfs_put+0x42/0x50 fs/kernfs/dir.c:521 __kernfs_remove.part.0+0x72d/0x960 fs/kernfs/dir.c:1407 __kernfs_remove fs/kernfs/dir.c:1356 [inline] kernfs_remove_by_name_ns+0x108/0x190 fs/kernfs/dir.c:1589 sysfs_slab_add+0x133/0x1e0 mm/slub.c:5943 __kmem_cache_create+0x3e0/0x550 mm/slub.c:4899 create_cache mm/slab_common.c:229 [inline] kmem_cache_create_usercopy+0x167/0x2a0 mm/slab_common.c:335 p9_client_create+0xd4d/0x1190 net/9p/client.c:993 v9fs_session_init+0x1e6/0x13c0 fs/9p/v9fs.c:408 v9fs_mount+0xb9/0xbd0 fs/9p/vfs_super.c:126 legacy_get_tree+0xf1/0x200 fs/fs_context.c:610 vfs_get_tree+0x85/0x2e0 fs/super.c:1530 do_new_mount fs/namespace.c:3040 [inline] path_mount+0x675/0x1d00 fs/namespace.c:3370 do_mount fs/namespace.c:3383 [inline] __do_sys_mount fs/namespace.c:3591 [inline] __se_sys_mount fs/namespace.c:3568 [inline] __x64_sys_mount+0x282/0x300 fs/namespace.c:3568 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd The buggy address belongs to the object at ffff888008880780 which belongs to the cache kernfs_node_cache of size 128 The buggy address is located 112 bytes inside of 128-byte region [ffff888008880780, ffff888008880800) The buggy address belongs to the physical page: page:00000000732833f8 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8880 flags: 0x100000000000200(slab|node=0|zone=1) raw: 0100000000000200 0000000000000000 dead000000000122 ffff888001147280 raw: 0000000000000000 0000000000150015 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888008880680: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb ffff888008880700: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc >ffff888008880780: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888008880800: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb ffff888008880880: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc ================================================================== Acked-by: Tejun Heo <tj@kernel.org> Cc: stable <stable@kernel.org> # -rc3 Signed-off-by: Christian A. Ehrhardt <lk@c--e.de> Link: https://lore.kernel.org/r/20220913121723.691454-1-lk@c--e.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ocfs2: fix BUG when iput after ocfs2_mknod failsJoseph Qi2022-11-031-10/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 759a7c6126eef5635506453e9b9d55a6a3ac2084 upstream. Commit b1529a41f777 "ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error" tried to reclaim the claimed inode if __ocfs2_mknod_locked() fails later. But this introduce a race, the freed bit may be reused immediately by another thread, which will update dinode, e.g. i_generation. Then iput this inode will lead to BUG: inode->i_generation != le32_to_cpu(fe->i_generation) We could make this inode as bad, but we did want to do operations like wipe in some cases. Since the claimed inode bit can only affect that an dinode is missing and will return back after fsck, it seems not a big problem. So just leave it as is by revert the reclaim logic. Link: https://lkml.kernel.org/r/20221017130227.234480-1-joseph.qi@linux.alibaba.com Fixes: b1529a41f777 ("ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error") Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reported-by: Yan Wang <wangyan122@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ocfs2: clear dinode links count in case of errorJoseph Qi2022-11-031-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | commit 28f4821b1b53e0649706912e810c6c232fc506f9 upstream. In ocfs2_mknod(), if error occurs after dinode successfully allocated, ocfs2 i_links_count will not be 0. So even though we clear inode i_nlink before iput in error handling, it still won't wipe inode since we'll refresh inode from dinode during inode lock. So just like clear inode i_nlink, we clear ocfs2 i_links_count as well. Also do the same change for ocfs2_symlink(). Link: https://lkml.kernel.org/r/20221017130227.234480-2-joseph.qi@linux.alibaba.com Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reported-by: Yan Wang <wangyan122@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>