summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* block: simplify diskstats_showChristoph Hellwig2021-04-081-4/+6
| | | | | | | | | Just use xa_for_each to iterate over the partitions as there is no need to grab a reference to each partition. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: simplify show_partitionChristoph Hellwig2021-04-081-6/+7
| | | | | | | | | Just use xa_for_each to iterate over the partitions as there is no need to grab a reference to each partition. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: simplify printk_all_partitionsChristoph Hellwig2021-04-081-17/+16
| | | | | | | | | Just use xa_for_each to iterate over the partitions as there is no need to grab a reference to each partition. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: simplify partition_overlapsChristoph Hellwig2021-04-081-10/+10
| | | | | | | | | Just use xa_for_each to iterate over the partitions as there is no need to grab a reference to each partition. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: simplify partition removalChristoph Hellwig2021-04-081-4/+6
| | | | | | | | | Always look up the first available entry instead of the complicated stateful traversal. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: take bd_mutex around delete_partitions in del_gendiskChristoph Hellwig2021-04-082-0/+6
| | | | | | | | | | There is nothing preventing an ioctl from trying do delete partition concurrenly with del_gendisk, so take open_mutex to serialize against that. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: refactor blk_drop_partitionsChristoph Hellwig2021-04-085-26/+10
| | | | | | | | | | Move the busy check and disk-wide sync into the only caller, so that the remainder can be shared with del_gendisk. Also pass the gendisk instead of the bdev as that is all that is needed. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: move more syncing and invalidation to delete_partitionChristoph Hellwig2021-04-082-7/+4
| | | | | | | | | | | | Move the calls to fsync_bdev and __invalidate_device from del_gendisk to delete_partition. For the other two callers that check that there are no openers for the delete partitions(s) the callouts are a no-op as no file system can be mounted, but this keeps all the cleanup in one place. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove invalidate_partitionChristoph Hellwig2021-04-081-14/+11
| | | | | | | | | | invalidate_partition has two callers, one of which already performs the remove_inode_hash just after the call. Just open code the function in the two callsites. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406062303.811835-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* dasd: use bdev_disk_changed instead of blk_drop_partitionsChristoph Hellwig2021-04-082-6/+1
| | | | | | | | | | | Use the more general interface - the behavior is the same except that now a change uevent is sent, which is the right thing to do when the device becomes unusable. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Stefan Haberland <sth@linux.ibm.com> Link: https://lore.kernel.org/r/20210406062303.811835-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-zoned: Remove the definition of blk_zone_start()Bart Van Assche2021-04-071-8/+0
| | | | | | | | | | | | | | Commit e76239a3748c ("block: add a report_zones method") removed the last blk_zone_start() call. Hence also remove the definition of this function. Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210406200820.15180-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: set default elevator as deadline in case of hctx shared tagsetMing Lei2021-04-071-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | Yanhui found that write performance is degraded a lot after applying hctx shared tagset on one test machine with megaraid_sas. And turns out it is caused by none scheduler which becomes default elevator caused by hctx shared tagset patchset. Given more scsi HBAs will apply hctx shared tagset, and the similar performance exists for them too. So keep previous behavior by still using default mq-deadline for queues which apply hctx shared tagset, just like before. Fixes: 32bc15afed04 ("blk-mq: Facilitate a shared sbitmap per tagset") Reported-by: Yanhui Ma <yama@redhat.com> Cc: John Garry <john.garry@huawei.com> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/20210406031933.767228-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: stop calling blk_queue_bounce for passthrough requestsChristoph Hellwig2021-04-065-99/+36
| | | | | | | | | | | | | Instead of overloading the passthrough fast path with the deprecated block layer bounce buffering let the users that combine an old undermaintained driver with a highmem system pay the price by always falling back to copies in that case. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210331073001.46776-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: refactor the bounce buffering codeChristoph Hellwig2021-04-065-90/+38
| | | | | | | | | | | | | | Get rid of all the PFN arithmetics and just use an enum for the two remaining options, and use PageHighMem for the actual bounce decision. Add a fast path to entirely avoid the call for the common case of a queue not using the legacy bouncing code. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210331073001.46776-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove BLK_BOUNCE_ISA supportChristoph Hellwig2021-04-069-133/+35
| | | | | | | | | | Remove the BLK_BOUNCE_ISA support now that all users are gone. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210331073001.46776-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* scsi: remove the unchecked_isa_dma flagChristoph Hellwig2021-04-0613-105/+25
| | | | | | | | | | Remove the unchecked_isa_dma now that all users are gone. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210331073001.46776-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* advansys: remove ISA supportChristoph Hellwig2021-04-061-289/+32
| | | | | | | | | | | | | | This is the last piece in the kernel requiring the block layer ISA bounce buffering, and it does not actually look used. So remove it to see if anyone screams, in which case we'll need to find a solution to fix it back up. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210331073001.46776-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* BusLogic: reject broken old firmware that requires ISA-style bounce bufferingChristoph Hellwig2021-04-062-16/+6
| | | | | | | | | | | | Warn on and don't support adapters that have a DMA bug that forces ISA-style bounce buffering. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Khalid Aziz <khalid@gonehiking.org> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210331073001.46776-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Buslogic: remove ISA supportChristoph Hellwig2021-04-064-215/+7
| | | | | | | | | | | | | The ISA support in Buslogic has been broken for a long time, as all the I/O path expects a struct device for DMA mapping that is derived from the PCI device, which would simply crash for ISA adapters. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Khalid Aziz <khalid@gonehiking.org> Link: https://lore.kernel.org/r/20210331073001.46776-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* aha1542: use a local bounce bufferChristoph Hellwig2021-04-061-48/+57
| | | | | | | | | | | | | | | To remove the last user of the unchecked_isa_dma flag and thus the block layer ISA bounce buffering switch this driver to use its own local bounce buffer. This has the effect of not needing the chain indirection and supporting and unlimited number of segments. It does however limit the transfer size for each command to something that can be reasonable allocated by dma_alloc_coherent like 8K. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210331073001.46776-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Always use blk_mq_is_sbitmap_sharedNikolay Borisov2021-04-061-2/+2
| | | | | | | | | Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20210311081713.2763171-1-nborisov@suse.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: add sysfs entry for virt boundary maskMax Gurtovoy2021-04-061-1/+8
| | | | | | | | | | | This entry will expose the bio vector alignment mask for a specific block device. Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20210405132012.12504-1-mgurtovoy@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block, bfq: merge bursts of newly-created queuesPaolo Valente2021-03-253-10/+266
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Many throughput-sensitive workloads are made of several parallel I/O flows, with all flows generated by the same application, or more generically by the same task (e.g., system boot). The most counterproductive action with these workloads is plugging I/O dispatch when one of the bfq_queues associated with these flows remains temporarily empty. To avoid this plugging, BFQ has been using a burst-handling mechanism for years now. This mechanism has proven effective for throughput, and not detrimental for service guarantees. This commit pushes this mechanism a little bit further, basing on the following two facts. First, all the I/O flows of a the same application or task contribute to the execution/completion of that common application or task. So the performance figures that matter are total throughput of the flows and task-wide I/O latency. In particular, these flows do not need to be protected from each other, in terms of individual bandwidth or latency. Second, the above fact holds regardless of the number of flows. Putting these two facts together, this commits merges stably the bfq_queues associated with these I/O flows, i.e., with the processes that generate these IO/ flows, regardless of how many the involved processes are. To decide whether a set of bfq_queues is actually associated with the I/O flows of a common application or task, and to merge these queues stably, this commit operates as follows: given a bfq_queue, say Q2, currently being created, and the last bfq_queue, say Q1, created before Q2, Q2 is merged stably with Q1 if - very little time has elapsed since when Q1 was created - Q2 has the same ioprio as Q1 - Q2 belongs to the same group as Q1 Merging bfq_queues also reduces scheduling overhead. A fio test with ten random readers on /dev/nullb shows a throughput boost of 40%, with a quadcore. Since BFQ's execution time amounts to ~50% of the total per-request processing time, the above throughput boost implies that BFQ's overhead is reduced by more than 50%. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-7-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block, bfq: keep shared queues out of the waker mechanismPaolo Valente2021-03-251-1/+11
| | | | | | | | | | | | | | | Shared queues are likely to receive I/O at a high rate. This may deceptively let them be considered as wakers of other queues. But a false waker will unjustly steal bandwidth to its supposedly woken queue. So considering also shared queues in the waking mechanism may cause more control troubles than throughput benefits. This commit keeps shared queues out of the waker-detection mechanism. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-6-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block, bfq: fix weight-raising resume with !low_latencyPaolo Valente2021-03-251-2/+8
| | | | | | | | | | | | | When the io_latency heuristic is off, bfq_queues must not start to be weight-raised. Unfortunately, by mistake, this may happen when the state of a previously weight-raised bfq_queue is resumed after a queue split. This commit fixes this error. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-5-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block, bfq: make shared queues inherit wakersPaolo Valente2021-03-251-3/+39
| | | | | | | | | | | | | | | | Consider a bfq_queue bfqq that is about to be merged with another bfq_queue new_bfqq. The processes associated with bfqq are cooperators of the processes associated with new_bfqq. So, if bfqq has a waker, then it is reasonable (and beneficial for throughput) to assume that all these processes will be happy to let bfqq's waker freely inject I/O when they have no I/O. So this commit makes new_bfqq inherit bfqq's waker. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-4-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block, bfq: put reqs of waker and woken in dispatch listPaolo Valente2021-03-251-1/+43
| | | | | | | | | | | | | | | | | Consider a new I/O request that arrives for a bfq_queue bfqq. If, when this happens, the only active bfq_queues are bfqq and either its waker bfq_queue or one of its woken bfq_queues, then there is no point in queueing this new I/O request in bfqq for service. In fact, the in-service queue and bfqq agree on serving this new I/O request as soon as possible. So this commit puts this new I/O request directly into the dispatch list. Tested-by: Jan Kara <jack@suse.cz> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-3-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block, bfq: always inject I/O of queues blocked by wakersPaolo Valente2021-03-252-5/+35
| | | | | | | | | | | | | | | | | | Suppose that I/O dispatch is plugged, to wait for new I/O for the in-service bfq-queue, say bfqq. Suppose then that there is a further bfq_queue woken by bfqq, and that this woken queue has pending I/O. A woken queue does not steal bandwidth from bfqq, because it remains soon without I/O if bfqq is not served. So there is virtually no risk of loss of bandwidth for bfqq if this woken queue has I/O dispatched while bfqq is waiting for new I/O. In contrast, this extra I/O injection boosts throughput. This commit performs this extra injection. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-2-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Sentence reconstruct for better readabilityBhaskar Chowdhury2021-03-251-2/+2
| | | | | | | Sentence reconstruction for better readability. Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Linux 5.12-rc4v5.12-rc4Linus Torvalds2021-03-211-1/+1
|
* Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds2021-03-2111-72/+168
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Miscellaneous ext4 bug fixes for v5.12" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: initialize ret to suppress smatch warning ext4: stop inode update before return ext4: fix rename whiteout with fast commit ext4: fix timer use-after-free on failed mount ext4: fix potential error in ext4_do_update_inode ext4: do not try to set xattr into ea_inode if value is empty ext4: do not iput inode under running transaction in ext4_rename() ext4: find old entry again if failed to rename whiteout ext4: fix error handling in ext4_end_enable_verity() ext4: fix bh ref count on error paths fs/ext4: fix integer overflow in s_log_groups_per_flex ext4: add reclaim checks to xattr code ext4: shrink race window in ext4_should_retry_alloc()
| * ext4: initialize ret to suppress smatch warningTheodore Ts'o2021-03-211-1/+1
| | | | | | | | Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: stop inode update before returnPan Bian2021-03-211-1/+3
| | | | | | | | | | | | | | | | | | | | | | The inode update should be stopped before returing the error code. Signed-off-by: Pan Bian <bianpan2016@163.com> Link: https://lore.kernel.org/r/20210117085732.93788-1-bianpan2016@163.com Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Cc: stable@kernel.org Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: fix rename whiteout with fast commitHarshad Shirwadkar2021-03-213-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds rename whiteout support in fast commits. Note that the whiteout object that gets created is actually char device. Which imples, the function ext4_inode_journal_mode(struct inode *inode) would return "JOURNAL_DATA" for this inode. This has a consequence in fast commit code that it will make creation of the whiteout object a fast-commit ineligible behavior and thus will fall back to full commits. With this patch, this can be observed by running fast commits with rename whiteout and seeing the stats generated by ext4_fc_stats tracepoint as follows: ext4_fc_stats: dev 254:32 fc ineligible reasons: XATTR:0, CROSS_RENAME:0, JOURNAL_FLAG_CHANGE:0, NO_MEM:0, SWAP_BOOT:0, RESIZE:0, RENAME_DIR:0, FALLOC_RANGE:0, INODE_JOURNAL_DATA:16; num_commits:6, ineligible: 6, numblks: 3 So in short, this patch guarantees that in case of rename whiteout, we fall back to full commits. Amir mentioned that instead of creating a new whiteout object for every rename, we can create a static whiteout object with irrelevant nlink. That will make fast commits to not fall back to full commit. But until this happens, this patch will ensure correctness by falling back to full commits. Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Cc: stable@kernel.org Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20210316221921.1124955-1-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: fix timer use-after-free on failed mountJan Kara2021-03-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When filesystem mount fails because of corrupted filesystem we first cancel the s_err_report timer reminding fs errors every day and only then we flush s_error_work. However s_error_work may report another fs error and re-arm timer thus resulting in timer use-after-free. Fix the problem by first flushing the work and only after that canceling the s_err_report timer. Reported-by: syzbot+628472a2aac693ab0fcd@syzkaller.appspotmail.com Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210315165906.2175-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: fix potential error in ext4_do_update_inodeShijie Luo2021-03-211-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | If set_large_file = 1 and errors occur in ext4_handle_dirty_metadata(), the error code will be overridden, go to out_brelse to avoid this situation. Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Link: https://lore.kernel.org/r/20210312065051.36314-1-luoshijie1@huawei.com Cc: stable@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: do not try to set xattr into ea_inode if value is emptyzhangyi (F)2021-03-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzbot report a warning that ext4 may create an empty ea_inode if set an empty extent attribute to a file on the file system which is no free blocks left. WARNING: CPU: 6 PID: 10667 at fs/ext4/xattr.c:1640 ext4_xattr_set_entry+0x10f8/0x1114 fs/ext4/xattr.c:1640 ... Call trace: ext4_xattr_set_entry+0x10f8/0x1114 fs/ext4/xattr.c:1640 ext4_xattr_block_set+0x1d0/0x1b1c fs/ext4/xattr.c:1942 ext4_xattr_set_handle+0x8a0/0xf1c fs/ext4/xattr.c:2390 ext4_xattr_set+0x120/0x1f0 fs/ext4/xattr.c:2491 ext4_xattr_trusted_set+0x48/0x5c fs/ext4/xattr_trusted.c:37 __vfs_setxattr+0x208/0x23c fs/xattr.c:177 ... Now, ext4 try to store extent attribute into an external inode if ext4_xattr_block_set() return -ENOSPC, but for the case of store an empty extent attribute, store the extent entry into the extent attribute block is enough. A simple reproduce below. fallocate test.img -l 1M mkfs.ext4 -F -b 2048 -O ea_inode test.img mount test.img /mnt dd if=/dev/zero of=/mnt/foo bs=2048 count=500 setfattr -n "user.test" /mnt/foo Reported-by: syzbot+98b881fdd8ebf45ab4ae@syzkaller.appspotmail.com Fixes: 9c6e7853c531 ("ext4: reserve space for xattr entries/names") Cc: stable@kernel.org Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20210305120508.298465-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: do not iput inode under running transaction in ext4_rename()zhangyi (F)2021-03-211-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In ext4_rename(), when RENAME_WHITEOUT failed to add new entry into directory, it ends up dropping new created whiteout inode under the running transaction. After commit <9b88f9fb0d2> ("ext4: Do not iput inode under running transaction"), we follow the assumptions that evict() does not get called from a transaction context but in ext4_rename() it breaks this suggestion. Although it's not a real problem, better to obey it, so this patch add inode to orphan list and stop transaction before final iput(). Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20210303131703.330415-2-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: find old entry again if failed to rename whiteoutzhangyi (F)2021-03-211-2/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we failed to add new entry on rename whiteout, we cannot reset the old->de entry directly, because the old->de could have moved from under us during make indexed dir. So find the old entry again before reset is needed, otherwise it may corrupt the filesystem as below. /dev/sda: Entry '00000001' in ??? (12) has deleted/unused inode 15. CLEARED. /dev/sda: Unattached inode 75 /dev/sda: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY. Fixes: 6b4b8e6b4ad ("ext4: fix bug for rename with RENAME_WHITEOUT") Cc: stable@vger.kernel.org Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20210303131703.330415-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: fix error handling in ext4_end_enable_verity()Eric Biggers2021-03-111-34/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ext4 didn't properly clean up if verity failed to be enabled on a file: - It left verity metadata (pages past EOF) in the page cache, which would be exposed to userspace if the file was later extended. - It didn't truncate the verity metadata at all (either from cache or from disk) if an error occurred while setting the verity bit. Fix these bugs by adding a call to truncate_inode_pages() and ensuring that we truncate the verity metadata (both from cache and from disk) in all error paths. Also rework the code to cleanly separate the success path from the error paths, which makes it much easier to understand. Reported-by: Yunlei He <heyunlei@hihonor.com> Fixes: c93d8f885809 ("ext4: add basic fs-verity support") Cc: stable@vger.kernel.org # v5.4+ Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20210302200420.137977-2-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: fix bh ref count on error pathsZhaolong Zhang2021-03-061-3/+3
| | | | | | | | | | | | | | | | __ext4_journalled_writepage should drop bhs' ref count on error paths Signed-off-by: Zhaolong Zhang <zhangzl2013@126.com> Link: https://lore.kernel.org/r/1614678151-70481-1-git-send-email-zhangzl2013@126.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * fs/ext4: fix integer overflow in s_log_groups_per_flexSabyrzhan Tasbolatov2021-03-061-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot found UBSAN: shift-out-of-bounds in ext4_mb_init [1], when 1 << sbi->s_es->s_log_groups_per_flex is bigger than UINT_MAX, where sbi->s_mb_prefetch is unsigned integer type. 32 is the maximum allowed power of s_log_groups_per_flex. Following if check will also trigger UBSAN shift-out-of-bound: if (1 << sbi->s_es->s_log_groups_per_flex >= UINT_MAX) { So I'm checking it against the raw number, perhaps there is another way to calculate UINT_MAX max power. Also use min_t as to make sure it's uint type. [1] UBSAN: shift-out-of-bounds in fs/ext4/mballoc.c:2713:24 shift exponent 60 is too large for 32-bit type 'int' Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x137/0x1be lib/dump_stack.c:120 ubsan_epilogue lib/ubsan.c:148 [inline] __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395 ext4_mb_init_backend fs/ext4/mballoc.c:2713 [inline] ext4_mb_init+0x19bc/0x19f0 fs/ext4/mballoc.c:2898 ext4_fill_super+0xc2ec/0xfbe0 fs/ext4/super.c:4983 Reported-by: syzbot+a8b4b0c60155e87e9484@syzkaller.appspotmail.com Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210224095800.3350002-1-snovitoll@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: add reclaim checks to xattr codeJan Kara2021-03-061-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzbot is reporting that ext4 can enter fs reclaim from kvmalloc() while the transaction is started like: fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340 might_alloc include/linux/sched/mm.h:193 [inline] slab_pre_alloc_hook mm/slab.h:493 [inline] slab_alloc_node mm/slub.c:2817 [inline] __kmalloc_node+0x5f/0x430 mm/slub.c:4015 kmalloc_node include/linux/slab.h:575 [inline] kvmalloc_node+0x61/0xf0 mm/util.c:587 kvmalloc include/linux/mm.h:781 [inline] ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline] ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline] ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649 ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224 ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380 ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493 This should be impossible since transaction start sets PF_MEMALLOC_NOFS. Add some assertions to the code to catch if something isn't working as expected early. Link: https://lore.kernel.org/linux-ext4/000000000000563a0205bafb7970@google.com/ Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210222171626.21884-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * ext4: shrink race window in ext4_should_retry_alloc()Eric Whitney2021-03-064-12/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When generic/371 is run on kvm-xfstests using 5.10 and 5.11 kernels, it fails at significant rates on the two test scenarios that disable delayed allocation (ext3conv and data_journal) and force actual block allocation for the fallocate and pwrite functions in the test. The failure rate on 5.10 for both ext3conv and data_journal on one test system typically runs about 85%. On 5.11, the failure rate on ext3conv sometimes drops to as low as 1% while the rate on data_journal increases to nearly 100%. The observed failures are largely due to ext4_should_retry_alloc() cutting off block allocation retries when s_mb_free_pending (used to indicate that a transaction in progress will free blocks) is 0. However, free space is usually available when this occurs during runs of generic/371. It appears that a thread attempting to allocate blocks is just missing transaction commits in other threads that increase the free cluster count and reset s_mb_free_pending while the allocating thread isn't running. Explicitly testing for free space availability avoids this race. The current code uses a post-increment operator in the conditional expression that determines whether the retry limit has been exceeded. This means that the conditional expression uses the value of the retry counter before it's increased, resulting in an extra retry cycle. The current code actually retries twice before hitting its retry limit rather than once. Increasing the retry limit to 3 from the current actual maximum retry count of 2 in combination with the change described above reduces the observed failure rate to less that 0.1% on both ext3conv and data_journal with what should be limited impact on users sensitive to the overhead caused by retries. A per filesystem percpu counter exported via sysfs is added to allow users or developers to track the number of times the retry limit is exceeded without resorting to debugging methods. This should provide some insight into worst case retry behavior. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20210218151132.19678-1-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | Merge tag 'io_uring-5.12-2021-03-21' of git://git.kernel.dk/linux-blockLinus Torvalds2021-03-213-7/+31
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring followup fixes from Jens Axboe: - The SIGSTOP change from Eric, so we properly ignore that for PF_IO_WORKER threads. - Disallow sending signals to PF_IO_WORKER threads in general, we're not interested in having them funnel back to the io_uring owning task. - Stable fix from Stefan, ensuring we properly break links for short send/sendmsg recv/recvmsg if MSG_WAITALL is set. - Catch and loop when needing to run task_work before a PF_IO_WORKER threads goes to sleep. * tag 'io_uring-5.12-2021-03-21' of git://git.kernel.dk/linux-block: io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL io-wq: ensure task is running before processing task_work signal: don't allow STOP on PF_IO_WORKER threads signal: don't allow sending any signals to PF_IO_WORKER threads
| * | io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with ↵Stefan Metzmacher2021-03-211-4/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MSG_WAITALL Without that it's not safe to use them in a linked combination with others. Now combinations like IORING_OP_SENDMSG followed by IORING_OP_SPLICE should be possible. We already handle short reads and writes for the following opcodes: - IORING_OP_READV - IORING_OP_READ_FIXED - IORING_OP_READ - IORING_OP_WRITEV - IORING_OP_WRITE_FIXED - IORING_OP_WRITE - IORING_OP_SPLICE - IORING_OP_TEE Now we have it for these as well: - IORING_OP_SENDMSG - IORING_OP_SEND - IORING_OP_RECVMSG - IORING_OP_RECV For IORING_OP_RECVMSG we also check for the MSG_TRUNC and MSG_CTRUNC flags in order to call req_set_fail_links(). There might be applications arround depending on the behavior that even short send[msg]()/recv[msg]() retuns continue an IOSQE_IO_LINK chain. It's very unlikely that such applications pass in MSG_WAITALL, which is only defined in 'man 2 recvmsg', but not in 'man 2 sendmsg'. It's expected that the low level sock_sendmsg() call just ignores MSG_WAITALL, as MSG_ZEROCOPY is also ignored without explicitly set SO_ZEROCOPY. We also expect the caller to know about the implicit truncation to MAX_RW_COUNT, which we don't detect. cc: netdev@vger.kernel.org Link: https://lore.kernel.org/r/c4e1a4cc0d905314f4d5dc567e65a7b09621aab3.1615908477.git.metze@samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io-wq: ensure task is running before processing task_workJens Axboe2021-03-211-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mark the current task as running if we need to run task_work from the io-wq threads as part of work handling. If that is the case, then return as such so that the caller can appropriately loop back and reset if it was part of a going-to-sleep flush. Fixes: 3bfe6106693b ("io-wq: fork worker threads from original task") Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | signal: don't allow STOP on PF_IO_WORKER threadsEric W. Biederman2021-03-211-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just like we don't allow normal signals to IO threads, don't deliver a STOP to a task that has PF_IO_WORKER set. The IO threads don't take signals in general, and have no means of flushing out a stop either. Longer term, we may want to look into allowing stop of these threads, as it relates to eg process freezing. For now, this prevents a spin issue if a SIGSTOP is delivered to the parent task. Reported-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | signal: don't allow sending any signals to PF_IO_WORKER threadsJens Axboe2021-03-211-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | They don't take signals individually, and even if they share signals with the parent task, don't allow them to be delivered through the worker thread. Linux does allow this kind of behavior for regular threads, but it's really a compatability thing that we need not care about for the IO threads. Reported-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | Merge tag 'staging-5.12-rc4' of ↵Linus Torvalds2021-03-2114-48/+75
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging and IIO driver fixes from Greg KH: "Some small staging and IIO driver fixes: - MAINTAINERS changes for the move of the staging mailing list - comedi driver fixes to get request_irq() to work correctly - counter driver fixes for reported issues with iio devices - tiny iio driver fixes for reported issues. All of these have been in linux-next with no reported problems" * tag 'staging-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: staging: vt665x: fix alignment constraints staging: comedi: cb_pcidas64: fix request_irq() warn staging: comedi: cb_pcidas: fix request_irq() warn MAINTAINERS: move the staging subsystem to lists.linux.dev MAINTAINERS: move some real subsystems off of the staging mailing list iio: gyro: mpu3050: Fix error handling in mpu3050_trigger_handler iio: hid-sensor-temperature: Fix issues of timestamp channel iio: hid-sensor-humidity: Fix alignment issue of timestamp channel counter: stm32-timer-cnt: fix ceiling miss-alignment with reload register counter: stm32-timer-cnt: fix ceiling write max value counter: stm32-timer-cnt: Report count function when SLAVE_MODE_DISABLED iio: adc: ab8500-gpadc: Fix off by 10 to 3 iio:adc:stm32-adc: Add HAS_IOMEM dependency iio: adis16400: Fix an error code in adis16400_initial_setup() iio: adc: adi-axi-adc: add proper Kconfig dependencies iio: adc: ad7949: fix wrong ADC result due to incorrect bit mask iio: hid-sensor-prox: Fix scale not correct issue iio:adc:qcom-spmi-vadc: add default scale to LR_MUX2_BAT_ID channel