summaryrefslogtreecommitdiffstats
path: root/drivers/md/raid1.c
Commit message (Collapse)AuthorAgeFilesLines
* block: remove bio_set_op_attrsChristoph Hellwig2022-12-071-6/+6
| | | | | | | | | | | | This macro is obsolete, so replace the last few uses with open coded bi_opf assignments. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Coly Li <colyli@suse.de <mailto:colyli@suse.de>> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20221206144057.720846-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* md/raid1: stop mdx_raid1 thread when raid1 array run failedJiang Li2022-11-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fail run raid1 array when we assemble array with the inactive disk only, but the mdx_raid1 thread were not stop, Even if the associated resources have been released. it will caused a NULL dereference when we do poweroff. This causes the following Oops: [ 287.587787] BUG: kernel NULL pointer dereference, address: 0000000000000070 [ 287.594762] #PF: supervisor read access in kernel mode [ 287.599912] #PF: error_code(0x0000) - not-present page [ 287.605061] PGD 0 P4D 0 [ 287.607612] Oops: 0000 [#1] SMP NOPTI [ 287.611287] CPU: 3 PID: 5265 Comm: md0_raid1 Tainted: G U 5.10.146 #0 [ 287.619029] Hardware name: xxxxxxx/To be filled by O.E.M, BIOS 5.19 06/16/2022 [ 287.626775] RIP: 0010:md_check_recovery+0x57/0x500 [md_mod] [ 287.632357] Code: fe 01 00 00 48 83 bb 10 03 00 00 00 74 08 48 89 ...... [ 287.651118] RSP: 0018:ffffc90000433d78 EFLAGS: 00010202 [ 287.656347] RAX: 0000000000000000 RBX: ffff888105986800 RCX: 0000000000000000 [ 287.663491] RDX: ffffc90000433bb0 RSI: 00000000ffffefff RDI: ffff888105986800 [ 287.670634] RBP: ffffc90000433da0 R08: 0000000000000000 R09: c0000000ffffefff [ 287.677771] R10: 0000000000000001 R11: ffffc90000433ba8 R12: ffff888105986800 [ 287.684907] R13: 0000000000000000 R14: fffffffffffffe00 R15: ffff888100b6b500 [ 287.692052] FS: 0000000000000000(0000) GS:ffff888277f80000(0000) knlGS:0000000000000000 [ 287.700149] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 287.705897] CR2: 0000000000000070 CR3: 000000000320a000 CR4: 0000000000350ee0 [ 287.713033] Call Trace: [ 287.715498] raid1d+0x6c/0xbbb [raid1] [ 287.719256] ? __schedule+0x1ff/0x760 [ 287.722930] ? schedule+0x3b/0xb0 [ 287.726260] ? schedule_timeout+0x1ed/0x290 [ 287.730456] ? __switch_to+0x11f/0x400 [ 287.734219] md_thread+0xe9/0x140 [md_mod] [ 287.738328] ? md_thread+0xe9/0x140 [md_mod] [ 287.742601] ? wait_woken+0x80/0x80 [ 287.746097] ? md_register_thread+0xe0/0xe0 [md_mod] [ 287.751064] kthread+0x11a/0x140 [ 287.754300] ? kthread_park+0x90/0x90 [ 287.757974] ret_from_fork+0x1f/0x30 In fact, when raid1 array run fail, we need to do md_unregister_thread() before raid1_free(). Signed-off-by: Jiang Li <jiang.li@ugreen.com> Signed-off-by: Song Liu <song@kernel.org>
* md/raid1: Use the new blk_opf_t typeBart Van Assche2022-07-141-2/+2
| | | | | | | | | | Improve static type checking by using the new blk_opf_t type for variables that represent request flags. Acked-by: Song Liu <song@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-35-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* md/core: Combine two sync_page_io() argumentsBart Van Assche2022-07-141-4/+4
| | | | | | | | | | Improve uniformity in the kernel of handling of request operation and flags by passing these as a single argument. Cc: Song Liu <song@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-32-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove bdevnameChristoph Hellwig2022-07-141-1/+1
| | | | | | | | | | | | Replace the remaining calls of bdevname with snprintf using the %pg format specifier. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220713055317.1888500-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* md: remove most calls to bdevnameChristoph Hellwig2022-05-221-14/+10
| | | | | | | Use the %pg format specifier to save on stack consumption and code size. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>
* md: Set MD_BROKEN for RAID1 and RAID10Mariusz Tkaczyk2022-04-251-17/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no direct mechanism to determine raid failure outside personality. It is done by checking rdev->flags after executing md_error(). If "faulty" flag is not set then -EBUSY is returned to userspace. -EBUSY means that array will be failed after drive removal. Mdadm has special routine to handle the array failure and it is executed if -EBUSY is returned by md. There are at least two known reasons to not consider this mechanism as correct: 1. drive can be removed even if array will be failed[1]. 2. -EBUSY seems to be wrong status. Array is not busy, but removal process cannot proceed safe. -EBUSY expectation cannot be removed without breaking compatibility with userspace. In this patch first issue is resolved by adding support for MD_BROKEN flag for RAID1 and RAID10. Support for RAID456 is added in next commit. The idea is to set the MD_BROKEN if we are sure that raid is in failed state now. This is done in each error_handler(). In md_error() MD_BROKEN flag is checked. If is set, then -EBUSY is returned to userspace. As in previous commit, it causes that #mdadm --set-faulty is able to fail array. Previously proposed workaround is valid if optional functionality[1] is disabled. [1] commit 9a567843f7ce("md: allow last device to be forcibly removed from RAID1/RAID10.") Reviewd-by: Xiao Ni <xni@redhat.com> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org>
* block: remove QUEUE_FLAG_DISCARDChristoph Hellwig2022-04-171-15/+1
| | | | | | | | | | | | | | | | | | | Just use a non-zero max_discard_sectors as an indicator for discard support, similar to what is done for write zeroes. The only places where needs special attention is the RAID5 driver, which must clear discard support for security reasons by default, even if the default stacking rules would allow for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd] Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390] Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: add a bdev_nonrot helperChristoph Hellwig2022-04-171-1/+1
| | | | | | | | | | | | Add a helper to check the nonrot flag based on the block_device instead of having to poke into the block layer internal request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: turn bio_kmalloc into a simple kmalloc wrapperChristoph Hellwig2022-04-171-4/+8
| | | | | | | | | | | | | | Remove the magic autofree semantics and require the callers to explicitly call bio_init to initialize the bio. This allows bio_free to catch accidental bio_put calls on bio_init()ed bios as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Coly Li <colyli@suse.de> Acked-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20220406061228.410163-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-blockLinus Torvalds2022-03-261-2/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NVMe write streams removal from Jens Axboe: "This removes the write streams support in NVMe. No vendor ever really shipped working support for this, and they are not interested in supporting it. With the NVMe support gone, we have nothing in the tree that supports this. Remove passing around of the hints. The only discussion point in this patchset imho is the fact that the file specific write hint setting/getting fcntl helpers will now return -1/EINVAL like they did before we supported write hints. No known applications use these functions, I only know of one prototype that I help do for RocksDB, and that's not used. That said, with a change like this, it's always a bit controversial. Alternatively, we could just make them return 0 and pretend it worked. It's placement based hints after all" * tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block: fs: remove fs.f_write_hint fs: remove kiocb.ki_hint block: remove the per-bio/request write hint nvme: remove support or stream based temperature hint
| * block: remove the per-bio/request write hintChristoph Hellwig2022-03-071-2/+0
| | | | | | | | | | | | | | | | | | With the NVMe support for this gone, there are no consumers of these hints left, so remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds2022-03-241-3/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull SCSI updates from James Bottomley: "This series consists of the usual driver updates (qla2xxx, pm8001, libsas, smartpqi, scsi_debug, lpfc, iscsi, mpi3mr) plus minor updates and bug fixes. The high blast radius core update is the removal of write same, which affects block and several non-SCSI devices. The other big change, which is more local, is the removal of the SCSI pointer" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (281 commits) scsi: scsi_ioctl: Drop needless assignment in sg_io() scsi: bsg: Drop needless assignment in scsi_bsg_sg_io_fn() scsi: lpfc: Copyright updates for 14.2.0.0 patches scsi: lpfc: Update lpfc version to 14.2.0.0 scsi: lpfc: SLI path split: Refactor BSG paths scsi: lpfc: SLI path split: Refactor Abort paths scsi: lpfc: SLI path split: Refactor SCSI paths scsi: lpfc: SLI path split: Refactor CT paths scsi: lpfc: SLI path split: Refactor misc ELS paths scsi: lpfc: SLI path split: Refactor VMID paths scsi: lpfc: SLI path split: Refactor FDISC paths scsi: lpfc: SLI path split: Refactor LS_RJT paths scsi: lpfc: SLI path split: Refactor LS_ACC paths scsi: lpfc: SLI path split: Refactor the RSCN/SCR/RDF/EDC/FARPR paths scsi: lpfc: SLI path split: Refactor PLOGI/PRLI/ADISC/LOGO paths scsi: lpfc: SLI path split: Refactor base ELS paths and the FLOGI path scsi: lpfc: SLI path split: Introduce lpfc_prep_wqe scsi: lpfc: SLI path split: Refactor fast and slow paths to native SLI4 scsi: lpfc: SLI path split: Refactor lpfc_iocbq scsi: lpfc: Use kcalloc() ...
| * | scsi: md: Remove WRITE_SAME supportChristoph Hellwig2022-02-221-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | There are no more end-users of REQ_OP_WRITE_SAME left, so we can start deleting it. Link: https://lore.kernel.org/r/20220209082828.2629273-6-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
* | | Merge tag 'for-5.18/drivers-2022-03-18' of git://git.kernel.dk/linux-blockLinus Torvalds2022-03-211-11/+0
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block driver updates from Jens Axboe: - NVMe updates via Christoph: - add vectored-io support for user-passthrough (Kanchan Joshi) - add verbose error logging (Alan Adamson) - support buffered I/O on block devices in nvmet (Chaitanya Kulkarni) - central discovery controller support (Martin Belanger) - fix and extended the globally unique idenfier validation (Christoph) - move away from the deprecated IDA APIs (Sagi Grimberg) - misc code cleanup (Keith Busch, Max Gurtovoy, Qinghua Jin, Chaitanya Kulkarni) - add lockdep annotations for in-kernel sockets (Chris Leech) - use vmalloc for ANA log buffer (Hannes Reinecke) - kerneldoc fixes (Chaitanya Kulkarni) - cleanups (Guoqing Jiang, Chaitanya Kulkarni, Christoph) - warn about shared namespaces without multipathing (Christoph) - MD updates via Song with a set of cleanups (Christoph, Mariusz, Paul, Erik, Dirk) - loop cleanups and queue depth configuration (Chaitanya) - null_blk cleanups and fixes (Chaitanya) - Use descriptive init/exit names in virtio_blk (Randy) - Use bvec_kmap_local() in drivers (Christoph) - bcache fixes (Mingzhe) - xen blk-front persistent grant speedups (Juergen) - rnbd fix and cleanup (Gioh) - Misc fixes (Christophe, Colin) * tag 'for-5.18/drivers-2022-03-18' of git://git.kernel.dk/linux-block: (76 commits) virtio_blk: eliminate anonymous module_init & module_exit nvme: warn about shared namespaces without CONFIG_NVME_MULTIPATH nvme: remove nvme_alloc_request and nvme_alloc_request_qid nvme: cleanup how disk->disk_name is assigned nvmet: move the call to nvmet_ns_changed out of nvmet_ns_revalidate nvmet: use snprintf() with PAGE_SIZE in configfs nvmet: don't fold lines nvmet-rdma: fix kernel-doc warning for nvmet_rdma_device_removal nvmet-fc: fix kernel-doc warning for nvmet_fc_unregister_targetport nvmet-fc: fix kernel-doc warning for nvmet_fc_register_targetport nvme-tcp: lockdep: annotate in-kernel sockets nvme-tcp: don't fold the line nvme-tcp: don't initialize ret variable nvme-multipath: call bio_io_error in nvme_ns_head_submit_bio nvme-multipath: use vmalloc for ANA log buffer xen/blkfront: speed up purge_persistent_grants() raid5: initialize the stripe_head embeeded bios as needed raid5-cache: statically allocate the recovery ra bio raid5-cache: fully initialize flush_bio when needed raid5-ppl: fully initialize the bio in ppl_new_iounit ...
| * | md: raid1/raid10: drop pending_cntMariusz Tkaczyk2022-03-081-11/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Those counters are not necessary after commit 11bb45e8aaf6 ("md: drop queue limitation for RAID1 and RAID10"). Remove them from all code (conf and plug structs). raid1_plug_cb and raid10_plug_cb are identical, so move definition of raid1_plug_cb to common raid1-10 definitions and use it for RAID10 too. Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org>
* | | raid1: stop using bio_devnameChristoph Hellwig2022-03-071-3/+2
|/ / | | | | | | | | | | | | | | | | | | | | Use the %pg format specifier to save on stack consuption and code size. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220304180105.409765-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: pass a block_device to bio_clone_fastChristoph Hellwig2022-02-041-17/+17
| | | | | | | | | | | | | | | | | | | | Pass a block_device to bio_clone_fast and __bio_clone_fast and give the functions more suitable names. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220202160109.108149-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: pass a block_device and opf to bio_resetChristoph Hellwig2022-02-021-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | Pass the block_device that we plan to use this bio for and the operation to bio_reset to optimize the assigment. A NULL block_device can be passed, both for the passthrough case on a raw request_queue and to temporarily avoid refactoring some nasty code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220124091107.642561-20-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | block: pass a block_device and opf to bio_alloc_biosetChristoph Hellwig2022-02-021-1/+2
|/ | | | | | | | | | | | | | | Pass the block_device and operation that we plan to use this bio for to bio_alloc_bioset to optimize the assigment. NULL/0 can be passed, both for the passthrough case on a raw request_queue and to temporarily avoid refactoring some nasty code. Also move the gfp_mask argument after the nr_vecs argument for a much more logical calling convention matching what most of the kernel does. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220124091107.642561-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for-5.17/drivers-2022-01-11' of git://git.kernel.dk/linux-blockLinus Torvalds2022-01-121-27/+56
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block driver updates from Jens Axboe: - mtip32xx pci cleanups (Bjorn) - mtip32xx conversion to generic power management (Vaibhav) - rsxx pci powermanagement cleanups (Bjorn) - Remove the rsxx driver. This hardware never saw much adoption, and it's been end of lifed for a while. (Christoph) - MD pull request from Song: - REQ_NOWAIT support (Vishal Verma) - raid6 benchmark optimization (Dirk Müller) - Fix for acct bioset (Xiao Ni) - Clean up max_queued_requests (Mariusz Tkaczyk) - PREEMPT_RT optimization (Davidlohr Bueso) - Use default_groups in kobj_type (Greg Kroah-Hartman) - Use attribute groups in pktcdvd and rnbd (Greg) - NVMe pull request from Christoph: - increment request genctr on completion (Keith Busch, Geliang Tang) - add a 'iopolicy' module parameter (Hannes Reinecke) - print out valid arguments when reading from /dev/nvme-fabrics (Hannes Reinecke) - Use struct_group() in drbd (Kees) - null_blk fixes (Ming) - Get rid of congestion logic in pktcdvd (Neil) - Floppy ejection hang fix (Tasos) - Floppy max user request size fix (Xiongwei) - Loop locking fix (Tetsuo) * tag 'for-5.17/drivers-2022-01-11' of git://git.kernel.dk/linux-block: (32 commits) md: use default_groups in kobj_type md: Move alloc/free acct bioset in to personality lib/raid6: Use strict priority ranking for pq gen() benchmarking lib/raid6: skip benchmark of non-chosen xor_syndrome functions md: fix spelling of "its" md: raid456 add nowait support md: raid10 add nowait support md: raid1 add nowait support md: add support for REQ_NOWAIT md: drop queue limitation for RAID1 and RAID10 md/raid5: play nice with PREEMPT_RT block/rnbd-clt-sysfs: use default_groups in kobj_type pktcdvd: convert to use attribute groups block: null_blk: only set set->nr_maps as 3 if active poll_queues is > 0 nvme: add 'iopolicy' module parameter nvme: drop unused variable ctrl in nvme_setup_cmd nvme: increment request genctr on completion nvme-fabrics: print out valid arguments when reading from /dev/nvme-fabrics block: remove the rsxx driver rsxx: Drop PCI legacy power management ...
| * md: raid1 add nowait supportVishal Verma2022-01-061-20/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds nowait support to the RAID1 driver. It makes RAID1 driver return with EAGAIN for situations where it could wait for eg: - Waiting for the barrier, wait_barrier() fn is modified to return bool to support error for wait barriers. It returns true in case of wait or if wait is not required and returns false if wait was required but not performed to support nowait. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Vishal Verma <vverma@digitalocean.com> Signed-off-by: Song Liu <song@kernel.org>
| * md: drop queue limitation for RAID1 and RAID10Mariusz Tkaczyk2022-01-061-7/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As suggested by Neil Brown[1], this limitation seems to be deprecated. With plugging in use, writes are processed behind the raid thread and conf->pending_count is not increased. This limitation occurs only if caller doesn't use plugs. It can be avoided and often it is (with plugging). There are no reports that queue is growing to enormous size so remove queue limitation for non-plugged IOs too. [1] https://lore.kernel.org/linux-raid/162496301481.7211.18031090130574610495@noble.neil.brown.name Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org>
* | md/raid1: fix missing bitmap update w/o WriteMostly devicesSong Liu2022-01-031-1/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit [1] causes missing bitmap updates when there isn't any WriteMostly devices. Detailed steps to reproduce by Norbert (which somehow didn't make to lore): # setup md10 (raid1) with two drives (1 GByte sparse files) dd if=/dev/zero of=disk1 bs=1024k seek=1024 count=0 dd if=/dev/zero of=disk2 bs=1024k seek=1024 count=0 losetup /dev/loop11 disk1 losetup /dev/loop12 disk2 mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/loop11 /dev/loop12 # add bitmap (aka write-intent log) mdadm /dev/md10 --grow --bitmap=internal echo check > /sys/block/md10/md/sync_action root:# cat /sys/block/md10/md/mismatch_cnt 0 root:# # remove member drive disk2 (loop12) mdadm /dev/md10 -f loop12 ; mdadm /dev/md10 -r loop12 # modify degraded md device dd if=/dev/urandom of=/dev/md10 bs=512 count=1 # no blocks recorded as out of sync on the remaining member disk1/loop11 root:# mdadm -X /dev/loop11 | grep Bitmap Bitmap : 16 bits (chunks), 0 dirty (0.0%) root:# # re-add disk2, nothing synced because of empty bitmap mdadm /dev/md10 --re-add /dev/loop12 # check integrity again echo check > /sys/block/md10/md/sync_action # disk1 and disk2 are no longer in sync, reads return differend data root:# cat /sys/block/md10/md/mismatch_cnt 128 root:# # clean up mdadm -S /dev/md10 losetup -d /dev/loop11 losetup -d /dev/loop12 rm disk1 disk2 Fix this by moving the WriteMostly check to the if condition for alloc_behind_master_bio(). [1] commit fd3b6975e9c1 ("md/raid1: only allocate write behind bio for WriteMostly device") Fixes: fd3b6975e9c1 ("md/raid1: only allocate write behind bio for WriteMostly device") Cc: stable@vger.kernel.org # v5.12+ Cc: Guoqing Jiang <guoqing.jiang@linux.dev> Cc: Jens Axboe <axboe@kernel.dk> Reported-by: Norbert Warmuth <nwarmuth@t-online.de> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Song Liu <song@kernel.org>
* md/raid1: use rdev in raid1_write_request directlyGuoqing Jiang2021-10-181-6/+5
| | | | | | | | | We already get rdev from conf->mirrors[i].rdev at the beginning of the loop, so just use it. Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* md/raid1: only allocate write behind bio for WriteMostly deviceGuoqing Jiang2021-10-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | Commit 6607cd319b6b91bff94e90f798a61c031650b514 ("raid1: ensure write behind bio has less than BIO_MAX_VECS sectors") tried to guarantee the size of behind bio is not bigger than BIO_MAX_VECS sectors. Unfortunately the same calltrace still could happen since an array could enable write-behind without write mostly device. To match the manpage of mdadm (which says "write-behind is only attempted on drives marked as write-mostly"), we need to check WriteMostly flag to avoid such unexpected behavior. [1]. https://bugzilla.kernel.org/show_bug.cgi?id=213181#c25 Cc: stable@vger.kernel.org # v5.12+ Cc: Jens Stutte <jens@chianterastutte.eu> Reported-by: Jens Stutte <jens@chianterastutte.eu> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-blockLinus Torvalds2021-08-301-0/+19
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block driver updates from Jens Axboe: "Sitting on top of the core block changes, here are the driver changes for the 5.15 merge window: - NVMe updates via Christoph: - suspend improvements for devices with an HMB (Keith Busch) - handle double completions more gacefull (Sagi Grimberg) - cleanup the selects for the nvme core code a bit (Sagi Grimberg) - don't update queue count when failing to set io queues (Ruozhu Li) - various nvmet connect fixes (Amit Engel) - cleanup lightnvm leftovers (Keith Busch, me) - small cleanups (Colin Ian King, Hou Pu) - add tracing for the Set Features command (Hou Pu) - CMB sysfs cleanups (Keith Busch) - add a mutex_destroy call (Keith Busch) - remove lightnvm subsystem. It's served its purpose and ultimately led to zoned nvme support, we no longer need it (Christoph) - revert floppy O_NDELAY fix (Denis) - nbd fixes (Hou, Pavel, Baokun) - nbd locking fixes (Tetsuo) - nbd device removal fixes (Christoph) - raid10 rcu warning fix (Xiao) - raid1 write behind fix (Guoqing) - rnbd fixes (Gioh, Md Haris) - misc fixes (Colin)" * tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-block: (42 commits) Revert "floppy: reintroduce O_NDELAY fix" raid1: ensure write behind bio has less than BIO_MAX_VECS sectors md/raid10: Remove unnecessary rcu_dereference in raid10_handle_discard nbd: remove nbd->destroy_complete nbd: only return usable devices from nbd_find_unused nbd: set nbd->index before releasing nbd_index_mutex nbd: prevent IDR lookups from finding partially initialized devices nbd: reset NBD to NULL when restarting in nbd_genl_connect nbd: add missing locking to the nbd_dev_add error path nvme: remove the unused NVME_NS_* enum nvme: remove nvm_ndev from ns nvme: Have NVME_FABRICS select NVME_CORE instead of transport drivers block: nbd: add sanity check for first_minor nvmet: check that host sqsize does not exceed ctrl MQES nvmet: avoid duplicate qid in connect cmd nvmet: pass back cntlid on successful completion nvme-rdma: don't update queue count when failing to set io queues nvme-tcp: don't update queue count when failing to set io queues nvme-tcp: pair send_mutex init with destroy nvme: allow user toggling hmb usage ...
| * raid1: ensure write behind bio has less than BIO_MAX_VECS sectorsGuoqing Jiang2021-08-271-0/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can't split write behind bio with more than BIO_MAX_VECS sectors, otherwise the below call trace was triggered because we could allocate oversized write behind bio later. [ 8.097936] bvec_alloc+0x90/0xc0 [ 8.098934] bio_alloc_bioset+0x1b3/0x260 [ 8.099959] raid1_make_request+0x9ce/0xc50 [raid1] [ 8.100988] ? __bio_clone_fast+0xa8/0xe0 [ 8.102008] md_handle_request+0x158/0x1d0 [md_mod] [ 8.103050] md_submit_bio+0xcd/0x110 [md_mod] [ 8.104084] submit_bio_noacct+0x139/0x530 [ 8.105127] submit_bio+0x78/0x1d0 [ 8.106163] ext4_io_submit+0x48/0x60 [ext4] [ 8.107242] ext4_writepages+0x652/0x1170 [ext4] [ 8.108300] ? do_writepages+0x41/0x100 [ 8.109338] ? __ext4_mark_inode_dirty+0x240/0x240 [ext4] [ 8.110406] do_writepages+0x41/0x100 [ 8.111450] __filemap_fdatawrite_range+0xc5/0x100 [ 8.112513] file_write_and_wait_range+0x61/0xb0 [ 8.113564] ext4_sync_file+0x73/0x370 [ext4] [ 8.114607] __x64_sys_fsync+0x33/0x60 [ 8.115635] do_syscall_64+0x33/0x40 [ 8.116670] entry_SYSCALL_64_after_hwframe+0x44/0xae Thanks for the comment from Christoph. [1]. https://bugs.archlinux.org/task/70992 Cc: stable@vger.kernel.org # v5.12+ Reported-by: Jens Stutte <jens@chianterastutte.eu> Tested-by: Jens Stutte <jens@chianterastutte.eu> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <songliubraving@fb.com>
* | Merge branch 'md-fixes' of ↵Jens Axboe2021-08-041-2/+0
|\ \ | |/ |/| | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-5.14 * 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md/raid10: properly indicate failure when ending a failed write request
| * md/raid10: properly indicate failure when ending a failed write requestWei Shuyu2021-07-231-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | Similar to [1], this patch fixes the same bug in raid10. Also cleanup the comments. [1] commit 2417b9869b81 ("md/raid1: properly indicate failure when ending a failed write request") Cc: stable@vger.kernel.org Fixes: 7cee6d4e6035 ("md/raid10: end bio when the device faulty") Signed-off-by: Wei Shuyu <wsy@dogben.com> Acked-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
* | md/raid1: enable io accountingGuoqing Jiang2021-06-141-0/+7
| | | | | | | | | | | | | | | | | | | | For raid1, we record the start time between split bio and clone bio, and finish the accounting in the final endio. Also introduce start_time in r1bio accordingly. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
* | md/raid1: rename print_msg with r1bio_existedGuoqing Jiang2021-06-141-4/+4
|/ | | | | | | | | The caller of raid1_read_request could pass NULL or a valid pointer for "struct r1bio *r1_bio", so it actually means whether r1_bio is existed or not. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
* md/raid1: properly indicate failure when ending a failed write requestPaul Clements2021-04-231-0/+2
| | | | | | | | | | | | | This patch addresses a data corruption bug in raid1 arrays using bitmaps. Without this fix, the bitmap bits for the failed I/O end up being cleared. Since we are in the failure leg of raid1_end_write_request, the request either needs to be retried (R1BIO_WriteError) or failed (R1BIO_Degraded). Fixes: eeba6809d8d5 ("md/raid1: end bio when the device faulty") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Paul Clements <paul.clements@us.sios.com> Signed-off-by: Song Liu <song@kernel.org>
* md: remove bio_alloc_mddevChristoph Hellwig2021-01-271-1/+1
| | | | | | | | | | | | | bio_alloc_mddev is never called with a NULL mddev, and ->bio_set is initialized in md_run, so it always must be initialized as well. Just open code the remaining call to bio_alloc_bioset. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: store a block_device pointer in struct bioChristoph Hellwig2021-01-241-3/+3
| | | | | | | | | | | Replace the gendisk pointer in struct bio with a pointer to the newly improved struct block device. From that the gendisk can be trivially accessed with an extra indirection, but it also allows to directly look up all information related to partition remapping. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove the request_queue argument to the block_bio_remap tracepointChristoph Hellwig2020-12-041-4/+3
| | | | | | | | | | | The request_queue can trivially be derived from the bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* writeback: remove bdi->congested_fnChristoph Hellwig2020-07-081-31/+0
| | | | | | | | | | | | | Except for pktdvd, the only places setting congested bits are file systems that allocate their own backing_dev_info structures. And pktdvd is a deprecated driver that isn't useful in stack setup either. So remove the dead congested_fn stacking infrastructure. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Acked-by: David Sterba <dsterba@suse.com> [axboe: fixup unused variables in bcache/request.c] Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: rename generic_make_request to submit_bio_noacctChristoph Hellwig2020-07-011-7/+7
| | | | | | | | | generic_make_request has always been very confusingly misnamed, so rename it to submit_bio_noacct to make it clear that it is submit_bio minus accounting and a few checks. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* md/raid1: release pending accounting for an I/O only after write-behind is ↵David Jeffery2020-05-131-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | also finished When using RAID1 and write-behind, md can deadlock when errors occur. With write-behind, r1bio structs can be accounted by raid1 as queued but not counted as pending. The pending count is dropped when the original bio is returned complete but write-behind for the r1bio may still be active. This breaks the accounting used in some conditions to know when the raid1 md device has reached an idle state. It can result in calls to freeze_array deadlocking. freeze_array will never complete from a negative "unqueued" value being calculated due to a queued count larger than the pending count. To properly account for write-behind, move the call to allow_barrier from call_bio_endio to raid_end_bio_io. When using write-behind, md can call call_bio_endio before all write-behind I/O is complete. Using raid_end_bio_io for the point to call allow_barrier will release the pending count at a point where all I/O for an r1bio, even write-behind, is done. Signed-off-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* md/raid1: introduce wait_for_serializationGuoqing Jiang2020-01-131-19/+22
| | | | | | | | | | | | Previously, we call check_and_add_serial when serialization is enabled for write IO, but it could allocate and free memory back and forth. Now, let's just get an element from memory pool with the new function, then insert node to rb tree if no collision happens. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* md/raid1: use bucket based mechanism for IO serializationGuoqing Jiang2020-01-131-3/+6
| | | | | | | | | | | | | | | | | Since raid1 had already used bucket based mechanism to reduce the conflict between write IO and resync IO, it is possible to speed up performance for io serialization with refer to the same mechanism. To align with the barrier bucket mechanism, we created arrays (with the same number of BARRIER_BUCKETS_NR) for spinlock, rb tree and waitqueue. Then we can reduce lock competition with multiple spinlocks, boost search performance with multiple rb trees and also reduce thundering herd problem with multiple waitqueues. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* md: introduce a new struct for IO serializationGuoqing Jiang2020-01-131-28/+33
| | | | | | | | | | | | | | | | | | | | | | | Obviously, IO serialization could cause the degradation of performance a lot. In order to reduce the degradation, so a rb interval tree is added in raid1 to speed up the check of collision. So, a rb root is needed in md_rdev, then abstract all the serialize related members to a new struct (serial_in_rdev), embed it into md_rdev. Of course, we need to free the struct if it is not needed anymore, so rdev/rdevs_uninit_serial are added accordingly. And they should be called when destroty memory pool or can't alloc memory. And we need to consider to call mddev_destroy_serial_pool in case serialize_policy/write-behind is disabled, bitmap is destroyed or in __md_stop_writes. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* raid1: serialize the overlap writeGuoqing Jiang2020-01-131-14/+13
| | | | | | | | | | Before dispatch write bio, raid1 array which enables serialize_policy need to check if overlap exists between this bio and previous on-flying bios. If there is overlap, then it has to wait until the collision is disappeared. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* md: rename wb stuffsGuoqing Jiang2020-01-131-21/+22
| | | | | | | | | | | | | Previously, wb_info_pool and wb_list stuffs are introduced to address potential data inconsistence issue for write behind device. Now rename them to serial related name, since the same mechanism will be used to address reorder overlap write issue for raid1. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* md: raid1: check rdev before reference in raid1_sync_request funcZhiqiang Liu2019-12-111-1/+1
| | | | | | | In raid1_sync_request func, rdev should be checked before reference. Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* md/raid1: avoid soft lockup under high loadHannes Reinecke2019-11-111-0/+1
| | | | | | | | As all I/O is being pushed through a kernel thread the softlockup watchdog might be triggered under high load. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com>
* md: improve handling of bio with REQ_PREFLUSH in md_flush_request()David Jeffery2019-10-241-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | If pers->make_request fails in md_flush_request(), the bio is lost. To fix this, pass back a bool to indicate if the original make_request call should continue to handle the I/O and instead of assuming the flush logic will push it to completion. Convert md_flush_request to return a bool and no longer calls the raid driver's make_request function. If the return is true, then the md flush logic has or will complete the bio and the md make_request call is done. If false, then the md make_request function needs to keep processing like it is a normal bio. Let the original call to md_handle_request handle any need to retry sending the bio to the raid driver's make_request function should it be needed. Also mark md_flush_request and the make_request function pointer as __must_check to issue warnings should these critical return values be ignored. Fixes: 2bc13b83e629 ("md: batch flush requests.") Cc: stable@vger.kernel.org # # v4.19+ Cc: NeilBrown <neilb@suse.com> Signed-off-by: David Jeffery <djeffery@redhat.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* md/raid1: fail run raid1 array when active disk less than oneYufen Yu2019-09-031-1/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When run test case: mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] --assume-clean --bitmap=internal mdadm -S /dev/md1 mdadm -A /dev/md1 /dev/sd[b-c] --run --force mdadm --zero /dev/sda mdadm /dev/md1 -a /dev/sda echo offline > /sys/block/sdc/device/state echo offline > /sys/block/sdb/device/state sleep 5 mdadm -S /dev/md1 echo running > /sys/block/sdb/device/state echo running > /sys/block/sdc/device/state mdadm -A /dev/md1 /dev/sd[a-c] --run --force mdadm run fail with kernel message as follow: [ 172.986064] md: kicking non-fresh sdb from array! [ 173.004210] md: kicking non-fresh sdc from array! [ 173.022383] md/raid1:md1: active with 0 out of 4 mirrors [ 173.022406] md1: failed to create bitmap (-5) In fact, when active disk in raid1 array less than one, we need to return fail in raid1_run(). Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* raid1: factor out a common routine to handle the completion of sync writeHou Tao2019-08-071-21/+18
| | | | | | | It's just code clean-up. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
* md: allow last device to be forcibly removed from RAID1/RAID10.Guoqing Jiang2019-08-071-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | When the 'last' device in a RAID1 or RAID10 reports an error, we do not mark it as failed. This would serve little purpose as there is no risk of losing data beyond that which is obviously lost (as there is with RAID5), and there could be other sectors on the device which are readable, and only readable from this device. This in general this maximises access to data. However the current implementation also stops an admin from removing the last device by direct action. This is rarely useful, but in many case is not harmful and can make automation easier by removing special cases. Also, if an attempt to write metadata fails the device must be marked as faulty, else an infinite loop will result, attempting to update the metadata on all non-faulty devices. So add 'fail_last_dev' member to 'struct mddev', then we can bypasses the 'last disk' checks for RAID1 and RAID10, and control the behavior per array by change sysfs node. Signed-off-by: NeilBrown <neilb@suse.de> [add sysfs node for fail_last_dev by Guoqing] Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>