summaryrefslogtreecommitdiffstats
path: root/drivers/md
Commit message (Collapse)AuthorAgeFilesLines
* block: Generic bio chainingKent Overstreet2013-11-235-4/+15
| | | | | | | | | | | | | | | | | This adds a generic mechanism for chaining bio completions. This is going to be used for a bio_split() replacement, and it turns out to be very useful in a fair amount of driver code - a fair number of drivers were implementing this in their own roundabout ways, often painfully. Note that this means it's no longer to call bio_endio() more than once on the same bio! This can cause problems for drivers that save/restore bi_end_io. Arguably they shouldn't be saving/restoring bi_end_io at all - in all but the simplest cases they'd be better off just cloning the bio, and immutable biovecs is making bio cloning cheaper. But for now, we add a bio_endio_nodec() for these cases. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk>
* block: Remove bi_idx hacksKent Overstreet2013-11-231-45/+2
| | | | | | | | | | | | | | | | Now that drivers have been converted to the new bvec_iter primitives, there's no need to trim the bvec before we submit it; and we can't trim it once we start sharing bvecs. It used to be that passing a partially completed bio (i.e. one with nonzero bi_idx) to generic_make_request() was a dangerous thing - various drivers would choke on such things. But with immutable biovecs and our new bio splitting that shares the biovecs, submitting partially completed bios has to work (and should work, now that all the drivers have been completed to the new primitives) Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk>
* dm: Refactor for new bio cloning/splittingKent Overstreet2013-11-232-179/+20
| | | | | | | | | | | | | | | | | | | We need to convert the dm code to the new bvec_iter primitives which respect bi_bvec_done; they also allow us to drastically simplify dm's bio splitting code. Also, it's no longer necessary to save/restore the bvec array anymore - driver conversions for immutable bvecs are done, so drivers should never be modifying it. Also kill bio_sector_offset(), dm was the only user and it doesn't make much sense anymore. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alasdair Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Reviewed-by: Mike Snitzer <snitzer@redhat.com>
* block: Add bio_clone_fast()Kent Overstreet2013-11-231-6/+2
| | | | | | | | | | bio_clone() just got more expensive - however, most users of bio_clone() don't actually need to modify the biovec. If they aren't modifying the biovec, and they can guarantee that the original bio isn't freed before the clone (also true in most cases), we can just point the clone at the original bio's biovec. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* block: Convert drivers to immutable biovecsKent Overstreet2013-11-234-87/+53
| | | | | | | | | | | | Now that we've got a mechanism for immutable biovecs - bi_iter.bi_bvec_done - we need to convert drivers to use primitives that respect it instead of using the bvec array directly. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: NeilBrown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: dm-devel@redhat.com
* block: Kill bio_segments()/bi_vcnt usageKent Overstreet2013-11-233-32/+25
| | | | | | | | | | | | | | | | | | | | | | When we start sharing biovecs, keeping bi_vcnt accurate for splits is going to be error prone - and unnecessary, if we refactor some code. So bio_segments() has to go - but most of the existing users just needed to know if the bio had multiple segments, which is easier - add a bio_multiple_segments() for them. (Two of the current uses of bio_segments() are going to go away in a couple patches, but the current implementation of bio_segments() is unsafe as soon as we start doing driver conversions for immutable biovecs - so implement a dumb version for bisectability, it'll go away in a couple patches) Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com> Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
* block: Convert bio_for_each_segment() to bvec_iterKent Overstreet2013-11-235-70/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | More prep work for immutable biovecs - with immutable bvecs drivers won't be able to use the biovec directly, they'll need to use helpers that take into account bio->bi_iter.bi_bvec_done. This updates callers for the new usage without changing the implementation yet. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Paul Clements <Paul.Clements@steeleye.com> Cc: Jim Paris <jim@jtan.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com> Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com> Cc: support@lsi.com Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jan Kara <jack@suse.cz> Cc: linux-m68k@lists.linux-m68k.org Cc: linuxppc-dev@lists.ozlabs.org Cc: drbd-user@lists.linbit.com Cc: nbd-general@lists.sourceforge.net Cc: cbe-oss-dev@lists.ozlabs.org Cc: xen-devel@lists.xensource.com Cc: virtualization@lists.linux-foundation.org Cc: linux-raid@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: DL-MPTFusionLinux@lsi.com Cc: linux-scsi@vger.kernel.org Cc: devel@driverdev.osuosl.org Cc: linux-fsdevel@vger.kernel.org Cc: cluster-devel@redhat.com Cc: linux-mm@kvack.org Acked-by: Geoff Levand <geoff@infradead.org>
* block: Convert bio_iovec() to bvec_iterKent Overstreet2013-11-232-7/+8
| | | | | | | | | | | | | | | | | | | | For immutable biovecs, we'll be introducing a new bio_iovec() that uses our new bvec iterator to construct a biovec, taking into account bvec_iter->bi_bvec_done - this patch updates existing users for the new usage. Some of the existing users really do need a pointer into the bvec array - those uses are all going to be removed, but we'll need the functionality from immutable to remove them - so for now rename the existing bio_iovec() -> __bio_iovec(), and it'll be removed in a couple patches. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
* dm: Use bvec_iter for dm_bio_record()Kent Overstreet2013-11-231-9/+3
| | | | | | | | | | | This patch doesn't itself have any functional changes, but immutable biovecs are going to add a bi_bvec_done member to bi_iter, which will need to be saved too here. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: dm-devel@redhat.com Reviewed-by: Mike Snitzer <snitzer@redhat.com>
* block: Abstract out bvec iteratorKent Overstreet2013-11-2335-300/+333
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Immutable biovecs are going to require an explicit iterator. To implement immutable bvecs, a later patch is going to add a bi_bvec_done member to this struct; for now, this patch effectively just renames things. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Benny Halevy <bhalevy@tonian.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchand@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Peng Tao <tao.peng@emc.com> Cc: Andy Adamson <andros@netapp.com> Cc: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de>6
* bcache: Kill unaligned bvec hackKent Overstreet2013-11-233-35/+7
| | | | | | | | | | Bcache has a hack to avoid cloning the biovec if it's all full pages - but with immutable biovecs coming this won't be necessary anymore. For now, we remove the special case and always clone the bvec array so that the immutable biovec patches are simpler. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* block: submit_bio_wait() conversionsKent Overstreet2013-11-231-13/+1
| | | | | | | | | | | | It was being open coded in a few places. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: Chris Mason <chris.mason@fusionio.com> Acked-by: NeilBrown <neilb@suse.de>
* Merge tag 'md/3.13' of git://neil.brown.name/mdLinus Torvalds2013-11-206-186/+566
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull md update from Neil Brown: "Mostly optimisations and obscure bug fixes. - raid5 gets less lock contention - raid1 gets less contention between normal-io and resync-io during resync" * tag 'md/3.13' of git://neil.brown.name/md: md/raid5: Use conf->device_lock protect changing of multi-thread resources. md/raid5: Before freeing old multi-thread worker, it should flush them. md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE. UAPI: include <asm/byteorder.h> in linux/raid/md_p.h raid1: Rewrite the implementation of iobarrier. raid1: Add some macros to make code clearly. raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array. raid1: Add a field array_frozen to indicate whether raid in freeze state. md: Convert use of typedef ctl_table to struct ctl_table md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes. md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread. md: fix some places where mddev_lock return value is not checked. raid5: Retry R5_ReadNoMerge flag when hit a read error. raid5: relieve lock contention in get_active_stripe() raid5: relieve lock contention in get_active_stripe() wait: add wait_event_cmd() md/raid5.c: add proper locking to error path of raid5_start_reshape. md: fix calculation of stacking limits on level change. raid5: Use slow_path to release stripe when mddev->thread is null
| * md/raid5: Use conf->device_lock protect changing of multi-thread resources.majianpeng2013-11-191-24/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we change group_thread_cnt from sysfs entry, it can OOPS. The kernel messages are: [ 135.299021] BUG: unable to handle kernel NULL pointer dereference at (null) [ 135.299073] IP: [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440 [ 135.299107] PGD 0 [ 135.299122] Oops: 0000 [#1] SMP [ 135.299144] Modules linked in: netconsole e1000e ptp pps_core [ 135.299188] CPU: 3 PID: 2225 Comm: md0_raid5 Not tainted 3.12.0+ #24 [ 135.299214] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015 11/09/2011 [ 135.299255] task: ffff8800b9638f80 ti: ffff8800b77a4000 task.ti: ffff8800b77a4000 [ 135.299283] RIP: 0010:[<ffffffff815188ab>] [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440 [ 135.299323] RSP: 0018:ffff8800b77a5c48 EFLAGS: 00010002 [ 135.299344] RAX: ffff880037bb5c70 RBX: 0000000000000000 RCX: 0000000000000008 [ 135.299371] RDX: ffff880037bb5cb8 RSI: 0000000000000001 RDI: ffff880037bb5c00 [ 135.299398] RBP: ffff8800b77a5d08 R08: 0000000000000001 R09: 0000000000000000 [ 135.299425] R10: ffff8800b77a5c98 R11: 00000000ffffffff R12: ffff880037bb5c00 [ 135.299452] R13: 0000000000000000 R14: 0000000000000000 R15: ffff880037bb5c70 [ 135.299479] FS: 0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000 [ 135.299510] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 135.299532] CR2: 0000000000000000 CR3: 0000000001c0b000 CR4: 00000000000407e0 [ 135.299559] Stack: [ 135.299570] ffff8800b77a5c88 ffffffff8107383e ffff8800b77a5c88 ffff880037a64300 [ 135.299611] 000000000000ec08 ffff880037bb5cb8 ffff8800b77a5c98 ffffffffffffffd8 [ 135.299654] 000000000000ec08 ffff880037bb5c60 ffff8800b77a5c98 ffff8800b77a5c98 [ 135.299696] Call Trace: [ 135.299711] [<ffffffff8107383e>] ? __wake_up+0x4e/0x70 [ 135.299733] [<ffffffff81518f88>] raid5d+0x4c8/0x680 [ 135.299756] [<ffffffff817174ed>] ? schedule_timeout+0x15d/0x1f0 [ 135.299781] [<ffffffff81524c9f>] md_thread+0x11f/0x170 [ 135.299804] [<ffffffff81069cd0>] ? wake_up_bit+0x40/0x40 [ 135.299826] [<ffffffff81524b80>] ? md_rdev_init+0x110/0x110 [ 135.299850] [<ffffffff81069656>] kthread+0xc6/0xd0 [ 135.299871] [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70 [ 135.299899] [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0 [ 135.299923] [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70 [ 135.299951] Code: ff ff ff 0f 84 d7 fe ff ff e9 5c fe ff ff 66 90 41 8b b4 24 d8 01 00 00 45 31 ed 85 f6 0f 8e 7b fd ff ff 49 8b 9c 24 d0 01 00 00 <48> 3b 1b 49 89 dd 0f 85 67 fd ff ff 48 8d 43 28 31 d2 eb 17 90 [ 135.300005] RIP [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440 [ 135.300005] RSP <ffff8800b77a5c48> [ 135.300005] CR2: 0000000000000000 [ 135.300005] ---[ end trace 504854e5bb7562ed ]--- [ 135.300005] Kernel panic - not syncing: Fatal exception This is because raid5d() can be running when the multi-thread resources are changed via system. We see need to provide locking. mddev->device_lock is suitable, but we cannot simple call alloc_thread_groups under this lock as we cannot allocate memory while holding a spinlock. So change alloc_thread_groups() to allocate and return the data structures, then raid5_store_group_thread_cnt() can take the lock while updating the pointers to the data structures. This fixes a bug introduced in 3.12 and so is suitable for the 3.12.x stable series. Fixes: b721420e8719131896b009b11edbbd27 Cc: stable@vger.kernel.org (3.12) Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Shaohua Li <shli@kernel.org>
| * md/raid5: Before freeing old multi-thread worker, it should flush them.majianpeng2013-11-191-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When changing group_thread_cnt from sysfs entry, the kernel can oops. The kernel messages are: [ 740.961389] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 [ 740.961444] IP: [<ffffffff81062570>] process_one_work+0x30/0x500 [ 740.961476] PGD b9013067 PUD b651e067 PMD 0 [ 740.961503] Oops: 0000 [#1] SMP [ 740.961525] Modules linked in: netconsole e1000e ptp pps_core [ 740.961577] CPU: 0 PID: 3683 Comm: kworker/u8:5 Not tainted 3.12.0+ #23 [ 740.961602] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015 11/09/2011 [ 740.961646] task: ffff88013abe0000 ti: ffff88013a246000 task.ti: ffff88013a246000 [ 740.961673] RIP: 0010:[<ffffffff81062570>] [<ffffffff81062570>] process_one_work+0x30/0x500 [ 740.961708] RSP: 0018:ffff88013a247e08 EFLAGS: 00010086 [ 740.961730] RAX: ffff8800b912b400 RBX: ffff88013a61e680 RCX: ffff8800b912b400 [ 740.961757] RDX: ffff8800b912b600 RSI: ffff8800b912b600 RDI: ffff88013a61e680 [ 740.961782] RBP: ffff88013a247e48 R08: ffff88013a246000 R09: 000000000002c09d [ 740.961808] R10: 000000000000010f R11: 0000000000000000 R12: ffff88013b00cc00 [ 740.961833] R13: 0000000000000000 R14: ffff88013b00cf80 R15: ffff88013a61e6b0 [ 740.961861] FS: 0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000 [ 740.961893] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 740.962001] CR2: 00000000000000b8 CR3: 00000000b24fe000 CR4: 00000000000407f0 [ 740.962001] Stack: [ 740.962001] 0000000000000008 ffff8800b912b600 ffff88013b00cc00 ffff88013a61e680 [ 740.962001] ffff88013b00cc00 ffff88013b00cc18 ffff88013b00cf80 ffff88013a61e6b0 [ 740.962001] ffff88013a247eb8 ffffffff810639c6 0000000000012a80 ffff88013a247fd8 [ 740.962001] Call Trace: [ 740.962001] [<ffffffff810639c6>] worker_thread+0x206/0x3f0 [ 740.962001] [<ffffffff810637c0>] ? manage_workers+0x2c0/0x2c0 [ 740.962001] [<ffffffff81069656>] kthread+0xc6/0xd0 [ 740.962001] [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70 [ 740.962001] [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0 [ 740.962001] [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70 [ 740.962001] Code: 89 e5 41 57 41 56 41 55 45 31 ed 41 54 53 48 89 fb 48 83 ec 18 48 8b 06 4c 8b 67 48 48 89 c1 30 c9 a8 04 4c 0f 45 e9 80 7f 58 00 <49> 8b 45 08 44 8b b0 00 01 00 00 78 0c 41 f6 44 24 10 04 0f 84 [ 740.962001] RIP [<ffffffff81062570>] process_one_work+0x30/0x500 [ 740.962001] RSP <ffff88013a247e08> [ 740.962001] CR2: 0000000000000008 [ 740.962001] ---[ end trace 39181460000748de ]--- [ 740.962001] Kernel panic - not syncing: Fatal exception This can happen if there are some stripes left, fewer than MAX_STRIPE_BATCH. A worker is queued to handle them. But before calling raid5_do_work, raid5d handles those stripes making conf->active_stripe = 0. So mddev_suspend() can return. We might then free old worker resources before the queued raid5_do_work() handled them. When it runs, it crashes. raid5d() raid5_store_group_thread_cnt() queue_work mddev_suspend() handle_strips active_stripe=0 free(old worker resources) process_one_work raid5_do_work To avoid this, we should only flush the worker resources before freeing them. This fixes a bug introduced in 3.12 so is suitable for the 3.12.x stable series. Cc: stable@vger.kernel.org (3.12) Fixes: b721420e8719131896b009b11edbbd27 Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Shaohua Li <shli@kernel.org>
| * md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE.majianpeng2013-11-191-1/+1
| | | | | | | | | | | | | | | | | | For R5_ReadNoMerge,it mean this bio can't merge with other bios or request.It used REQ_FLUSH to achieve this. But REQ_NOMERGE can do the same work. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * raid1: Rewrite the implementation of iobarrier.majianpeng2013-11-192-13/+129
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is an iobarrier in raid1 because of contention between normal IO and resync IO. It suspends all normal IO when resync/recovery happens. However if normal IO is out side the resync window, there is no contention. So this patch changes the barrier mechanism to only block IO that could contend with the resync that is currently happening. We partition the whole space into five parts. |---------|-----------|------------|----------------|-------| start next_resync start_next_window end_window start + RESYNC_WINDOW = next_resync next_resync + NEXT_NORMALIO_DISTANCE = start_next_window start_next_window + NEXT_NORMALIO_DISTANCE = end_window Firstly we introduce some concepts: 1 - RESYNC_WINDOW: For resync, there are 32 resync requests at most at the same time. A sync request is RESYNC_BLOCK_SIZE(64*1024). So the RESYNC_WINDOW is 32 * RESYNC_BLOCK_SIZE, that is 2MB. 2 - NEXT_NORMALIO_DISTANCE: the distance between next_resync and start_next_window. It also indicates the distance between start_next_window and end_window. It is currently 3 * RESYNC_WINDOW_SIZE but could be tuned if this turned out not to be optimal. 3 - next_resync: the next sector at which we will do sync IO. 4 - start: a position which is at most RESYNC_WINDOW before next_resync. 5 - start_next_window: a position which is NEXT_NORMALIO_DISTANCE beyond next_resync. Normal-io after this position doesn't need to wait for resync-io to complete. 6 - end_window: a position which is 2 * NEXT_NORMALIO_DISTANCE beyond next_resync. This also doesn't need to wait, but is counted differently. 7 - current_window_requests: the count of normalIO between start_next_window and end_window. 8 - next_window_requests: the count of normalIO after end_window. NormalIO will be partitioned into four types: NormIO1: the end sector of bio is smaller or equal the start NormIO2: the start sector of bio larger or equal to end_window NormIO3: the start sector of bio larger or equal to start_next_window. NormIO4: the location between start_next_window and end_window |--------|-----------|--------------------|----------------|-------------| | start | next_resync | start_next_window | end_window | NormIO1 NormIO4 NormIO4 NormIO3 NormIO2 For NormIO1, we don't need any io barrier. For NormIO4, we used a similar approach to the original iobarrier mechanism. The normalIO and resyncIO must be kept separate. For NormIO2/3, we add two fields to struct r1conf: "current_window_requests" and "next_window_requests". They indicate the count of active requests in the two window. For these, we don't wait for resync io to complete. For resync action, if there are NormIO4s, we must wait for it. If not, we can proceed. But if resync action reaches start_next_window and current_window_requests > 0 (that is there are NormIO3s), we must wait until the current_window_requests becomes zero. When current_window_requests becomes zero, start_next_window also moves forward. Then current_window_requests will replaced by next_window_requests. There is a problem which when and how to change from NormIO2 to NormIO3. Only then can sync action progress. We add a field in struct r1conf "start_next_window". A: if start_next_window == MaxSector, it means there are no NormIO2/3. So start_next_window = next_resync + NEXT_NORMALIO_DISTANCE B: if current_window_requests == 0 && next_window_requests != 0, it means start_next_window move to end_window There is another problem which how to differentiate between old NormIO2(now it is NormIO3) and NormIO2. For example, there are many bios which are NormIO2 and a bio which is NormIO3. NormIO3 firstly completed, so the bios of NormIO2 became NormIO3. We add a field in struct r1bio "start_next_window". This is used to record the position conf->start_next_window when the call to wait_barrier() is made in make_request(). In allow_barrier(), we check the conf->start_next_window. If r1bio->stat_next_window == conf->start_next_window, it means there is no transition between NormIO2 and NormIO3. If r1bio->start_next_window != conf->start_next_window, it mean there was a transition between NormIO2 and NormIO3. There can only have been one transition. So it only means the bio is old NormIO2. For one bio, there may be many r1bio's. So we make sure all the r1bio->start_next_window are the same value. If we met blocked_dev in make_request(), it must call allow_barrier and wait_barrier. So the former and the later value of conf->start_next_window will be change. If there are many r1bio's with differnet start_next_window, for the relevant bio, it depend on the last value of r1bio. It will cause error. To avoid this, we must wait for previous r1bios to complete. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * raid1: Add some macros to make code clearly.majianpeng2013-11-191-4/+4
| | | | | | | | | | | | | | | | In a subsequent patch, we'll use some const parameters. Using macros will make the code clearly. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array ↵majianpeng2013-11-191-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | when reconfiguring the array. We used to use raise_barrier to suspend normal IO while we reconfigure the array. However raise_barrier will soon only suspend some normal IO, not all. So we need something else. Change it to use freeze_array. But freeze_array not only suspends normal io, it also suspends resync io. For the place where call raise_barrier for reconfigure, it isn't a problem. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * raid1: Add a field array_frozen to indicate whether raid in freeze state.majianpeng2013-11-192-8/+8
| | | | | | | | | | | | | | | | | | Because the following patch will rewrite the content between normal IO and resync IO. So we used a parameter to indicate whether raid is in freeze array. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * md: Convert use of typedef ctl_table to struct ctl_tableJoe Perches2013-11-191-3/+3
| | | | | | | | | | | | | | This typedef is unnecessary and should just be removed. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid5: avoid deadlock when raid5 array has unack badblocks during ↵NeilBrown2013-11-191-19/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | md_stop_writes. When raid5 recovery hits a fresh badblock, this badblock will flagged as unack badblock until md_update_sb() is called. But md_stop will take reconfig lock which means raid5d can't call md_update_sb() in md_check_recovery(), the badblock will always be unack, so raid5d thread enters an infinite loop and md_stop_write() can never stop sync_thread. This causes deadlock. To solve this, when STOP_ARRAY ioctl is issued and sync_thread is running, we need set md->recovery FROZEN and INTR flags and wait for sync_thread to stop before we (re)take reconfig lock. This requires that raid5 reshape_request notices MD_RECOVERY_INTR (which it probably should have noticed anyway) and stops waiting for a metadata update in that case. Reported-by: Jianpeng Ma <majianpeng@gmail.com> Reported-by: Bian Yu <bianyu@kedacom.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.NeilBrown2013-11-193-31/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently use kthread_should_stop() in various places in the sync/reshape code to abort early. However some places set MD_RECOVERY_INTR but don't immediately call md_reap_sync_thread() (and we will shortly get another one). When this happens we are relying on md_check_recovery() to reap the thread and that only happen when it finishes normally. So MD_RECOVERY_INTR must lead to a normal finish without the kthread_should_stop() test. So replace all relevant tests, and be more careful when the thread is interrupted not to acknowledge that latest step in a reshape as it may not be fully committed yet. Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop so we don't wait have to wait for the speed to drop before we can abort. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: fix some places where mddev_lock return value is not checked.NeilBrown2013-11-191-5/+13
| | | | | | | | | | | | | | | | Sometimes we need to lock and mddev and cannot cope with failure due to interrupt. In these cases we should use mutex_lock, not mutex_lock_interruptible. Signed-off-by: NeilBrown <neilb@suse.de>
| * raid5: Retry R5_ReadNoMerge flag when hit a read error.Bian Yu2013-11-191-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because of block layer merge, one bio fails will cause other bios which belongs to the same request fails, so raid5_end_read_request will record all these bios as badblocks. If retry request with R5_ReadNoMerge flag to avoid bios merge, badblocks can only record sector which is bad exactly. test: hdparm --yes-i-know-what-i-am-doing --make-bad-sector 300000 /dev/sdb mdadm -C /dev/md0 -l5 -n3 /dev/sd[bcd] --assume-clean mdadm /dev/md0 -f /dev/sdd mdadm /dev/md0 -r /dev/sdd mdadm --zero-superblock /dev/sdd mdadm /dev/md0 -a /dev/sdd 1. Without this patch: cat /sys/block/md0/md/rd*/bad_blocks 299776 256 299776 256 2. With this patch: cat /sys/block/md0/md/rd*/bad_blocks 300000 8 300000 8 Signed-off-by: Bian Yu <bianyu@kedacom.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * raid5: relieve lock contention in get_active_stripe()Shaohua Li2013-11-192-1/+8
| | | | | | | | | | | | | | | | track empty inactive list count, so md_raid5_congested() can use it to make decision. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * raid5: relieve lock contention in get_active_stripe()Shaohua Li2013-11-142-73/+259
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | get_active_stripe() is the last place we have lock contention. It has two paths. One is stripe isn't found and new stripe is allocated, the other is stripe is found. The first path basically calls __find_stripe and init_stripe. It accesses conf->generation, conf->previous_raid_disks, conf->raid_disks, conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded, conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except stripe_hashtbl and inactive_list, other fields are changed very rarely. With this patch, we split inactive_list and add new hash locks. Each free stripe belongs to a specific inactive list. Which inactive list is determined by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a lock_hash assigned. Stripe's inactive list is protected by a hash lock, which is determined by it's lock_hash too. The lock_hash is derivied from current stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl list too. The goal of the new hash locks introduced is we can only use the new locks in the first path of get_active_stripe(). Since we have several hash locks, lock contention is relieved significantly. The first path of get_active_stripe() accesses other fields, since they are changed rarely, changing them now need take conf->device_lock and all hash locks. For a slow path, this isn't a problem. If we need lock device_lock and hash lock, we always lock hash lock first. The tricky part is release_stripe and friends. We need take device_lock first. Neil's suggestion is we put inactive stripes to a temporary list and readd it to inactive_list after device_lock is released. In this way, we add stripes to temporary list with device_lock hold and remove stripes from the list with hash lock hold. So we don't allow concurrent access to the temporary list, which means we need allocate temporary list for all participants of release_stripe. One downside is free stripes are maintained in their inactive list, they can't across between the lists. By default, we have total 256 stripes and 8 lists, so each list will have 32 stripes. It's possible one list has free stripe but other list hasn't. The chance should be rare because stripes allocation are even distributed. And we can always allocate more stripes for cache, several mega bytes memory isn't a big deal. This completely removes the lock contention of the first path of get_active_stripe(). It slows down the second code path a little bit though because we now need takes two locks, but since the hash lock isn't contended, the overhead should be quite small (several atomic instructions). The second path of get_active_stripe() (basically sequential write or big request size randwrite) still has lock contentions. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * md/raid5.c: add proper locking to error path of raid5_start_reshape.NeilBrown2013-11-141-0/+6
| | | | | | | | | | | | | | | | If raid5_start_reshape errors out, we need to reset all the fields that were updated (not just some), and need to use the seq_counter to ensure make_request() doesn't use an inconsitent state. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: fix calculation of stacking limits on level change.NeilBrown2013-11-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The various ->run routines of md personalities assume that the 'queue' has been initialised by the blk_set_stacking_limits() call in md_alloc(). However when the level is changed (by level_store()) the ->run routine for the new level is called for an array which has already had the stacking limits modified. This can result in incorrect final settings. So call blk_set_stacking_limits() before ->run in level_store(). A specific consequence of this bug is that it causes discard_granularity to be set incorrectly when reshaping a RAID4 to a RAID0. This is suitable for any -stable kernel since 3.3 in which blk_set_stacking_limits() was introduced. Cc: stable@vger.kernel.org (3.3+) Reported-and-tested-by: "Baldysiak, Pawel" <pawel.baldysiak@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * raid5: Use slow_path to release stripe when mddev->thread is nullmajianpeng2013-11-141-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When release_stripe() is called in grow_one_stripe(), the mddev->thread is null. So it will omit one wakeup this thread to release stripe. For this condition, use slow_path to release stripe. Bug was introduced in 3.12 Cc: stable@vger.kernel.org (3.12+) Fixes: 773ca82fa1ee58dd1bf88b Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* | Merge branch 'for-linus' of ↵Linus Torvalds2013-11-151-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "Usual earth-shaking, news-breaking, rocket science pile from trivial.git" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits) doc: usb: Fix typo in Documentation/usb/gadget_configs.txt doc: add missing files to timers/00-INDEX timekeeping: Fix some trivial typos in comments mm: Fix some trivial typos in comments irq: Fix some trivial typos in comments NUMA: fix typos in Kconfig help text mm: update 00-INDEX doc: Documentation/DMA-attributes.txt fix typo DRM: comment: `halve' -> `half' Docs: Kconfig: `devlopers' -> `developers' doc: typo on word accounting in kprobes.c in mutliple architectures treewide: fix "usefull" typo treewide: fix "distingush" typo mm/Kconfig: Grammar s/an/a/ kexec: Typo s/the/then/ Documentation/kvm: Update cpuid documentation for steal time and pv eoi treewide: Fix common typo in "identify" __page_to_pfn: Fix typo in comment Correct some typos for word frequency clk: fixed-factor: Fix a trivial typo ...
| * | treewide: fix "distingush" typoMichael Opdenacker2013-10-141-1/+1
| | | | | | | | | | | | | | | Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | | Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2013-11-1525-3005/+2587
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull second round of block driver updates from Jens Axboe: "As mentioned in the original pull request, the bcache bits were pulled because of their dependency on the immutable bio vecs. Kent re-did this part and resubmitted it, so here's the 2nd round of (mostly) driver updates for 3.13. It contains: - The bcache work from Kent. - Conversion of virtio-blk to blk-mq. This removes the bio and request path, and substitutes with the blk-mq path instead. The end result almost 200 deleted lines. Patch is acked by Asias and Christoph, who both did a bunch of testing. - A removal of bootmem.h include from Grygorii Strashko, part of a larger series of his killing the dependency on that header file. - Removal of __cpuinit from blk-mq from Paul Gortmaker" * 'for-linus' of git://git.kernel.dk/linux-block: (56 commits) virtio_blk: blk-mq support blk-mq: remove newly added instances of __cpuinit bcache: defensively handle format strings bcache: Bypass torture test bcache: Delete some slower inline asm bcache: Use ida for bcache block dev minor bcache: Fix sysfs splat on shutdown with flash only devs bcache: Better full stripe scanning bcache: Have btree_split() insert into parent directly bcache: Move spinlock into struct time_stats bcache: Kill sequential_merge option bcache: Kill bch_next_recurse_key() bcache: Avoid deadlocking in garbage collection bcache: Incremental gc bcache: Add make_btree_freeing_key() bcache: Add btree_node_write_sync() bcache: PRECEDING_KEY() bcache: bch_(btree|extent)_ptr_invalid() bcache: Don't bother with bucket refcount for btree node allocations bcache: Debug code improvements ...
| * | | bcache: defensively handle format stringsKees Cook2013-11-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just to be safe, call the error reporting function with "%s" to avoid any possible future format string leak. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| * | | bcache: Bypass torture testKent Overstreet2013-11-105-8/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | More testing ftw! Also, now verify mode doesn't break if you read dirty data. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| * | | bcache: Delete some slower inline asmKent Overstreet2013-11-101-8/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Never saw a profile of bset_search_tree() where it wasn't bottlenecked on memory until I got my new Haswell machine, but when I tried it there it was suddenly burning 20% of the cpu in the inner loop on shrd... Turns out, the version of shrd that takes 64 bit operands has a 9 cycle latency. hah. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| * | | bcache: Use ida for bcache block dev minorKent Overstreet2013-11-101-6/+20
| | | | | | | | | | | | | | | | Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| * | | bcache: Fix sysfs splat on shutdown with flash only devsKent Overstreet2013-11-106-33/+30
| | | | | | | | | | | | | | | | | | | | | | | | Whoops. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| * | | bcache: Better full stripe scanningKent Overstreet2013-11-105-57/+99
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The old scanning-by-stripe code burned too much CPU, this should be better. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| * | | bcache: Have btree_split() insert into parent directlyKent Overstreet2013-11-101-46/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The flow control in btree_insert_node() was... fragile... before, this'll use more stack (but since our btrees are never more than depth 1, that shouldn't matter) and it should be significantly clearer and less fragile. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| * | | bcache: Move spinlock into struct time_statsKent Overstreet2013-11-106-16/+17
| | | | | | | | | | | | | | | | | | | | | | | | Minor cleanup. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| * | | bcache: Kill sequential_merge optionKent Overstreet2013-11-104-31/+18
| | | | | | | | | | | | | | | | | | | | | | | | It never really made sense to expose this, so just kill it. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| * | | bcache: Kill bch_next_recurse_key()Kent Overstreet2013-11-103-21/+11
| | | | | | | | | | | | | | | | | | | | | | | | This dates from before the btree iterator, and now it's finally gone Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| * | | bcache: Avoid deadlocking in garbage collectionKent Overstreet2013-11-103-12/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Not a complete fix - we could still deadlock if btree_insert_node() has to split... Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| * | | bcache: Incremental gcKent Overstreet2013-11-104-167/+226
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Big garbage collection rewrite; now, garbage collection uses the same mechanisms as used elsewhere for inserting/updating btree node pointers, instead of rewriting interior btree nodes in place. This makes the code significantly cleaner and less fragile, and means we can now make garbage collection incremental - it doesn't have to hold a write lock on the root of the btree for the entire duration of garbage collection. This means that there's less of a latency hit for doing garbage collection, which means we can gc more frequently (and do a better job of reclaiming from the cache), and we can coalesce across more btree nodes (improving our space efficiency). Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| * | | bcache: Add make_btree_freeing_key()Kent Overstreet2013-11-101-13/+18
| | | | | | | | | | | | | | | | | | | | | | | | Refactoring, prep work for incremental garbage collection. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| * | | bcache: Add btree_node_write_sync()Kent Overstreet2013-11-101-19/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | More refactoring - mostly making the interfaces more explicit about what we actually want to do. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| * | | bcache: PRECEDING_KEY()Kent Overstreet2013-11-102-7/+20
| | | | | | | | | | | | | | | | | | | | | | | | btree_insert_key() was open coding this, this is just refactoring. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| * | | bcache: bch_(btree|extent)_ptr_invalid()Kent Overstreet2013-11-104-21/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Trying to treat btree pointers and leaf node pointers the same way was a mistake - going to start being more explicit about the type of key/pointer we're dealing with. This is the first part of that refactoring; this patch shouldn't change any actual behaviour. Signed-off-by: Kent Overstreet <kmo@daterainc.com>
| * | | bcache: Don't bother with bucket refcount for btree node allocationsKent Overstreet2013-11-104-27/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The bucket refcount (dropped with bkey_put()) is only needed to prevent the newly allocated bucket from being garbage collected until we've added a pointer to it somewhere. But for btree node allocations, the fact that we have btree nodes locked is enough to guard against races with garbage collection. Eventually the per bucket refcount is going to be replaced with something specific to bch_alloc_sectors(). Signed-off-by: Kent Overstreet <kmo@daterainc.com>