path: root/drivers/md/bcache/bcache.h
Commit message | Author | Age | Files | Lines
* bcache: fix for data collapse after re-attaching an attached device (Tang Junhui, 2018-05-30, 1 file, -1/+1)

    [ Upstream commit 73ac105be390c1de42a2f21643c9778a5e002930 ]

    Backing device sdm is already attached to the cache set with ID
    f67ebe1f-f8bc-4d73-bfe5-9dc88607f119. Trying to attach it to another
    cache set returns an error:

        [root]# cd /sys/block/sdm/bcache
        [root]# echo 5ccd0a63-148e-48b8-afa2-aca9cbd6279f > attach
        -bash: echo: write error: Invalid argument

    After that, modify the label of the bcache device:

        [root]# echo data_disk1 > label

    Then reboot the system. When it powers back on, the backing device can
    no longer attach to the cache set, and the log shows:

        Feb 5 12:05:52 ceph152 kernel: [922385.508498] bcache: bch_cached_dev_attach() couldn't find uuid for sdm in set

    In sysfs_attach(), dc->sb.set_uuid was assigned the value written
    through sysfs regardless of whether bch_cached_dev_attach() succeeded.
    For example, if the backing device is already attached to a cache set,
    bch_cached_dev_attach() fails, but dc->sb.set_uuid has already been
    changed. Modifying the label of the bcache device then calls
    bch_write_bdev_super(), which writes dc->sb.set_uuid to the super
    block, so a wrong cache set ID is recorded there. After the reboot the
    cache set cannot find the UUID of the backing device, so the bcache
    device can no longer be created or used.

    With this patch, the cache set ID is no longer assigned to
    dc->sb.set_uuid directly in sysfs_attach(); instead it is passed into
    bch_cached_dev_attach(), and dc->sb.set_uuid is set only after the
    backing device has attached to the cache set successfully.

    Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
    Reviewed-by: Michael Lyle <mlyle@lyle.org>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

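    A minimal userspace sketch of the ordering change described above (the
    struct and function names are illustrative, not the actual bcache code):
    the cache set UUID only reaches the superblock copy once the attach has
    succeeded, so a failed attach can no longer leave a stale UUID for a
    later bch_write_bdev_super() to persist.

        #include <errno.h>
        #include <string.h>

        struct example_sb { unsigned char set_uuid[16]; };

        /* Hypothetical stand-in for bch_cached_dev_attach(): 0 on success. */
        static int do_attach(const unsigned char *uuid, int already_attached)
        {
                (void)uuid;     /* a real implementation would look the UUID up */
                return already_attached ? -EINVAL : 0;
        }

        static int attach_via_sysfs(struct example_sb *sb,
                                    const unsigned char *uuid,
                                    int already_attached)
        {
                int ret = do_attach(uuid, already_attached);

                if (ret)
                        return ret;     /* superblock left untouched on failure */

                memcpy(sb->set_uuid, uuid, 16); /* commit the UUID only on success */
                return 0;
        }
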
* bcache: fix for gc and write-back race (Tang Junhui, 2017-09-27, 1 file, -0/+1)

    commit 9baf30972b5568d8b5bc8b3c46a6ec5b58100463 upstream.

    gc and write-back race with each other (see the earlier "bcache gets
    stuck" email):

        gc thread                          write-back thread
        |                                  |bch_writeback_thread()
        |bch_gc_thread()                   |
        |                                  |==>read_dirty()
        |==>bch_btree_gc()                 |
        |==>btree_root()                   |
        |     //takes btree root           |
        |     //node write lock            |
        |==>bch_btree_gc_root()            |
        |                                  |==>read_dirty_submit()
        |                                  |==>write_dirty()
        |                                  |==>continue_at(cl,
        |                                  |      write_dirty_finish,
        |                                  |      system_wq);
        |                                  |==>write_dirty_finish()
        |                                  |      //executes in system_wq
        |                                  |==>bch_btree_insert()
        |                                  |==>bch_btree_map_leaf_nodes()
        |                                  |==>__bch_btree_map_nodes()
        |                                  |==>btree_root
        |                                  |      //tries to take the btree
        |                                  |      //root node read lock
        |                                  |-----stuck here
        |==>bch_btree_set_root()
        |==>bch_journal_meta()
        |==>bch_journal()
        |==>journal_try_write()
        |==>journal_write_unlocked()
        |      //journal_full(&c->journal) condition satisfied
        |==>continue_at(cl, journal_write, system_wq);
        |      //tries to execute journal_write in system_wq,
        |      //but the work queue is busy executing write_dirty_finish()
        |==>closure_sync()
        |      //waits for journal_write to finish and wake up gc
        |-------------stuck here
        |==>release btree root node write lock

    This patch allocates a separate workqueue for the write-back thread to
    avoid this race.

    (Commit log re-organized by Coly Li to pass checkpatch.pl checking)

    Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
    Acked-by: Coly Li <colyli@suse.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

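    A kernel-flavoured sketch of the kind of fix described (the workqueue
    name and the surrounding init/teardown helpers are assumptions, not the
    patch's exact code): write-back completion work gets its own workqueue,
    so it can never be queued behind work that gc is indirectly waiting on
    in the shared system_wq.

        #include <linux/errno.h>
        #include <linux/workqueue.h>

        /* Assumed example: a dedicated queue for write-back completion work. */
        static struct workqueue_struct *writeback_write_wq;

        static int writeback_wq_create(void)
        {
                /* WQ_MEM_RECLAIM: write-back may be needed to make forward
                 * progress under memory pressure, so give it a rescuer. */
                writeback_write_wq =
                        alloc_workqueue("bcache_writeback", WQ_MEM_RECLAIM, 0);
                return writeback_write_wq ? 0 : -ENOMEM;
        }

        static void writeback_wq_destroy(void)
        {
                if (writeback_write_wq)
                        destroy_workqueue(writeback_write_wq);
        }
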
* bcache: Make gc wakeup sane, remove set_task_state() (Kent Overstreet, 2017-02-23, 1 file, -2/+2)

    commit be628be09563f8f6e81929efbd7cf3f45c344416 upstream.

    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

* bcache: remove driver private bio splitting code (Kent Overstreet, 2015-08-13, 1 file, -18/+0)

    The bcache driver has always accepted arbitrarily large bios and split
    them internally. Now that every driver must accept arbitrarily large
    bios, this code isn't necessary anymore.

    Cc: linux-bcache@vger.kernel.org
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
    [dpark: add more description in commit message]
    Signed-off-by: Dongsu Park <dpark@posteo.net>
    Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>

* bcache: fix crash with incomplete cache set (Slava Pestov, 2014-08-04, 1 file, -0/+4)

    Change-Id: I6abde52afe917633480caaf4e2518f42a816d886

* arch: Mass conversion of smp_mb__*() (Peter Zijlstra, 2014-04-18, 1 file, -1/+1)

    Mostly scripted conversion of the smp_mb__* barriers.

    Signed-off-by: Peter Zijlstra <peterz@infradead.org>
    Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: linux-arch@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

* bcache: Kill bucket->gc_gen (Kent Overstreet, 2014-03-18, 1 file, -2/+1)

    gc_gen was a temporary used to recalculate last_gc, but since we only
    need bucket->last_gc when gc isn't running (gc_mark_valid = 1), we can
    just update last_gc directly.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Kill unused freelist (Kent Overstreet, 2014-03-18, 1 file, -23/+5)

    This was originally added as an optimization that for various reasons
    isn't needed anymore, but it does add a lot of nasty corner cases (and
    it was responsible for some recently fixed bugs). Just get rid of it
    now.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Rework btree cache reserve handling (Kent Overstreet, 2014-03-18, 1 file, -10/+5)

    This changes the bucket allocation reserves to use _real_ reserves -
    separate freelists - instead of watermarks, which if nothing else
    makes the current code saner to reason about and is going to be
    important in the future when we add support for multiple btrees.

    It also adds btree_check_reserve(), which checks (and locks) the
    reserves for both bucket allocation and memory allocation for btree
    nodes. The old code just sort of assumed that since (e.g. for btree
    node splits) it had the root locked, no other threads could try to
    make use of the same reserve. That was technically ok for memory
    allocation (we should always have a reserve for memory allocation,
    since the btree node cache is used as a reserve and we preallocate
    it), but multiple btrees will mean that locking the root won't be
    sufficient anymore, and for the bucket allocation reserve it was
    technically possible for the old code to deadlock.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Kill btree_io_wq (Kent Overstreet, 2014-03-18, 1 file, -2/+0)

    With the locking rework in the last patch, this shouldn't be needed
    anymore - btree_node_write_work() only takes b->write_lock, which is
    never held for very long.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Add a real GC_MARK_RECLAIMABLE (Kent Overstreet, 2014-03-18, 1 file, -3/+3)

    This means the garbage collection code can better check for data and
    metadata pointers to the same buckets.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Fix moving_gc deadlocking with a foreground write (Nicholas Swenson, 2014-03-18, 1 file, -0/+2)

    Deadlock happened because a foreground write slept, waiting for a
    bucket to be allocated. Normally the gc would mark buckets available
    for invalidation. But the moving_gc was stuck waiting for outstanding
    writes to complete. These writes used the bcache_wq, the same queue
    foreground writes used.

    This fix gives moving_gc its own work queue, so it can still finish
    moving even if foreground writes are stuck waiting for allocation. It
    also makes the work queue a parameter to the data_insert path, so
    moving_gc can use its own workqueue for writes.

    Signed-off-by: Nicholas Swenson <nks@daterainc.com>
    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: fix BUG_ON due to integer overflow with GC_SECTORS_USED (Darrick J. Wong, 2014-01-29, 1 file, -1/+3)

    The BUG_ON at the end of __bch_btree_mark_key can be triggered due to
    an integer overflow error:

        BITMASK(GC_SECTORS_USED, struct bucket, gc_mark, 2, 13);
        ...
        SET_GC_SECTORS_USED(g, min_t(unsigned,
                            GC_SECTORS_USED(g) + KEY_SIZE(k),
                            (1 << 14) - 1));
        BUG_ON(!GC_SECTORS_USED(g));

    In bcache.h, the SECTORS_USED bitfield is defined to be 13 bits wide.
    While the SET_ code tries to ensure that the field doesn't overflow by
    clamping it to (1 << 14) - 1 == 16383, this is incorrect because 16383
    requires 14 bits. Therefore, if GC_SECTORS_USED() + KEY_SIZE() = 8192,
    the SET_ statement tries to store 8192 into a 13-bit field. In a
    13-bit field, 8192 becomes zero, thus triggering the BUG_ON.

    Therefore, create a field width constant and a max value constant, and
    use those to create the bitfield and check the inputs to
    SET_GC_SECTORS_USED. Arguably the BITMASK() template ought to have
    BUG_ON checks for too-large values, but that's a separate patch.

    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

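    A small self-contained illustration of the fix's idea (the constant
    names are modeled on the description above, not copied from the patch):
    derive the clamp limit from the field width so the clamped value always
    fits and can never wrap to zero.

        #include <assert.h>

        /* Width of the sectors-used field and the largest value it can
         * hold; clamping against a limit derived from the same width makes
         * the wrap-to-zero described above impossible. */
        #define GC_SECTORS_USED_WIDTH 13
        #define MAX_GC_SECTORS_USED   ((1U << GC_SECTORS_USED_WIDTH) - 1)

        int main(void)
        {
                unsigned int used = 8000, key_size = 192;   /* sums to 8192 */
                unsigned int v = used + key_size;

                if (v > MAX_GC_SECTORS_USED)
                        v = MAX_GC_SECTORS_USED;            /* clamp to 8191 */

                assert((v & MAX_GC_SECTORS_USED) == v);     /* fits in 13 bits */
                assert(v != 0);                             /* the BUG_ON would not fire */
                return 0;
        }
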
* bcache: Improve bucket_prio() calculation (Kent Overstreet, 2014-01-08, 1 file, -1/+1)

    When deciding what order to reuse buckets in, we take into account both
    the bucket's priority (which indicates lru order) and also the amount
    of live data in that bucket. The way they were scaled together wasn't
    as correct as it could be - this patch improves and documents it.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Add struct btree_keys (Kent Overstreet, 2014-01-08, 1 file, -1/+1)

    Soon, bset.c won't need to depend on struct btree.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Rename/shuffle various code around (Kent Overstreet, 2014-01-08, 1 file, -8/+0)

    More work to disentangle bset.c from the rest of the code.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Add struct bset_sort_state (Kent Overstreet, 2014-01-08, 1 file, -3/+2)

    More disentangling bset.c from the rest of the bcache code - soon, the
    sorting routines won't have any dependencies on any outside structs.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Bkey indexing renaming (Kent Overstreet, 2014-01-08, 1 file, -9/+2)

    More refactoring:

        node() -> bset_bkey_idx()
        end()  -> bset_bkey_last()

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Remove/fix some header dependencies (Kent Overstreet, 2014-01-08, 1 file, -5/+18)

    Part of the process of disentangling/libraryizing bset.c from the rest
    of the bcache code.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Use a mempool for mergesort temporary space (Kent Overstreet, 2014-01-08, 1 file, -6/+1)

    It was a single-element mempool before; it's slightly cleaner to just
    use a real mempool.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Btree verify code improvements (Kent Overstreet, 2014-01-08, 1 file, -0/+1)

    Used this fixed code to find and fix the bug fixed by
    a4d885097b0ac0cd1337f171f2d4b83e946094d4.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: kill index() (Kent Overstreet, 2014-01-08, 1 file, -4/+0)

    That was a terrible name for a macro; add some better helpers to
    replace it.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Rework allocator reserves (Kent Overstreet, 2014-01-08, 1 file, -9/+7)

    We need a reserve for allocating buckets for new btree nodes - and now
    that we've got multiple btrees, it really needs to be per btree.

    This reworks the reserves so we've got separate freelists for each
    reserve instead of watermarks, which seems to make things a bit
    cleaner, and it adds some code so that btree_split() can make sure the
    reserve is available before it starts.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

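    A small sketch of the "separate freelists instead of watermarks" idea
    (the enum values, structure, and helper are assumptions loosely modeled
    on the description, not the patch's code): each consumer draws from its
    own pool, so exhausting one reserve cannot starve another.

        #include <stddef.h>

        /* Hypothetical reserve classes - one freelist each, instead of a
         * single list guarded by watermarks. */
        enum example_reserve {
                RESERVE_EXAMPLE_BTREE,
                RESERVE_EXAMPLE_MOVINGGC,
                RESERVE_EXAMPLE_NONE,
                RESERVE_EXAMPLE_NR,
        };

        struct example_freelist {
                long buckets[8];
                size_t count;
        };

        static struct example_freelist reserves[RESERVE_EXAMPLE_NR];

        /* Pop a bucket from the caller's own reserve; -1 means that this
         * particular reserve (and only this one) is exhausted. */
        static long alloc_bucket(enum example_reserve r)
        {
                struct example_freelist *fl = &reserves[r];

                return fl->count ? fl->buckets[--fl->count] : -1;
        }
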
* bcache: kill closure locking usage (Kent Overstreet, 2014-01-08, 1 file, -3/+6)

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* Merge tag 'v3.13-rc6' into for-3.14/core (Jens Axboe, 2013-12-31, 1 file, -6/+6)

    Needed to bring blk-mq up to date, since changes have been going in
    since for-3.14/core was established.

    Fix up merge issues related to the immutable biovec changes.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>

    Conflicts:
        block/blk-flush.c
        fs/btrfs/check-integrity.c
        fs/btrfs/extent_io.c
        fs/btrfs/scrub.c
        fs/logfs/dev_bdev.c

* bcache: New writeback PD controller (Kent Overstreet, 2013-12-16, 1 file, -3/+3)

    The old writeback PD controller could get into states where it had
    throttled all the way down and took way too long to recover - it was
    too complicated to really understand what it was doing.

    This rewrites a good chunk of it to hopefully be simpler and make more
    sense, and it also pays more attention to units, which should make the
    behaviour a bit easier to understand.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: bugfix - moving_gc now moves only correct buckets (Nicholas Swenson, 2013-12-16, 1 file, -3/+3)

    Removed gc_move_threshold because picking buckets only by threshold
    could lead to moving extra buckets (i.e. buckets at the threshold that
    aren't supposed to be moved due to space considerations).

    This is replaced by a GC_MOVE bit in the gc_mark bitmask. Now only
    marked buckets get moved.

    Signed-off-by: Nicholas Swenson <nks@daterainc.com>
    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* block: Introduce new bio_split() (Kent Overstreet, 2013-11-23, 1 file, -1/+0)

    The new bio_split() can split arbitrary bios - it's not restricted to
    single page bios, like the old bio_split() (previously renamed to
    bio_pair_split()). It also has different semantics - it doesn't
    allocate a struct bio_pair, leaving it up to the caller to handle
    completions.

    Then convert the existing bio_pair_split() users to the new
    bio_split() - and also nvme, which was open coding bio splitting.

    (We have to take that BUG_ON() out of bio_integrity_trim() because
    this bio_split() needs to use it, and there's no reason it has to be
    used on bios marked as cloned; BIO_CLONED doesn't seem to have clearly
    documented semantics anyway.)

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Martin K. Petersen <martin.petersen@oracle.com>
    Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
    Cc: Keith Busch <keith.busch@intel.com>
    Cc: Vishal Verma <vishal.l.verma@intel.com>
    Cc: Jiri Kosina <jkosina@suse.cz>
    Cc: Neil Brown <neilb@suse.de>

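    A kernel-flavoured sketch of how a caller of the new interface handles
    completion itself (an assumed usage pattern, not code from the patch):
    the split-off front half is chained to the remainder with bio_chain()
    instead of being packaged in a struct bio_pair.

        #include <linux/bio.h>

        /* Assumed example: carve the first @sectors off @bio. The caller
         * submits both halves itself; bio_chain() makes the remainder's
         * completion wait for the split-off front half. */
        static struct bio *split_front(struct bio *bio, int sectors,
                                       struct bio_set *bs)
        {
                struct bio *front = bio_split(bio, sectors, GFP_NOIO, bs);

                if (front)
                        bio_chain(front, bio);
                return front;
        }
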
* bcache: Kill unaligned bvec hack (Kent Overstreet, 2013-11-23, 1 file, -1/+0)

    Bcache has a hack to avoid cloning the biovec if it's all full pages -
    but with immutable biovecs coming this won't be necessary anymore.

    For now, we remove the special case and always clone the bvec array so
    that the immutable biovec patches are simpler.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Bypass torture test (Kent Overstreet, 2013-11-10, 1 file, -0/+1)

    More testing ftw! Also, now verify mode doesn't break if you read
    dirty data.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Fix sysfs splat on shutdown with flash only devs (Kent Overstreet, 2013-11-10, 1 file, -6/+4)

    Whoops.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Better full stripe scanning (Kent Overstreet, 2013-11-10, 1 file, -2/+3)

    The old scanning-by-stripe code burned too much CPU; this should be
    better.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Move spinlock into struct time_stats (Kent Overstreet, 2013-11-10, 1 file, -2/+0)

    Minor cleanup.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Kill sequential_merge option (Kent Overstreet, 2013-11-10, 1 file, -1/+0)

    It never really made sense to expose this, so just kill it.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Incremental gc (Kent Overstreet, 2013-11-10, 1 file, -1/+0)

    Big garbage collection rewrite; now, garbage collection uses the same
    mechanisms as used elsewhere for inserting/updating btree node
    pointers, instead of rewriting interior btree nodes in place.

    This makes the code significantly cleaner and less fragile, and means
    we can now make garbage collection incremental - it doesn't have to
    hold a write lock on the root of the btree for the entire duration of
    garbage collection.

    This means that there's less of a latency hit for doing garbage
    collection, which means we can gc more frequently (and do a better job
    of reclaiming from the cache), and we can coalesce across more btree
    nodes (improving our space efficiency).

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Debug code improvements (Kent Overstreet, 2013-11-10, 1 file, -9/+1)

    A couple of changes:

      * Consolidate bch_check_keys() and bch_check_key_order(), and move
        the checks that only check_key_order() could do to
        bch_btree_iter_next().

      * Get rid of CONFIG_BCACHE_EDEBUG - now, all that code is compiled in
        when CONFIG_BCACHE_DEBUG is enabled, and there's now a sysfs file
        to flip on the EDEBUG checks at runtime.

      * Dropped an old, not terribly useful check in rw_unlock(), and
        refactored/improved some of the other debug code.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Pull on disk data structures out into a separate header (Kent Overstreet, 2013-11-10, 1 file, -242/+2)

    Now, the on disk data structures are in a header that can be exported
    to userspace - and having them all centralized is nice too.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Move sector allocator to alloc.c (Kent Overstreet, 2013-11-10, 1 file, -0/+4)

    Just reorganizing things a bit.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Add btree_map() functions (Kent Overstreet, 2013-11-10, 1 file, -2/+0)

    Lots of stuff has been open coding its own btree traversal - which is
    generally pretty simple code, but there are a few subtleties.

    This adds two new functions, bch_btree_map_nodes() and
    bch_btree_map_keys(), which do the traversal for you. Everything
    that's open coding btree traversal now (with the exception of garbage
    collection) is slowly going to be converted to these two functions;
    being able to write other code at a higher level of abstraction is a
    big improvement w.r.t. overall code quality.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Convert writeback to a kthread (Kent Overstreet, 2013-11-10, 1 file, -4/+6)

    This simplifies the writeback flow control quite a bit - previously,
    it was conceptually two coroutines, refill_dirty() and read_dirty().
    This makes the code quite a bit more straightforward.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Convert gc to a kthread (Kent Overstreet, 2013-11-10, 1 file, -5/+4)

    We needed a dedicated rescuer workqueue for gc anyway... and gc was
    conceptually a dedicated thread, just one that wasn't running all the
    time. Switch it to a dedicated thread to make the code a bit more
    straightforward.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

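    A kernel-flavoured sketch of the dedicated-thread shape this converts
    to (a generic illustration of the kthread idiom, not bcache's actual gc
    loop): the thread sleeps until woken and exits when kthread_stop() is
    called.

        #include <linux/kthread.h>
        #include <linux/sched.h>

        static int gc_thread_example(void *arg)
        {
                while (1) {
                        set_current_state(TASK_INTERRUPTIBLE);
                        if (kthread_should_stop()) {
                                __set_current_state(TASK_RUNNING);
                                break;
                        }
                        schedule();     /* sleep until someone wakes us */

                        /* ... one round of gc work would run here ... */
                }
                return 0;
        }

        /* Started with something like
         *      kthread_run(gc_thread_example, NULL, "gc_example");
         * and later stopped with kthread_stop(). */
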
* bcache: Convert bucket_wait to wait_queue_head_t (Kent Overstreet, 2013-11-10, 1 file, -4/+4)

    At one point we did do fancy asynchronous waiting stuff with
    bucket_wait, but that's all gone (and bucket_wait is used a lot less
    than it used to be). So use the standard primitives.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

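    A kernel-flavoured sketch of the standard primitives this conversion
    (and the try_wait conversion in the next entry) switches to; the names
    here are illustrative, not bcache's actual fields.

        #include <linux/types.h>
        #include <linux/wait.h>

        static DECLARE_WAIT_QUEUE_HEAD(bucket_wait_example);
        static bool buckets_free;

        static void waiter(void)
        {
                /* Sleep until the condition becomes true. */
                wait_event(bucket_wait_example, buckets_free);
        }

        static void allocator(void)
        {
                buckets_free = true;
                wake_up(&bucket_wait_example);  /* waiters re-check the condition */
        }
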
* bcache: Convert try_wait to wait_queue_head_t (Kent Overstreet, 2013-11-10, 1 file, -2/+2)

    We never waited on c->try_wait asynchronously, so just use the
    standard primitives.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Convert btree_insert_check_key() to btree_insert_node() (Kent Overstreet, 2013-11-10, 1 file, -8/+0)

    This was the main point of all this refactoring - now,
    btree_insert_check_key() won't fail just because the leaf node
    happened to be full.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Stripe size isn't necessarily a power of two (Kent Overstreet, 2013-11-10, 1 file, -1/+1)

    Originally I got this right... except that the divides didn't use
    do_div(), which broke 32 bit kernels. When I went to fix that, I
    forgot that the raid stripe size usually isn't a power of two... doh

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

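    A kernel-flavoured sketch of the point being made (the helper name is
    an assumption): a 64-bit division by a stripe size that is not a power
    of two has to go through do_div() so it also builds and runs on 32-bit
    kernels, rather than shifting by a log2 or using a plain 64-bit '/'.

        #include <linux/types.h>
        #include <asm/div64.h>

        /* Assumed helper: map a device offset (in sectors) to its stripe
         * index without requiring stripe_size to be a power of two. */
        static unsigned int offset_to_stripe_example(u64 offset,
                                                     unsigned int stripe_size)
        {
                do_div(offset, stripe_size);    /* offset now holds the quotient */
                return (unsigned int)offset;
        }
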
* bcache: Add on error panic/unregister setting (Kent Overstreet, 2013-11-10, 1 file, -0/+6)

    Works kind of like the ext4 setting, to panic or remount read only on
    errors.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Use blkdev_issue_discard() (Kent Overstreet, 2013-11-10, 1 file, -10/+0)

    The old asynchronous discard code was really a relic from when all the
    allocation code was asynchronous - now that allocation runs out of a
    dedicated thread, there's no point in keeping around all that
    complicated machinery.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Fix a writeback performance regression (Kent Overstreet, 2013-09-24, 1 file, -4/+3)

    Background writeback works by scanning the btree for dirty data and
    adding those keys into a fixed size buffer, then, for each dirty key
    in the keybuf, writing it to the backing device.

    When read_dirty() finishes and it's time to scan for more dirty data,
    we need to wait for the outstanding writeback IO to finish - those
    writes still take up slots in the keybuf (so that foreground writes
    can check for them to avoid races). Without that wait, we'll
    continually rescan when we'll be able to add at most a key or two to
    the keybuf, and that takes locks that starve foreground IO. Doh.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
    Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

* bcache: Allocation kthread fixes (Kent Overstreet, 2013-07-12, 1 file, -4/+0)

    The alloc kthread should've been using try_to_freeze() - and also
    there was the potential for the alloc kthread to get woken up after it
    had shut down, which would have been bad.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>

* bcache: Fix a sysfs splat on shutdown (Kent Overstreet, 2013-07-12, 1 file, -0/+1)

    If we stopped a bcache device when we were already detaching (or
    something like that), bcache_device_unlink() would try to remove a
    symlink from sysfs that was already gone because the bcache dev
    kobject had already been removed from sysfs.

    So keep track of whether we've removed stuff from sysfs.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
    Cc: linux-stable <stable@vger.kernel.org> # >= v3.10

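    A minimal sketch of the "keep track of it" idea described above (the
    struct and field names are hypothetical): record whether the sysfs
    pieces still exist so a second teardown path skips the symlink removal.

        #include <stdbool.h>

        /* Hypothetical stand-in for the device state. */
        struct example_device {
                bool unlinked;  /* set once the sysfs symlinks are gone */
        };

        static void device_unlink(struct example_device *d)
        {
                if (d->unlinked)
                        return;         /* already gone - don't touch sysfs again */

                /* sysfs symlink removal would happen here */
                d->unlinked = true;
        }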