path: root/drivers/md/bcache/bcache.h
Commit message (author, date; files changed, lines -/+)
* bcache: fix BUG_ON due to integer overflow with GC_SECTORS_USED (Darrick J. Wong, 2014-01-29; 1 file changed, -1/+3)
  The BUG_ON at the end of __bch_btree_mark_key can be triggered due to an integer overflow error:

    BITMASK(GC_SECTORS_USED, struct bucket, gc_mark, 2, 13);
    ...
    SET_GC_SECTORS_USED(g, min_t(unsigned, GC_SECTORS_USED(g) + KEY_SIZE(k), (1 << 14) - 1));
    BUG_ON(!GC_SECTORS_USED(g));

  In bcache.h, the SECTORS_USED bitfield is defined to be 13 bits wide. While the SET_ code tries to ensure that the field doesn't overflow by clamping it to (1 << 14) - 1 == 16383, this is incorrect because 16383 requires 14 bits. Therefore, if GC_SECTORS_USED() + KEY_SIZE() = 8192, the SET_ statement tries to store 8192 into a 13-bit field. In a 13-bit field, 8192 becomes zero, thus triggering the BUG_ON.
  Therefore, create a field width constant and a max value constant, and use those to create the bitfield and check the inputs to SET_GC_SECTORS_USED. Arguably the BITMASK() template ought to have BUG_ON checks for too-large values, but that's a separate patch.
  Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
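  To make the arithmetic concrete, here is a small standalone C sketch of the overflow described above. The constants GC_SECTORS_USED_WIDTH and MAX_GC_SECTORS_USED are illustrative stand-ins for whatever the patch actually names them, and the 13-bit field is simulated with a mask rather than bcache's BITMASK() macro:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative constants: a 13-bit field and the largest value it holds. */
    #define GC_SECTORS_USED_WIDTH 13
    #define MAX_GC_SECTORS_USED   ((1U << GC_SECTORS_USED_WIDTH) - 1)  /* 8191 */

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /* Simulate storing a value into a 13-bit bitfield: high bits are dropped. */
    static unsigned store_13bit(unsigned v)
    {
        return v & ((1U << GC_SECTORS_USED_WIDTH) - 1);
    }

    int main(void)
    {
        unsigned used = 4096, key_size = 4096;

        /* Old clamp: (1 << 14) - 1 == 16383 does not fit in 13 bits, so
         * 8192 slips past the clamp, wraps to 0, and trips BUG_ON(!used). */
        unsigned old = store_13bit(MIN(used + key_size, (1U << 14) - 1));
        assert(old == 0);
        printf("old clamp stores %u\n", old);      /* prints 0 */

        /* Fixed clamp: limit to the real field maximum, 8191. */
        unsigned fixed = store_13bit(MIN(used + key_size, MAX_GC_SECTORS_USED));
        assert(fixed == MAX_GC_SECTORS_USED);
        printf("fixed clamp stores %u\n", fixed);  /* prints 8191 */
        return 0;
    }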
* bcache: Improve bucket_prio() calculation (Kent Overstreet, 2014-01-08; 1 file changed, -1/+1)
  When deciding what order to reuse buckets we take into account both the bucket's priority (which indicates lru order) and also the amount of live data in that bucket. The way they were scaled together wasn't as correct as it could be... this patch improves and documents it.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Add struct btree_keys (Kent Overstreet, 2014-01-08; 1 file changed, -1/+1)
  Soon, bset.c won't need to depend on struct btree.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Rename/shuffle various code around (Kent Overstreet, 2014-01-08; 1 file changed, -8/+0)
  More work to disentangle bset.c from the rest of the code:
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Add struct bset_sort_state (Kent Overstreet, 2014-01-08; 1 file changed, -3/+2)
  More disentangling bset.c from the rest of the bcache code - soon, the sorting routines won't have any dependencies on any outside structs.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Bkey indexing renaming (Kent Overstreet, 2014-01-08; 1 file changed, -9/+2)
  More refactoring:
    node() -> bset_bkey_idx()
    end()  -> bset_bkey_last()
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Remove/fix some header dependencies (Kent Overstreet, 2014-01-08; 1 file changed, -5/+18)
  In the process of disentangling/libraryizing bset.c from the rest of the bcache code.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Use a mempool for mergesort temporary space (Kent Overstreet, 2014-01-08; 1 file changed, -6/+1)
  It was a single element mempool before; it's slightly cleaner to just use a real mempool.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Btree verify code improvements (Kent Overstreet, 2014-01-08; 1 file changed, -0/+1)
  Used this fixed code to find and fix the bug fixed by a4d885097b0ac0cd1337f171f2d4b83e946094d4.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: kill index() (Kent Overstreet, 2014-01-08; 1 file changed, -4/+0)
  That was a terrible name for a macro; add some better helpers to replace it.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Rework allocator reserves (Kent Overstreet, 2014-01-08; 1 file changed, -9/+7)
  We need a reserve for allocating buckets for new btree nodes - and now that we've got multiple btrees, it really needs to be per btree.
  This reworks the reserves so we've got separate freelists for each reserve instead of watermarks, which seems to make things a bit cleaner, and it adds some code so that btree_split() can make sure the reserve is available before it starts.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
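  As a rough, hypothetical sketch of the "separate freelist per reserve" idea described above (the enum values, struct layout and function names here are invented for illustration and are not the actual bcache code):

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical reserve classes: one freelist per consumer instead of a
     * single freelist guarded by watermarks. */
    enum alloc_reserve { RESERVE_BTREE, RESERVE_MOVINGGC, RESERVE_NONE, RESERVE_NR };

    /* Free buckets currently sitting on each reserve's freelist. */
    struct cache {
        unsigned free[RESERVE_NR];
    };

    /* Allocation drains only the caller's own reserve, so one consumer can
     * never starve another - no watermark arithmetic required. */
    static bool bucket_alloc(struct cache *ca, enum alloc_reserve r)
    {
        if (!ca->free[r])
            return false;
        ca->free[r]--;
        return true;
    }

    /* btree_split() can check up front that enough reserved buckets exist,
     * instead of failing partway through the split. */
    static bool btree_split_may_start(struct cache *ca, unsigned nodes_needed)
    {
        return ca->free[RESERVE_BTREE] >= nodes_needed;
    }

    int main(void)
    {
        struct cache ca = { .free = { [RESERVE_BTREE] = 2, [RESERVE_NONE] = 8 } };

        printf("split needing 3 nodes may start? %d\n", btree_split_may_start(&ca, 3)); /* 0 */
        printf("alloc from btree reserve: %d\n", bucket_alloc(&ca, RESERVE_BTREE));     /* 1 */
        return 0;
    }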
* bcache: kill closure locking usage (Kent Overstreet, 2014-01-08; 1 file changed, -3/+6)
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* Merge tag 'v3.13-rc6' into for-3.14/core (Jens Axboe, 2013-12-31; 1 file changed, -6/+6)
  Needed to bring blk-mq up to date, since changes have been going in since for-3.14/core was established.
  Fixup merge issues related to the immutable biovec changes.
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
  Conflicts:
    block/blk-flush.c
    fs/btrfs/check-integrity.c
    fs/btrfs/extent_io.c
    fs/btrfs/scrub.c
    fs/logfs/dev_bdev.c
* bcache: New writeback PD controller (Kent Overstreet, 2013-12-16; 1 file changed, -3/+3)
  The old writeback PD controller could get into states where it had throttled all the way down and took way too long to recover - it was too complicated to really understand what it was doing.
  This rewrites a good chunk of it to hopefully be simpler and make more sense, and it also pays more attention to units, which should make the behaviour a bit easier to understand.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: bugfix - moving_gc now moves only correct buckets (Nicholas Swenson, 2013-12-16; 1 file changed, -3/+3)
  Removed gc_move_threshold because picking buckets only by threshold could lead to moving extra buckets (i.e. if there are buckets at the threshold that aren't supposed to be moved due to space considerations).
  This is replaced by a GC_MOVE bit in the gc_mark bitmask. Now only marked buckets get moved.
  Signed-off-by: Nicholas Swenson <nks@daterainc.com>
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* block: Introduce new bio_split() (Kent Overstreet, 2013-11-23; 1 file changed, -1/+0)
  The new bio_split() can split arbitrary bios - it's not restricted to single page bios, like the old bio_split() (previously renamed to bio_pair_split()). It also has different semantics - it doesn't allocate a struct bio_pair, leaving it up to the caller to handle completions.
  Then convert the existing bio_pair_split() users to the new bio_split() - and also nvme, which was open coding bio splitting.
  (We have to take that BUG_ON() out of bio_integrity_trim() because this bio_split() needs to use it, and there's no reason it has to be used on bios marked as cloned; BIO_CLONED doesn't seem to have clearly documented semantics anyways.)
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
  Cc: Jens Axboe <axboe@kernel.dk>
  Cc: Martin K. Petersen <martin.petersen@oracle.com>
  Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
  Cc: Keith Busch <keith.busch@intel.com>
  Cc: Vishal Verma <vishal.l.verma@intel.com>
  Cc: Jiri Kosina <jkosina@suse.cz>
  Cc: Neil Brown <neilb@suse.de>
* bcache: Kill unaligned bvec hack (Kent Overstreet, 2013-11-23; 1 file changed, -1/+0)
  Bcache has a hack to avoid cloning the biovec if it's all full pages - but with immutable biovecs coming this won't be necessary anymore.
  For now, we remove the special case and always clone the bvec array so that the immutable biovec patches are simpler.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Bypass torture test (Kent Overstreet, 2013-11-10; 1 file changed, -0/+1)
  More testing ftw! Also, now verify mode doesn't break if you read dirty data.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Fix sysfs splat on shutdown with flash only devs (Kent Overstreet, 2013-11-10; 1 file changed, -6/+4)
  Whoops.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Better full stripe scanning (Kent Overstreet, 2013-11-10; 1 file changed, -2/+3)
  The old scanning-by-stripe code burned too much CPU; this should be better.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Move spinlock into struct time_stats (Kent Overstreet, 2013-11-10; 1 file changed, -2/+0)
  Minor cleanup.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Kill sequential_merge option (Kent Overstreet, 2013-11-10; 1 file changed, -1/+0)
  It never really made sense to expose this, so just kill it.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Incremental gc (Kent Overstreet, 2013-11-10; 1 file changed, -1/+0)
  Big garbage collection rewrite; now, garbage collection uses the same mechanisms as used elsewhere for inserting/updating btree node pointers, instead of rewriting interior btree nodes in place.
  This makes the code significantly cleaner and less fragile, and means we can now make garbage collection incremental - it doesn't have to hold a write lock on the root of the btree for the entire duration of garbage collection.
  This means that there's less of a latency hit for doing garbage collection, which means we can gc more frequently (and do a better job of reclaiming from the cache), and we can coalesce across more btree nodes (improving our space efficiency).
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Debug code improvements (Kent Overstreet, 2013-11-10; 1 file changed, -9/+1)
  Couple changes:
   * Consolidate bch_check_keys() and bch_check_key_order(), and move the checks that only check_key_order() could do to bch_btree_iter_next().
   * Get rid of CONFIG_BCACHE_EDEBUG - now, all that code is compiled in when CONFIG_BCACHE_DEBUG is enabled, and there's now a sysfs file to flip on the EDEBUG checks at runtime.
   * Dropped an old not terribly useful check in rw_unlock(), and refactored/improved some of the other debug code.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Pull on disk data structures out into a separate header (Kent Overstreet, 2013-11-10; 1 file changed, -242/+2)
  Now, the on disk data structures are in a header that can be exported to userspace - and having them all centralized is nice too.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Move sector allocator to alloc.c (Kent Overstreet, 2013-11-10; 1 file changed, -0/+4)
  Just reorganizing things a bit.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Add btree_map() functions (Kent Overstreet, 2013-11-10; 1 file changed, -2/+0)
  Lots of stuff has been open coding its own btree traversal - which is generally pretty simple code, but there are a few subtleties.
  This adds two new functions, bch_btree_map_nodes() and bch_btree_map_keys(), which do the traversal for you. Everything that's open coding btree traversal now (with the exception of garbage collection) is slowly going to be converted to these two functions; being able to write other code at a higher level of abstraction is a big improvement w.r.t. overall code quality.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Convert writeback to a kthread (Kent Overstreet, 2013-11-10; 1 file changed, -4/+6)
  This simplifies the writeback flow control quite a bit - previously, it was conceptually two coroutines, refill_dirty() and read_dirty(). This makes the code quite a bit more straightforward.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Convert gc to a kthread (Kent Overstreet, 2013-11-10; 1 file changed, -5/+4)
  We needed a dedicated rescuer workqueue for gc anyways... and gc was conceptually a dedicated thread, just one that wasn't running all the time. Switch it to a dedicated thread to make the code a bit more straightforward.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Convert bucket_wait to wait_queue_head_t (Kent Overstreet, 2013-11-10; 1 file changed, -4/+4)
  At one point we did do fancy asynchronous waiting stuff with bucket_wait, but that's all gone (and bucket_wait is used a lot less than it used to be). So use the standard primitives.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Convert try_wait to wait_queue_head_t (Kent Overstreet, 2013-11-10; 1 file changed, -2/+2)
  We never waited on c->try_wait asynchronously, so just use the standard primitives.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Convert btree_insert_check_key() to btree_insert_node() (Kent Overstreet, 2013-11-10; 1 file changed, -8/+0)
  This was the main point of all this refactoring - now, btree_insert_check_key() won't fail just because the leaf node happened to be full.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Stripe size isn't necessarily a power of two (Kent Overstreet, 2013-11-10; 1 file changed, -1/+1)
  Originally I got this right... except that the divides didn't use do_div(), which broke 32 bit kernels. When I went to fix that, I forgot that the raid stripe size usually isn't a power of two... doh
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
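  To illustrate why this matters, here is a small standalone sketch of mapping a sector to a stripe; the helper name and numbers are made up. With a non-power-of-two stripe size the stripe index and in-stripe offset need real 64-bit division and modulo - which on 32-bit kernels means going through do_div() - and cannot come from shifts and masks:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical helper: split a device offset (in 512-byte sectors) into a
     * stripe index and the offset within that stripe.  For raid5/6 the stripe
     * size is usually *not* a power of two. */
    static void sector_to_stripe(uint64_t sector, uint32_t stripe_size,
                                 uint64_t *stripe, uint32_t *offset)
    {
        /* Plain C division; in the kernel this 64-by-32 divide has to go
         * through do_div() so that 32-bit builds link and work. */
        *stripe = sector / stripe_size;
        *offset = (uint32_t)(sector % stripe_size);
    }

    int main(void)
    {
        uint32_t stripe_size = 768;   /* 384 KiB of 512-byte sectors - not a power of two */
        uint64_t stripe;
        uint32_t offset;

        sector_to_stripe(10000, stripe_size, &stripe, &offset);
        printf("sector 10000 -> stripe %llu, offset %u\n",
               (unsigned long long)stripe, offset);        /* stripe 13, offset 16 */

        /* The shift/mask shortcut only works for power-of-two sizes: */
        printf("mask-based offset would be %u (wrong)\n",
               (unsigned)(10000 & (stripe_size - 1)));     /* 528, not 16 */
        return 0;
    }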
* bcache: Add on error panic/unregister setting (Kent Overstreet, 2013-11-10; 1 file changed, -0/+6)
  Works kind of like the ext4 setting, to panic or remount read only on errors.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Use blkdev_issue_discard() (Kent Overstreet, 2013-11-10; 1 file changed, -10/+0)
  The old asynchronous discard code was really a relic from when all the allocation code was asynchronous - now that allocation runs out of a dedicated thread there's no point in keeping around all that complicated machinery.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Fix a writeback performance regression (Kent Overstreet, 2013-09-24; 1 file changed, -4/+3)
  Background writeback works by scanning the btree for dirty data and adding those keys into a fixed size buffer, then for each dirty key in the keybuf writing it to the backing device.
  When read_dirty() finishes and it's time to scan for more dirty data, we need to wait for the outstanding writeback IO to finish - they still take up slots in the keybuf (so that foreground writes can check for them to avoid races) - without that wait, we'll continually rescan when we'll be able to add at most a key or two to the keybuf, and that takes locks that starve foreground IO. Doh.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
  Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* bcache: Allocation kthread fixes (Kent Overstreet, 2013-07-12; 1 file changed, -4/+0)
  The alloc kthread should've been using try_to_freeze() - and also there was the potential for the alloc kthread to get woken up after it had shut down, which would have been bad.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Fix a sysfs splat on shutdown (Kent Overstreet, 2013-07-12; 1 file changed, -0/+1)
  If we stopped a bcache device when we were already detaching (or something like that), bcache_device_unlink() would try to remove a symlink from sysfs that was already gone because the bcache dev kobject had already been removed from sysfs.
  So keep track of whether we've removed stuff from sysfs.
  Signed-off-by: Kent Overstreet <kmo@daterainc.com>
  Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
* bcache: Write out full stripes (Kent Overstreet, 2013-06-26; 1 file changed, -2/+1)
  Now that we're tracking dirty data per stripe, we can add two optimizations for raid5/6:
   * If a stripe is already dirty, force writes to that stripe to writeback mode - to help build up full stripes of dirty data.
   * When flushing dirty data, preferentially write out full stripes first if there are any.
  Signed-off-by: Kent Overstreet <koverstreet@google.com>
* bcache: Track dirty data by stripe (Kent Overstreet, 2013-06-26; 1 file changed, -6/+4)
  To make background writeback aware of raid5/6 stripes, we first need to track the amount of dirty data within each stripe - we do this by breaking up the existing sectors_dirty into per stripe atomic_ts.
  Signed-off-by: Kent Overstreet <koverstreet@google.com>
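  A toy sketch of the per-stripe dirty accounting described above; the names and sizes are invented for the example, and C11 atomics stand in for the kernel's atomic_t:

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative: dirty sector counts kept per stripe instead of a single
     * device-wide sectors_dirty counter. */
    #define STRIPE_SECTORS 768u        /* sectors per stripe */
    #define NR_STRIPES     4u

    static atomic_uint stripe_sectors_dirty[NR_STRIPES];

    /* Mark [sector, sector + nr) dirty, splitting the range across stripes. */
    static void mark_dirty(uint64_t sector, uint64_t nr)
    {
        while (nr) {
            uint64_t stripe    = sector / STRIPE_SECTORS;
            uint64_t in_stripe = sector % STRIPE_SECTORS;
            uint64_t n         = STRIPE_SECTORS - in_stripe;

            if (n > nr)
                n = nr;
            atomic_fetch_add(&stripe_sectors_dirty[stripe], (unsigned)n);
            sector += n;
            nr     -= n;
        }
    }

    /* Writeback can now ask "is this stripe completely dirty?" cheaply. */
    static int stripe_full(uint64_t stripe)
    {
        return atomic_load(&stripe_sectors_dirty[stripe]) == STRIPE_SECTORS;
    }

    int main(void)
    {
        mark_dirty(700, 900);   /* touches stripes 0, 1 and 2; stripe 1 becomes full */
        for (unsigned i = 0; i < NR_STRIPES; i++)
            printf("stripe %u: %u dirty sectors, full=%d\n",
                   i, atomic_load(&stripe_sectors_dirty[i]), stripe_full(i));
        return 0;
    }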
* bcache: Initialize sectors_dirty when attaching (Kent Overstreet, 2013-06-26; 1 file changed, -1/+1)
  Previously, dirty_data wouldn't get initialized until the first garbage collection... which was a bit of a problem for background writeback (as the PD controller keys off of it) and also confusing for users.
  This is also prep work for making background writeback aware of raid5/6 stripes.
  Signed-off-by: Kent Overstreet <koverstreet@google.com>
* bcache: Improve lazy sorting (Kent Overstreet, 2013-06-26; 1 file changed, -0/+1)
  The old lazy sorting code was kind of hacky - rewrite in a way that mathematically makes more sense; the idea is that the size of the sets of keys in a btree node should increase by a more or less fixed ratio from smallest to biggest.
  Signed-off-by: Kent Overstreet <koverstreet@google.com>
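  One way to picture the "fixed ratio" invariant, sketched with made-up numbers and names (this is not bcache's actual heuristic, just an illustration of when a geometric size progression would force the trailing sets to be merged):

    #include <stdio.h>

    /* Illustration only: key set sizes within a node, ordered oldest/largest
     * to newest/smallest.  The invariant is that each set should be at least
     * `ratio` times bigger than the one after it; when an insert breaks that,
     * the trailing sets get merged by a sort. */
    static unsigned sets_to_merge(const unsigned *keys, unsigned nsets, unsigned ratio)
    {
        unsigned merge_from = nsets - 1;      /* start with just the newest set */
        unsigned merged     = keys[nsets - 1];

        /* Walk towards the oldest set, pulling in any neighbour that is no
         * longer `ratio` times bigger than what we would write out. */
        while (merge_from > 0 && keys[merge_from - 1] < merged * ratio) {
            merge_from--;
            merged += keys[merge_from];
        }
        return nsets - merge_from;            /* trailing sets to sort together */
    }

    int main(void)
    {
        unsigned keys[] = { 4000, 900, 290, 100 };   /* oldest ... newest */

        /* With ratio 3 the last three sets no longer form a roughly geometric
         * progression, so they would be merged into one sorted set. */
        printf("merge the newest %u sets\n", sets_to_merge(keys, 4, 3));   /* 3 */
        return 0;
    }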
* bcache: Fix/revamp tracepoints (Kent Overstreet, 2013-06-26; 1 file changed, -20/+0)
  The tracepoints were reworked to be more sensible, and a null pointer deref in one of the tracepoints was fixed.
  Converted some of the pr_debug()s to tracepoints - this is partly a performance optimization; it used to be that without DEBUG or CONFIG_DYNAMIC_DEBUG pr_debug() was an empty macro, but at some point it was changed to an empty inline function.
  Some of the pr_debug() statements had rather expensive function calls as part of the arguments, so this code was getting run unnecessarily even on non debug kernels - in some fast paths, too.
  Signed-off-by: Kent Overstreet <koverstreet@google.com>
* bcache: Refactor btree io (Kent Overstreet, 2013-06-26; 1 file changed, -3/+2)
  The most significant change is that btree reads are now done synchronously, instead of asynchronously and doing the post read stuff from a workqueue.
  This was originally done because we can't block on IO under generic_make_request(). But - we already have a mechanism to punt cache lookups to workqueue if needed, so if we just use that we don't have to deal with the complexity of doing things asynchronously.
  The main benefit is this makes the locking situation saner; we can hold our write lock on the btree node until we're finished reading it, and we don't need that btree_node_read_done() flag anymore.
  Also, for writes, btree_write() was broken out into btree_node_write() and btree_leaf_dirty() - the old code with the boolean argument was dumb and confusing.
  The prio_blocked mechanism was improved a bit too; now the only counter is in struct btree_write, and we don't mess with transferring a count from struct btree anymore.
  This required changing garbage collection to block prios at the start and unblock when it finishes, which is cleaner than what it was doing anyways (the old code had mostly the same effect, but was doing it in a convoluted way). And the btree iter btree_node_read_done() uses was converted to a real mempool.
  Signed-off-by: Kent Overstreet <koverstreet@google.com>
* bcache: Convert allocator thread to kthread (Kent Overstreet, 2013-06-26; 1 file changed, -6/+11)
  Using a workqueue when we just want a single thread is a bit silly.
  Signed-off-by: Kent Overstreet <koverstreet@google.com>
* bcache: Fix error handling in init code (Kent Overstreet, 2013-05-15; 1 file changed, -1/+1)
  This code appears to have rotted... fix various bugs and do some refactoring.
  Signed-off-by: Kent Overstreet <koverstreet@google.com>
* bcache: Take data offset from the bdev superblock. (Kent Overstreet, 2013-04-20; 1 file changed, -10/+37)
  Add a new superblock version, and consolidate related defines.
  Signed-off-by: Gabriel de Perthuis <g2p.code+bcache@gmail.com>
  Signed-off-by: Kent Overstreet <koverstreet@google.com>
* bcache: Don't export utility code, prefix with bch_ (Kent Overstreet, 2013-03-28; 1 file changed, -1/+1)
  Signed-off-by: Kent Overstreet <koverstreet@google.com>
  Cc: linux-bcache@vger.kernel.org
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: Style/checkpatch fixes (Kent Overstreet, 2013-03-25; 1 file changed, -5/+5)
  Took out some nested functions, and fixed some more checkpatch complaints.
  Signed-off-by: Kent Overstreet <koverstreet@google.com>
  Cc: linux-bcache@vger.kernel.org
  Signed-off-by: Jens Axboe <axboe@kernel.dk>
* bcache: A block layer cache (Kent Overstreet, 2013-03-23; 1 file changed, -0/+1232)
  Does writethrough and writeback caching, handles unclean shutdown, and has a bunch of other nifty features motivated by real world usage.
  See the wiki at http://bcache.evilpiepirate.org for more.
  Signed-off-by: Kent Overstreet <koverstreet@google.com>