Commit message (Author, Age, Files, Lines)
* Btrfs: wake up transaction waiters when aborting a transaction (Josef Bacik, 2012-06-14, 2 files, -6/+7)
| | | | | | | | | | | I was getting lots of hung tasks and a NULL pointer dereference because we are not cleaning up the transaction properly when it aborts. First we need to reset the running_transaction to NULL so we don't get a bad dereference for any start_transaction callers after this. Also we cannot rely on waitqueue_active() since it's just a list_empty(), so just call wake_up() directly since that will do the barrier for us and such. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
* Btrfs: fix locking in btrfs_destroy_delayed_refs (Josef Bacik, 2012-06-14, 1 file, -13/+17)
| | | | | | | | | | | | | The transaction abort stuff was throwing warnings from the list debugging code because we do a list_del_init outside of the delayed_refs spin lock. The delayed refs locking makes baby Jesus cry so it's not hard to get wrong, but we need to take the ref head mutex to make sure it's not being processed currently, and so if it is we need to drop the spin lock and then take and drop the mutex and do the search again. If we can take the mutex then we can safely remove the head from the list and carry on. Now when the transaction aborts I don't get the list debugging warnings. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
* Btrfs: pass locked_page into extent_clear_unlock_delalloc if theres an error (Josef Bacik, 2012-06-14, 1 file, -2/+2)
| | | | | | | | | | | While doing my enospc work I got a transaction abort that resulted in a panic when we tried to unlock_page() an already unlocked page. This is because we aren't calling extent_clear_unlock_delalloc with the locked page, so it was unlocking all the pages in the range. This is wrong since __extent_writepage expects the page to still be locked unless we return *page_started as 1. This should keep us from panicking. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
* Merge branch 'for-chris' of git://git.jan-o-sch.net/btrfs-unstable into for-linus (Chris Mason, 2012-05-31, 14 files, -252/+1368)
|\ | | | | | | | | | | | | | | | | Conflicts: fs/btrfs/ulist.h Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * Btrfs: fix tree mod log rewinded level and rewinding of moved keys (Jan Schmidt, 2012-05-31, 1 file, -2/+4)
| | | | | | | | | | | | | | | | | | | | | | When we rewind REMOVE_WHILE_FREEING operations, there's code that allocates a fresh buffer instead of cloning the old one. Setting that buffer's level correctly was missing in this case. When rewinding a MOVE_KEYS operation, btrfs_node_key_ptr_offset(slot) was missing for memmove_extent_buffer()'s arguments. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: fix tree mod log del_ptr (Jan Schmidt, 2012-05-31, 1 file, -6/+7)
| | | | | | | | | | | | | | Logging for del_ptr when we're not deleting the last pointer was wrong. This fixes both the duplicate log entries and the log sequence. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: add tree_mod_dont_log helper (Jan Schmidt, 2012-05-31, 1 file, -9/+15)
| | | | | | | | | | | | Replace duplicate code with a small inline helper function. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: add missing spin_lock for insertion into tree mod log (Jan Schmidt, 2012-05-31, 1 file, -5/+18)
| | | | | | | | | | | | | | tree_mod_alloc calls __get_tree_mod_seq and must acquire a spinlock before doing so. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: add inodes before dropping the extent lock in find_all_leafs (Jan Schmidt, 2012-05-31, 3 files, -6/+43)
| | | | | | | | | | | | | | | | | | | | We must build up the inode list with the extent lock held after following indirect refs. This also requires an extension to ulists that allows modifying the stored aux value in case a key already exists in the list. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: use delayed ref sequence numbers for all fs-tree updates (Jan Schmidt, 2012-05-30, 3 files, -23/+13)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The sequence number for delayed refs is needed to postpone certain delayed refs for a very short period while walking backrefs. Before the tree modification log, we thought we'd only have to hold back those references that don't have a counter operation. Now that we have the tree mod log, we're rewinding fs tree blocks to a defined consistent state. We cannot know in advance for which tree block we'll be doing rewind operations later. Therefore, we must postpone all the delayed refs for fs-tree blocks, even those having a counter operation. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: tree mod log sanity checks in join_transaction (Jan Schmidt, 2012-05-30, 1 file, -0/+18)
| | | | | | | | | | | | | | | | | | | | | | When a fresh transaction begins, the tree mod log must be clean. Users of the tree modification log must ensure they never span across transaction boundaries. We reset the sequence to 0 in this safe situation to make absolutely sure overflow can't happen. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: fs_info variable for join_transaction (Jan Schmidt, 2012-05-30, 1 file, -18/+19)
| | | | | | | | Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: use the tree modification log for backref resolving (Jan Schmidt, 2012-05-30, 2 files, -17/+29)
| | | | | | | | | | | | | | | | This enables backref resolving on live trees while they are changing. This is a prerequisite for quota groups and just nice to have for everything else. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: add btrfs_search_old_slot (Jan Schmidt, 2012-05-30, 2 files, -4/+317)
| | | | | | | | | | | | | | | | | | The tree modification log together with the current state of the tree gives a consistent, old version of the tree. btrfs_search_old_slot is used to search through this old version and return old (dummy!) extent buffers. Naturally, this function cannot do any tree modifications. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: add del_ptr and insert_ptr modifications to the tree mod log (Jan Schmidt, 2012-05-30, 1 file, -10/+32)
| | | | | | | | | | | | | | Record all relevant modifications to block pointers in the tree mod log so that we can rewind them later on for backref walking. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: put all block modifications into the tree mod log (Jan Schmidt, 2012-05-30, 1 file, -0/+36)
| | | | | | | | | | | | | | | | | | When running functions that can make changes to the internal trees (e.g. btrfs_search_slot), we check if somebody may be interested in the block we're currently modifying. If so, we record our modification to be able to rewind it later on. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: add tree modification log functions (Jan Schmidt, 2012-05-30, 2 files, -1/+412)
| | | | | | | | | | | | | | | | | | | | | | | | The tree mod log will log modifications made to fs-tree nodes. Most modifications are done by autobalance of the tree. Such changes are recorded as long as a block entry exists. When released, the log is cleaned. With the tree modification log, it's possible to reconstruct a consistent old state of the tree. This is required to do backref walking on a busy file system. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: add tree mod log to fs_info (Jan Schmidt, 2012-05-26, 2 files, -0/+14)
| | | | | | | | Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: dummy extent buffers for tree mod log (Jan Schmidt, 2012-05-26, 2 files, -7/+76)
| | | | | | | | | | | | | | | | The tree modification log needs two ways to create dummy extent buffers: one by allocating a fresh one (to rebuild an old root) and one by cloning an existing one (to make private rewind modifications to it). Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: move struct seq_list to ctree.h (Jan Schmidt, 2012-05-26, 2 files, -5/+7)
| | | | | | | | Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: don't set for_cow parameter for tree block functions (Jan Schmidt, 2012-05-26, 5 files, -20/+20)
| | | | | | | | | | | | | | | | | | | | | | | | Three callers of btrfs_free_tree_block or btrfs_alloc_tree_block passed parameter for_cow = 1. In fact, these two functions should never mark their tree modification operations as for_cow, because they can change the number of blocks referenced by a tree. Hence, we remove the extra for_cow parameter from these functions and make them pass a zero down. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: look into the extent during find_all_leafs (Jan Schmidt, 2012-05-26, 2 files, -84/+158)
| | | | | | | | | | | | | | | | | | | | | | | | | | Before this patch we called find_all_leafs for a data extent, then called find_all_roots and then looked into the extent to grab the information we were seeking. This was done without holding the leaves locked to avoid deadlocks. However, this can obviously race with concurrent tree modifications. Instead, we now look into the extent while we're holding the lock during find_all_leafs and store this information together with the leaf list. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: bugfix: ignore the wrong key for indirect tree block backrefs (Jan Schmidt, 2012-05-26, 1 file, -50/+135)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The key we store with a tree block backref is only a hint. It is set when the ref is created and can remain correct for a long time. As the tree is rebalanced, however, eventually the key no longer points to the correct destination. With this patch, we change find_parent_nodes to no longer add keys unless it knows for sure they're correct (e.g. because they're for an extent data backref). Then when we later encounter a backref with no parent and no key set, we grab the block and take the first key from the block itself. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: bugfix in btrfs_find_parent_nodes (Jan Schmidt, 2012-05-26, 1 file, -2/+3)
| | | | | | | | | | | | | | | | | | That one has been around since the addition of backref.c. Due to the way we calculate our slot numbers, after adding inline refs we're missing one keyed ref unless it's located at the beginning of a new leaf. Reported-by: Alexander Block <ablock84@googlemail.com> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
| * Btrfs: ulist realloc bugfix (Jan Schmidt, 2012-05-26, 3 files, -21/+29)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | ulist_next gets the pointer to the previously returned element to find the next element from there. However, when we call ulist_add while iteration with ulist_next is in progress (ulist explicitly supports this), we can realloc the ulist internal memory, which makes the pointer to the previous element useless. Instead, we now use an iterator parameter that's independent from the internal pointers. Reported-by: Alexander Block <ablock84@googlemail.com> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
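The iterator fix described in this entry can be illustrated with a minimal userspace sketch. Everything here is hypothetical (`simple_ulist`, `simple_ulist_iter` and the helper names are not the kernel's ulist API, and unlike a real ulist this version does not de-duplicate). It only shows why an iterator that stores an index keeps working when an add triggers `realloc()` in the middle of an iteration, whereas a saved element pointer would dangle.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified list; NOT the kernel's struct ulist. */
struct simple_ulist {
    uint64_t *vals;
    size_t len;
    size_t cap;
};

/* The iterator stores an index, so it stays valid across realloc(). */
struct simple_ulist_iter {
    size_t i;
};

/* Append val, growing the array if needed. Returns 0, or -1 on ENOMEM. */
static int simple_ulist_add(struct simple_ulist *ul, uint64_t val)
{
    assert(ul != NULL);
    if (ul->len == ul->cap) {
        size_t ncap = ul->cap ? ul->cap * 2 : 2;
        uint64_t *n = realloc(ul->vals, ncap * sizeof(*n));

        if (!n)
            return -1;
        ul->vals = n; /* the old element addresses may now be invalid */
        ul->cap = ncap;
    }
    ul->vals[ul->len++] = val;
    return 0;
}

/* Store the next element in *out and return 1, or return 0 at the end. */
static int simple_ulist_next(const struct simple_ulist *ul,
                             struct simple_ulist_iter *it, uint64_t *out)
{
    assert(ul != NULL && it != NULL && out != NULL);
    if (it->i >= ul->len)
        return 0;
    *out = ul->vals[it->i++];
    return 1;
}
```

A caller would zero-initialize both structs, then loop on `simple_ulist_next()` and may call `simple_ulist_add()` inside the loop body; elements added during the walk are visited later in the same walk.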
* | Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into HEAD (Chris Mason, 2012-05-30, 29 files, -616/+1483)
|\ \
| * | Btrfs: fix false positive in check-integrity on unmount (Stefan Behrens, 2012-05-30, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | During unmount, it could happen that the integrity checker printed a warning message "attempt to free ... on umount which is not yet iodone" which turned out to be a false positive. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
| * | Btrfs: fix runtime warning in check-integrity check data mode (Stefan Behrens, 2012-05-30, 1 file, -3/+22)
| | | | | | | | | | | | | | | | | | | | | | | | | | | If a file_extent_item was located at the very end of a leaf and there was not enough space to hold a full item, but there was enough space to hold one of type BTRFS_FILE_EXTENT_INLINE or PREALLOC, and it was only such a short item, a warning was printed anyway. This check is now fixed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
| * | Btrfs: set ioprio of scrub readahead to idle (Stefan Behrens, 2012-05-30, 2 files, -0/+8)
| | | | | | | | | | | | | | | | | | | | | | | | Reduce ioprio class of scrub readahead threads to idle priority. This setting is fixed. This priority has shown the best performance during all measurements. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
| * | Btrfs: fix return code in drop_objectid_items (Josef Bacik, 2012-05-30, 1 file, -0/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So dpkg fsync()'s the file and the directory containing the file whenever it writes to a file, which is really slow in btrfs. This is partly because fsync()'ing a directory _always_ committed the transaction instead of just going to the tree log. This is because drop_objectid_items() would return 1 since it does a btrfs_search_slot() which returns 1. In tree-log jargon this means that we have to commit the transaction to be safe. So just check if ret is greater than 0 and set it to 0 if it is. With this patch we now use the tree-log instead of committing the entire transaction, which is twice as fast on my box. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
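The return-code handling described in this entry can be sketched in plain C. This is an assumption-laden illustration, not the kernel code: `fake_search()` is a hypothetical stand-in for a tree search that returns 0 on an exact match, 1 when the key is absent, and a negative errno on failure, and `drop_items_sketch()` shows a positive "not found" result being normalized to 0 so that callers do not mistake it for "commit the transaction".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a tree search:
 * 0 = exact match, 1 = key not found, negative errno = real error.
 */
static int fake_search(const int *keys, int n, int key)
{
    assert(n >= 0);
    if (!keys)
        return -EINVAL;
    for (int i = 0; i < n; i++)
        if (keys[i] == key)
            return 0;
    return 1;
}

/* Sketch of the fix: a positive search result is not a failure. */
static int drop_items_sketch(const int *keys, int n, int key)
{
    int ret = fake_search(keys, n, key);

    /* "Not found" only means there is nothing left to drop. */
    if (ret > 0)
        ret = 0;
    return ret; /* 0 on success, negative errno on error */
}
```

Real errors still propagate as negative values; only the informational positive result is swallowed.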
| * | Btrfs: check to see if the inode is in the log before fsyncing (Josef Bacik, 2012-05-30, 3 files, -17/+16)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have this check down in the actual logging code, but this is after we start a transaction and all that good stuff. So move the helper inode_in_log() out so we can call it in fsync() and avoid starting a transaction altogether and just exit if we've already fsync()'ed this file recently. You would notice this issue if you fsync()'ed a file over and over again until the transaction committed. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
| * | Btrfs: return value of btrfs_read_buffer is checked correctly (Tsutomu Itoh, 2012-05-30, 2 files, -4/+18)
| | | | | | | | | | | | | | | | | | | | | | | | btrfs_read_buffer() can return an error. Therefore, add code that checks the return value of btrfs_read_buffer(). Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
| * | Btrfs: read device stats on mount, write modified ones during commit (Stefan Behrens, 2012-05-30, 6 files, -0/+232)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The device statistics are written into the device tree with each transaction commit. Only modified statistics are written. When a filesystem is mounted, the device statistics for each involved device are read from the device tree and used to initialize the counters. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
| * | Btrfs: add ioctl to get and reset the device stats (Stefan Behrens, 2012-05-30, 4 files, -0/+77)
| | | | | | | | | | | | | | | | | | | | | An ioctl interface is added to get the device statistic counters. A second ioctl is added to atomically get and reset these counters. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
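The difference between a plain get and an atomic get-and-reset can be shown with C11 atomics. This is a userspace sketch under stated assumptions: the helper names are hypothetical and the commit's actual ioctl plumbing is not reproduced. The point is that `atomic_exchange()` reads the old value and stores zero in one step, so an error counted concurrently cannot be lost between the read and the reset.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Count one more error on a hypothetical per-device counter. */
static void stat_inc(atomic_uint *c)
{
    assert(c != NULL);
    atomic_fetch_add(c, 1);
}

/* Plain read: the counter keeps its value. */
static unsigned int stat_get(atomic_uint *c)
{
    assert(c != NULL);
    return atomic_load(c);
}

/* Read and zero in a single atomic step; returns the old value. */
static unsigned int stat_get_and_reset(atomic_uint *c)
{
    assert(c != NULL);
    return atomic_exchange(c, 0);
}
```

A separate load followed by a store of zero would have a window in which an increment is silently discarded, which is why the exchange is used in this sketch.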
| * | Btrfs: add device counters for detected IO and checksum errors (Stefan Behrens, 2012-05-30, 6 files, -24/+230)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The goal is to detect when drives start to show an increased error rate and should be replaced soon. Therefore statistic counters are added that count IO errors (read, write and flush). Additionally, software-detected errors such as checksum errors and corrupted blocks are counted. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
| * | btrfs: Drop unused function btrfs_abort_devices() (Asias He, 2012-05-30, 2 files, -14/+0)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1) This function is not used anywhere. 2) Using the blk_abort_queue() to abort the queue seems not correct. blk_abort_queue() is used for timeout handling (block/blk-timeout.c). Cc: Chris Mason <chris.mason@oracle.com> Cc: linux-btrfs@vger.kernel.org Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-kernel@vger.kernel.org Signed-off-by: Asias He <asias@redhat.com>
| * | Btrfs: fix the same inode id problem when doing auto defragment (Miao Xie, 2012-05-30, 1 file, -10/+39)
| | | | | | | | | | | | | | | | | | | | | | | | Two files in different subvolumes may have the same inode id, so the rb-tree which is used to manage the defragment objects must take this into account. This patch fixes this problem. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
| * | Btrfs: fall back to non-inline if we don't have enough space (Josef Bacik, 2012-05-30, 1 file, -1/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | If cow_file_range_inline fails with ENOSPC we abort the transaction which isn't very nice. This really shouldn't be happening anyways but there's no sense in making it a horrible error when we can easily just go allocate normal data space for this stuff. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
| * | Btrfs: fix how we deal with the orphan block rsv (Josef Bacik, 2012-05-30, 4 files, -22/+24)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ceph was hitting this race where we would remove an inode from the per-root orphan list before we would release the space we had reserved for the inode. We actually don't need a list or anything, we just need to make sure the root doesn't try to free up the orphan reserve until after the inodes have released their reservations. So use an atomic counter instead of a list on the root and only decrement the counter after we've released our reservation. I've tested this as well as several others and we no longer see the warnings that you would see while running ceph. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
| * | Btrfs: convert the inode bit field to use the actual bit operations (Josef Bacik, 2012-05-30, 6 files, -44/+44)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While I was working on an orphan problem, Miao pointed out that messing with a bitfield where different ranges are protected by different locks doesn't work out right. Turns out we've been doing this forever, with different parts of the bit field protected by either no lock at all or different locks, which could cause all sorts of weird problems including the issue I was hitting. So instead make a runtime_flags thing that we use the normal bit operations on, which are all atomic, so we can keep having our no/different locking for the different flags, and then make force_compress its own thing so it can be treated normally. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
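The problem with C bitfields is that updating one field is a read-modify-write of the whole containing word, so two writers under different locks can overwrite each other's bits. A userspace analogue of the atomic-flags approach, using C11 atomics in place of the kernel's bit operations, might look like the following. The struct and flag names are hypothetical and are not the ones used in the commit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag numbers, for illustration only. */
enum { FLAG_ORPHAN = 0, FLAG_IN_DEFRAG = 1, FLAG_DUMMY = 2 };

/* One atomic word replaces several one-bit bitfield members. */
struct inode_sketch {
    atomic_ulong runtime_flags;
};

static unsigned long flag_mask(unsigned int bit)
{
    assert(bit < 8 * sizeof(unsigned long));
    return 1UL << bit;
}

/* Each helper is a single atomic read-modify-write on the word. */
static void flag_set(struct inode_sketch *in, unsigned int bit)
{
    atomic_fetch_or(&in->runtime_flags, flag_mask(bit));
}

static void flag_clear(struct inode_sketch *in, unsigned int bit)
{
    atomic_fetch_and(&in->runtime_flags, ~flag_mask(bit));
}

static bool flag_test(struct inode_sketch *in, unsigned int bit)
{
    return (atomic_load(&in->runtime_flags) & flag_mask(bit)) != 0;
}

/* Set the bit and report whether it was already set. */
static bool flag_test_and_set(struct inode_sketch *in, unsigned int bit)
{
    unsigned long old = atomic_fetch_or(&in->runtime_flags, flag_mask(bit));

    return (old & flag_mask(bit)) != 0;
}
```

Because every update is one atomic operation on the shared word, flags guarded by different locks (or by none) no longer clobber each other.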
| * | Btrfs: merge contigous regions when loading free space cache (Josef Bacik, 2012-05-30, 1 file, -0/+41)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we write out the free space cache we will write out everything that is in our in memory tree, and then we will just walk the pinned extents tree and write anything we see there. The problem with this is that during normal operations the pinned extents will be merged back into the free space tree normally, and then we can allocate space from the merged areas and commit them to the tree log. If we crash and replay the tree log we will crash again because the tree log will try to free up space from what looks like 2 separate but contiguous entries, since one entry is from the original free space cache and the other was a pinned extent that was merged back. To fix this we just need to walk the free space tree after we load it and merge contiguous entries back together. This will keep the tree log stuff from breaking and it will make the allocator behave more nicely. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
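The merge step described in this entry can be sketched on a sorted array. This is a simplification under stated assumptions: the free space cache described above is a tree, while the hypothetical `free_extent` array and `merge_contiguous()` helper below only show the core idea of folding an entry into its predecessor when the predecessor ends exactly where the entry starts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-space entry: [offset, offset + bytes). */
struct free_extent {
    uint64_t offset;
    uint64_t bytes;
};

/*
 * Merge adjacent entries of a sorted, non-overlapping array in place.
 * Returns the number of entries left after merging.
 */
static size_t merge_contiguous(struct free_extent *e, size_t n)
{
    size_t out = 0;

    assert(e != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (out > 0 && e[out - 1].offset + e[out - 1].bytes == e[i].offset)
            e[out - 1].bytes += e[i].bytes; /* extend the previous entry */
        else
            e[out++] = e[i]; /* gap: start a new entry */
    }
    return out;
}
```

After the pass, a range that was stored as two back-to-back entries is represented by one entry, so a later free that spans both halves matches a single record.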
| * | Btrfs: do not do balance in readonly mode (Liu Bo, 2012-05-30, 1 file, -3/+9)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In normal cases, we would not be allowed to do balance in RO mode. However, when we're using a seeding device and adding another device to sprout, things will change: $ mkfs.btrfs /dev/sdb7 $ btrfstune -S 1 /dev/sdb7 $ mount /dev/sdb7 /mnt/btrfs -o ro $ btrfs fi bal /mnt/btrfs -----------------------> fail. $ btrfs dev add /dev/sdb8 /mnt/btrfs $ btrfs fi bal /mnt/btrfs -----------------------> works! It should not be designed as an exception, and we'd better add another check for mnt flags. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@redhat.com>
| * | Btrfs: use fastpath in extent state ops as much as possible (Liu Bo, 2012-05-30, 1 file, -26/+18)
| | | | | | | | | | | | | | | | | | | | | | | | Fully utilize our extent state's new helper functions to use fastpath as much as possible. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@redhat.com>
| * | Btrfs: fix wrong error returned by adding a device (Liu Bo, 2012-05-30, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reproduce: $ mkfs.btrfs /dev/sdb7 $ mount /dev/sdb7 /mnt/btrfs -o ro $ btrfs dev add /dev/sdb8 /mnt/btrfs ERROR: error adding the device '/dev/sdb8' - Invalid argument Since we mount with readonly options, and /dev/sdb7 is not a seeding one, a readonly notification is preferred. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@redhat.com>
| * | Btrfs: finish ordered extents in their own thread (Josef Bacik, 2012-05-30, 7 files, -191/+164)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We noticed that the ordered extent completion doesn't really rely on having a page and that it could be done independently of ending the writeback on a page. This patch makes us not do the threaded endio stuff for normal buffered writes and direct writes so we can end page writeback as soon as possible (in irq context) and only start threads to do the ordered work when it is actually done. Compression needs to be reworked some to take advantage of this as well, but atm it has to do a find_get_page in its endio handler so it must be done in its own thread. This makes direct writes quite a bit faster. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
| * | Btrfs: do not check delalloc when updating disk_i_size (Josef Bacik, 2012-05-30, 1 file, -16/+3)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We are checking delalloc to see if it is ok to update the i_size. There are 2 cases it stops us from updating 1) If there is delalloc between our current disk_i_size and this ordered extent 2) If there is delalloc between our current ordered extent and the next ordered extent These tests are racy however since we can set delalloc for these ranges at any time. Also for the first case if we notice there is delalloc between disk_i_size and our ordered extent we will not update disk_i_size and assume that when that delalloc bit gets written out it will update everything properly. However if we crash before that we will have file extents outside of our i_size, which is not good, so this test is dangerous as well as racy. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>
| * | Btrfs: avoid buffer overrun in mount option handling (Jim Meyering, 2012-05-30, 1 file, -41/+26)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is an off-by-one error: allocating room for a maximal result string but without room for a trailing NUL. That can lead to returning a transformed string that is not NUL-terminated, and then to a caller reading beyond the end of the malloc'd buffer. Rewrite to s/kzalloc/kmalloc/, remove unwarranted use of strncpy (the result is guaranteed to fit), remove dead strlen at end, and change a few variable names and comments. Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Jim Meyering <meyering@redhat.com>
| * | Btrfs: NUL-terminate path buffer in DEV_INFO ioctl result (Jim Meyering, 2012-05-30, 1 file, -2/+4)
| | | | | | | | | | | | | | | | | | | | | A device with name of length BTRFS_DEVICE_PATH_NAME_MAX or longer would not be NUL-terminated in the DEV_INFO ioctl result buffer. Signed-off-by: Jim Meyering <meyering@redhat.com>
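The termination problem described in this entry comes from `strncpy()`, which does not write a NUL byte when the source is at least as long as the limit. A minimal sketch of the usual remedy follows; `PATH_MAX_SKETCH` is a hypothetical, deliberately tiny stand-in for BTRFS_DEVICE_PATH_NAME_MAX, and `copy_path()` is an illustrative helper rather than the code from the commit.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical small limit so that truncation is easy to observe. */
#define PATH_MAX_SKETCH 8

/*
 * Copy src into a fixed-size buffer and always NUL-terminate it.
 * An over-long source is truncated to PATH_MAX_SKETCH - 1 characters.
 */
static void copy_path(char dst[PATH_MAX_SKETCH], const char *src)
{
    assert(dst != NULL && src != NULL);
    strncpy(dst, src, PATH_MAX_SKETCH);
    dst[PATH_MAX_SKETCH - 1] = '\0'; /* strncpy alone may leave this unset */
}
```

Without the explicit store of the terminator, a reader of the buffer that relies on `strlen()` or `%s` would run past its end for any name of `PATH_MAX_SKETCH` characters or more.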
| * | Btrfs: avoid buffer overrun in btrfs_printk (Jim Meyering, 2012-05-30, 1 file, -1/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The buffer read-overrun would be triggered by a printk format starting with <N>, where N is a single digit. NUL-terminate after strncpy. Use memcpy, not strncpy, since we know the string we're copying fits in the destination buffer and contains no NUL byte. Signed-off-by: Jim Meyering <meyering@redhat.com>
| * | Fix minor type issues (Daniel J Blueman, 2012-05-30, 3 files, -6/+5)
| | | | | | | | | | | | | | | | | | Address some minor type issues identified by sparse checker. Signed-off-by: Daniel J Blueman <daniel@quora.org>