summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* Btrfs: fix max dir item size calculationFilipe David Borba Manana2014-01-281-1/+1
| | | | | | | | | | We were accounting for sizeof(struct btrfs_item) twice, once in the data_size variable and another time in the if statement below. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: more efficient extent state insertionsFilipe David Borba Manana2014-01-281-21/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we do 2 traversals of an inode's extent_io_tree before inserting an extent state structure: 1 to see if a matching extent state already exists and 1 to do the insertion if the fist traversal didn't found such extent state. This change just combines those tree traversals into a single one. While running sysbench tests (random writes) I captured the number of elements in extent_io_tree trees for a while (into a procfs file backed by a seq_list from seq_file module) and got this histogram: Count: 9310 Range: 51.000 - 21386.000; Mean: 11785.243; Median: 18743.500; Stddev: 8923.688 Percentiles: 90th: 20985.000; 95th: 21155.000; 99th: 21369.000 51.000 - 93.933: 693 ######## 93.933 - 172.314: 938 ########## 172.314 - 315.408: 856 ######### 315.408 - 576.646: 95 # 576.646 - 6415.830: 888 ########## 6415.830 - 11713.809: 1024 ########### 11713.809 - 21386.000: 4816 ##################################################### So traversing such trees can take some significant time that can easily be avoided. Ran the following sysbench tests, 5 times each, for sequential and random writes, and got the following results: sysbench --test=fileio --file-num=1 --file-total-size=2G \ --file-test-mode=seqwr --num-threads=16 --file-block-size=65536 \ --max-requests=0 --max-time=60 --file-io-mode=sync sysbench --test=fileio --file-num=1 --file-total-size=2G \ --file-test-mode=rndwr --num-threads=16 --file-block-size=65536 \ --max-requests=0 --max-time=60 --file-io-mode=sync Before this change: sequential writes: 69.28Mb/sec (average of 5 runs) random writes: 4.14Mb/sec (average of 5 runs) After this change: sequential writes: 69.91Mb/sec (average of 5 runs) random writes: 5.69Mb/sec (average of 5 runs) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: add missing extent state caching callsFilipe David Borba Manana2014-01-281-1/+3
| | | | | | | | | | When we didn't find a matching extent state, we inserted a new one but didn't cache it in the **cached_state parameter, which makes a subsequent call do a tree lookup to get it. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: faster and more efficient extent map insertionFilipe David Borba Manana2014-01-281-31/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before this change, adding an extent map to the extent map tree of an inode required 2 tree nevigations: 1) doing a tree navigation to search for an existing extent map starting at the same offset or an extent map that overlaps the extent map we want to insert; 2) Another tree navigation to add the extent map to the tree (if the former tree search didn't found anything). This change just merges these 2 steps into a single one. While running first few btrfs xfstests I had noticed these trees easily had a few hundred elements, and then with the following sysbench test it reached over 1100 elements very often. Test: sysbench --test=fileio --file-num=32 --file-total-size=10G \ --file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \ --max-requests=1000000 --file-io-mode=sync [prepare|run] (fs created with mkfs.btrfs -l 4096 -f /dev/sdb3 before each sysbench prepare phase) Before this patch: run 1 - 41.894Mb/sec run 2 - 40.527Mb/sec run 3 - 40.922Mb/sec run 4 - 49.433Mb/sec run 5 - 40.959Mb/sec average - 42.75Mb/sec After this patch: run 1 - 48.036Mb/sec run 2 - 50.21Mb/sec run 3 - 50.929Mb/sec run 4 - 46.881Mb/sec run 5 - 53.192Mb/sec average - 49.85Mb/sec Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: fix extent boundary check in bio_readpage_errorFilipe David Borba Manana2014-01-281-1/+1
| | | | | | Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: try harder to avoid btree node splitsFilipe David Borba Manana2014-01-281-6/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When attempting to move items from our target leaf to its neighbor leaves (right and left), we only need to free data_size - free_space bytes from our leaf in order to add the new item (which has size of data_size bytes). Therefore attempt to move items to the right and left leaves if they have at least data_size - free_space bytes free, instead of data_size bytes free. After 5 runs of the following test, I got a smaller number of btree node splits overall: sysbench --test=fileio --file-num=512 --file-total-size=5G \ --file-test-mode=seqwr --num-threads=512 \ --file-block-size=8192 --max-requests=100000 --file-io-mode=sync Before this change: * 6171 splits (average of 5 test runs) * 61.508Mb/sec of throughput (average of 5 test runs) After this change: * 6036 splits (average of 5 test runs) * 63.533Mb/sec of throughput (average of 5 test runs) An ideal test would not just have multiple threads/processes writing to a file (insertion of file extent items) but also do other operations that result in insertion of items with varied sizes, like file/directory creations, creation of links, symlinks, xattrs, etc. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: avoid unnecessary ordered extent cache resetsFilipe David Borba Manana2014-01-281-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After an ordered extent completes, don't blindly reset the inode's ordered tree last accessed ordered extent pointer. While running the xfstests I noticed that about 29% of the time the ordered extent to which tree->last pointed was not the same as our just completed ordered extent. After that I ran the following sysbench test (after a prepare phase) and noticed that about 68% of the time tree->last pointed to a different ordered extent too. sysbench --test=fileio --file-num=32 --file-total-size=4G \ --file-test-mode=rndwr --num-threads=512 \ --file-block-size=32768 --max-time=60 --max-requests=0 run Therefore reset tree->last on ordered extent removal only if it pointed to the ordered extent we're removing from the tree. Results from 4 runs of the following test before and after applying this patch: $ sysbench --test=fileio --file-num=32 --file-total-size=4G \ --file-test-mode=seqwr --num-threads=512 \ --file-block-size=32768 --max-time=60 --file-io-mode=sync prepare $ sysbench --test=fileio --file-num=32 --file-total-size=4G \ --file-test-mode=seqwr --num-threads=512 \ --file-block-size=32768 --max-time=60 --file-io-mode=sync run Before this path: run 1 - 64.049Mb/sec run 2 - 63.455Mb/sec run 3 - 64.656Mb/sec run 4 - 63.833Mb/sec After this patch: run 1 - 66.149Mb/sec run 2 - 68.459Mb/sec run 3 - 66.338Mb/sec run 4 - 66.176Mb/sec With random writes (--file-test-mode=rndwr) I had huge fluctuations on the results (+- 35% easily). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: fix leaks during sysfs teardownJeff Mahoney2014-01-281-60/+73
| | | | | | | | | | | | | | | | | | | | | | | Filipe noticed that we were leaking the features attribute group after umount. His fix of just calling sysfs_remove_group() wasn't enough since that removes just the supported features and not the unsupported features. This patch changes the unknown feature handling to add them individually so we can skip the kmalloc and uses the same iteration to tear them down later. We also fix the error handling during mount so that we catch the failing creation of the per-super kobject, and handle proper teardown of a half-setup sysfs context. Tested properly with kmemleak enabled this time. Reported-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Tested-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: fix static checker warningsJeff Mahoney2014-01-281-2/+2
| | | | | | | | | | | This patch fixes the following warnings: fs/btrfs/extent-tree.c:6201:12: sparse: symbol 'get_raid_name' was not declared. Should it be static? fs/btrfs/extent-tree.c:8430:9: error: format not a string literal and no format arguments [-Werror=format-security] get_raid_name(index)); Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: fix very slow inode eviction and fs unmountFilipe David Borba Manana2014-01-281-14/+84
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The inode eviction can be very slow, because during eviction we tell the VFS to truncate all of the inode's pages. This results in calls to btrfs_invalidatepage() which in turn does calls to lock_extent_bits() and clear_extent_bit(). These calls result in too many merges and splits of extent_state structures, which consume a lot of time and cpu when the inode has many pages. In some scenarios I have experienced umount times higher than 15 minutes, even when there's no pending IO (after a btrfs fs sync). A quick way to reproduce this issue: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ cd /mnt/btrfs $ sysbench --test=fileio --file-num=128 --file-total-size=16G \ --file-test-mode=seqwr --num-threads=128 \ --file-block-size=16384 --max-time=60 --max-requests=0 run $ time btrfs fi sync . FSSync '.' real 0m25.457s user 0m0.000s sys 0m0.092s $ cd .. $ time umount /mnt/btrfs real 1m38.234s user 0m0.000s sys 1m25.760s The same test on ext4 runs much faster: $ mkfs.ext4 /dev/sdb3 $ mount /dev/sdb3 /mnt/ext4 $ cd /mnt/ext4 $ sysbench --test=fileio --file-num=128 --file-total-size=16G \ --file-test-mode=seqwr --num-threads=128 \ --file-block-size=16384 --max-time=60 --max-requests=0 run $ sync $ cd .. $ time umount /mnt/ext4 real 0m3.626s user 0m0.004s sys 0m3.012s After this patch, the unmount (inode evictions) is much faster: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ cd /mnt/btrfs $ sysbench --test=fileio --file-num=128 --file-total-size=16G \ --file-test-mode=seqwr --num-threads=128 \ --file-block-size=16384 --max-time=60 --max-requests=0 run $ time btrfs fi sync . FSSync '.' real 0m26.774s user 0m0.000s sys 0m0.084s $ cd .. $ time umount /mnt/btrfs real 0m1.811s user 0m0.000s sys 0m1.564s Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: improve forever loop when doing balance relocationWang Shilong2014-01-281-38/+36
| | | | | | | | | | | | | | | | | | | | | We hit a forever loop when doing balance relocation,the reason is that we firstly reserve 4M(node size is 16k).and within transaction we will try to add extra reservation for snapshot roots,this will return -EAGAIN if there has been a thread flushing space to reserve space.We will do this again and again with filesystem becoming nearly full. If the above '-EAGAIN' case happens, we try to refill reservation more outsize of transaction, and this will return eariler in enospc case,however, this dosen't really hurt because it makes no sense doing balance relocation with the filesystem nearly full. Miao Xie helped a lot to track this issue, thanks. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: fix ordered extent check in btrfs_punch_holeFilipe David Borba Manana2014-01-281-1/+1
| | | | | | | | | | | If the ordered extent's last byte was 1 less than our region's start byte, we would unnecessarily wait for the completion of that ordered extent, because it doesn't intersect our target range. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: fix the reserved space leak caused by the race between nonlock dio ↵Miao Xie2014-01-281-47/+84
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | and buffered io When we ran sysbench on the fs with compression, the following WARN_ONs were triggered: fs/btrfs/inode.c:7829 WARN_ON(BTRFS_I(inode)->outstanding_extents); fs/btrfs/inode.c:7830 WARN_ON(BTRFS_I(inode)->reserved_extents); fs/btrfs/inode.c:7832 WARN_ON(BTRFS_I(inode)->csum_bytes); Steps to reproduce: # mkfs.btrfs -f <dev> # mount -o compress <dev> <mnt> # cd <mnt> # sysbench --test=fileio --num-threads=8 --file-total-size=8G \ > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \ > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \ > --file-test-mode=sync prepare # cd - # umount <mnt> # mount -o compress <dev> <mnt> # cd <mnt> # sysbench --test=fileio --num-threads=8 --file-total-size=8G \ > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \ > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \ > --file-test-mode=sync run # cd - # umount <mnt> The reason of this problem is: Task0 Task1 btrfs_direct_IO unlock(&inode->i_mutex) lock(&inode->i_mutex) reserve_space() prepare_pages() lock_extent() clear_extent() unlock_extent() lock_extent() test_extent(uptodate) return false copy_data() set_delalloc_extent() extent need compress go back to buffered write clear_extent(DELALLOC | DIRTY) unlock_extent() Task 0 and 1 wrote the same place, and task0 cleared the delalloc flag which was set by task1, it made the dirty pages in that extents couldn't be flushed into the disk, so the reserved space for that extent was not released at the end. This patch fixes the above bug by unlocking the extent after the delalloc. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: cleanup unnecessary parameter and variant of prepare_pages()Miao Xie2014-01-281-8/+6
| | | | | | | | | | | - the caller has gotten the inode object, needn't pass the file object. And if so, we needn't define a inode pointer variant. - the position should be aligned by the page size not sector size, so we also needn't pass the root object into prepare_pages(). Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: replace BUG in can_modify_featureDavid Sterba2014-01-281-1/+3
| | | | | | | | | | We don't need to crash hard here, it's just reading a sysfs file. The values considered in switch are from a fixed set, the default case should not happen at all. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: reserve no transaction units in btrfs_feature_attr_storeDavid Sterba2014-01-281-1/+1
| | | | | | | | | | Added in patch "btrfs: add ability to change features via sysfs", modifications to superblock don't need to reserve metadata blocks when starting a transaction. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: make btrfs_debug match pr_debug handling related to DEBUGFrank Holton2014-01-281-0/+6
| | | | | | | | | | The kernel macro pr_debug is defined as a empty statement when DEBUG is not defined. Make btrfs_debug match pr_debug to avoid spamming the kernel log with debug messages Signed-off-by: Frank Holton <fholton@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: cleanup: removed unused 'btrfs_get_inode_ref_index'Sergei Trofimovich2014-01-282-71/+0
| | | | | | | | | | | | Found by uselex.rb: > btrfs_get_inode_ref_index: [R]: exported from: fs/btrfs/inode-item.o fs/btrfs/btrfs.o fs/btrfs/built-in.o Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Reviewed-by: David Stebra <dsterba@suse.cz> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: expand btrfs_find_item() to include find_orphan_item functionalityKelley Nielsen2014-01-284-35/+17
| | | | | | | | | | | | | | | | | | This is the third step in bootstrapping the btrfs_find_item interface. The function find_orphan_item(), in orphan.c, is similar to the two functions already replaced by the new interface. It uses two parameters, which are already present in the interface, and is nearly identical to the function brought in in the previous patch. Replace the two calls to find_orphan_item() with calls to btrfs_find_item(), with the defined objectid and type that was used internally by find_orphan_item(), a null path, and a null key. Add a test for a null path to btrfs_find_item, and if it passes, allocate and free the path. Finally, remove find_orphan_item(). Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: expand btrfs_find_item() to include find_root_ref functionalityKelley Nielsen2014-01-283-20/+11
| | | | | | | | | | | | | | | | | | | | This patch is the second step in bootstrapping the btrfs_find_item interface. The btrfs_find_root_ref() is similar to the former __inode_info(); it accepts four of its parameters, and duplicates the first half of its functionality. Replace the one former call to btrfs_find_root_ref() with a call to btrfs_find_item(), along with the defined key type that was used internally by btrfs_find_root ref, and a null found key. In btrfs_find_item(), add a test for the null key at the place where the functionality of btrfs_find_root_ref() ends; btrfs_find_item() then returns if the test passes. Finally, remove btrfs_find_root_ref(). Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Suggested-by: Zach Brown <zab@redhat.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: bootstrap generic btrfs_find_item interfaceKelley Nielsen2014-01-283-36/+43
| | | | | | | | | | | | | | | | | | | | | | | There are many btrfs functions that manually search the tree for an item. They all reimplement the same mechanism and differ in the conditions that they use to find the item. __inode_info() is one such example. Zach Brown proposed creating a new interface to take the place of these functions. This patch is the first step to creating the interface. A new function, btrfs_find_item, has been added to ctree.c and prototyped in ctree.h. It is identical to __inode_info, except that the order of the parameters has been rearranged to more closely those of similar functions elsewhere in the code (now, root and path come first, then the objectid, offset and type, and the key to be filled in last). __inode_info's callers have been set to call this new function instead, and __inode_info itself has been removed. Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Suggested-by: Zach Brown <zab@redhat.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: fix unused variables in qgroup.cValentina Giusti2014-01-281-9/+2
| | | | | | | | | | Use otherwise unused local variables slot in update_qgroup_limit_item and in update_qgroup_info_item, and remove unused variable ins from btrfs_qgroup_account_ref. Signed-off-by: Valentina Giusti <valentina.giusti@microon.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: replace path->slots[0] with otherwise unused variable 'slot'Valentina Giusti2014-01-281-2/+2
| | | | | | Signed-off-by: Valentina Giusti <valentina.giusti@microon.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: remove unused variable from scrub_fixup_nodatasumValentina Giusti2014-01-281-2/+0
| | | | | | Signed-off-by: Valentina Giusti <valentina.giusti@microon.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: remove unused variable from setup_cluster_no_bitmapValentina Giusti2014-01-281-2/+0
| | | | | | | | | | The variable window_start in setup_cluster_no_bitmap is not used since commit 1bb91902dc90e25449893e693ad45605cb08fbe5 (Btrfs: revamp clustered allocation logic) Signed-off-by: Valentina Giusti <valentina.giusti@microon.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: remove unused variables from extent_io.cValentina Giusti2014-01-281-7/+0
| | | | | | | | | | Remove unused variables: * tree from end_bio_extent_writepage, * item from extent_fiemap. Signed-off-by: Valentina Giusti <valentina.giusti@microon.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: remove unused variable from find_free_extentValentina Giusti2014-01-281-2/+0
| | | | | | | | | | The variable found_uncached_bg in find_free_extent is not used since commit 285ff5af6ce358e73f53b55c9efadd4335f4c2ff (Btrfs: remove the ideal caching code) Signed-off-by: Valentina Giusti <valentina.giusti@microon.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: remove unused variables from disk-io.cValentina Giusti2014-01-281-11/+0
| | | | | | | | | | | | | Remove unused variables: * tree from csum_dirty_buffer, * tree from btree_readpage_end_io_hook, * tree from btree_writepages, * bytenr from btrfs_create_tree, * fs_info from end_workqueue_fn. Signed-off-by: Valentina Giusti <valentina.giusti@microon.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: remove unused variable from btrfs_new_inodeValentina Giusti2014-01-281-6/+0
| | | | | | | | | | Variable owner in btrfs_new_inode is unused since commit d82a6f1d7e8b61ed5996334d0db66651bb43641d (Btrfs: kill BTRFS_I(inode)->block_group) Signed-off-by: Valentina Giusti <valentina.giusti@microon.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: publish fs label in sysfsJeff Mahoney2014-01-281-0/+44
| | | | | | | | This adds a writeable attribute which describes the label. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: publish device membership in sysfsJeff Mahoney2014-01-282-0/+33
| | | | | | | | | | Now that we have the infrastructure for per-super attributes, we can publish device membership in /sys/fs/btrfs/<fsid>/devices. The information is published as symlinks to the block devices. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: publish allocation data in sysfsJeff Mahoney2014-01-284-5/+238
| | | | | | | | | | While trying to debug ENOSPC issues, it's helpful to understand what the kernel's view of the available space is. We export this information via ioctl, but sysfs files are more easily used. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: add ioctl to export size of global metadata reservationJeff Mahoney2014-01-281-0/+16
| | | | | | | | | | | | | | | btrfs filesystem df output will show the size of the metadata space and how much of it is used, and the user assumes that the difference is all usable space. Since that's not actually the case due to the global metadata reservation, we should provide the full picture to the user. This patch adds an ioctl that exports the size of the global metadata reservation so that btrfs filesystem df can report it. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: use feature attribute names to print better error messagesJeff Mahoney2014-01-283-6/+64
| | | | | | | | | | Now that we have the feature name strings available in the kernel via the sysfs attributes, we can use them for printing better failure messages from the ioctl path. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: add ability to change features via sysfsJeff Mahoney2014-01-282-4/+117
| | | | | | | | | | | | | | | | | | | This patch adds the ability to change (set/clear) features while the file system is mounted. A bitmask is added for each feature set for the support to set and clear the bits. A message indicating which bit has been set or cleared is issued when it's been changed and also when permission or support for a particular bit has been denied. Since the the attributes can now be writable, we need to introduce another struct attribute to hold the different permissions. If neither set or clear is supported, the file will have 0444 permissions. If either set or clear is supported, the file will have 0644 permissions and the store handler will filter out the write based on the bitmask. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: publish unknown feature bits in sysfsJeff Mahoney2014-01-281-1/+107
| | | | | | | | | | | | | | With the compat and compat-ro bits, it's possible for file systems to exist that have features that aren't supported by the kernel's file system implementation yet still be mountable. This patch publishes read-only info on those features using a prefix:number format, where the number is the bit number rather than the shifted value. e.g. "compat:12" Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: publish per-super features in sysfsJeff Mahoney2014-01-281-16/+65
| | | | | | | | | | This patch publishes information on which features are enabled in the file system on a per-super basis. At this point, it only publishes information on features supported by the file system implementation. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: publish per-super attributes in sysfsJeff Mahoney2014-01-284-0/+57
| | | | | | | | | | | This patch adds per-super attributes to sysfs. It doesn't publish any attributes yet, but does the proper lifetime handling as well as the basic infrastructure to add new attributes. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: publish supported featured in sysfsJeff Mahoney2014-01-282-0/+87
| | | | | | | | | | | | | This patch adds the ability to publish supported features to sysfs under /sys/fs/btrfs/features. The files are module-wide and export which features the kernel supports. The content, for now, is just "0\n". Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* btrfs: add ioctls to query/change feature bits onlineJeff Mahoney2014-01-282-0/+151
| | | | | | | | | | | | | | | | | | | | | | | There are some feature bits that require no offline setup and can be enabled online. I've only reviewed extended irefs, but there will probably be more. We introduce three new ioctls: - BTRFS_IOC_GET_SUPPORTED_FEATURES: query the kernel for supported features. - BTRFS_IOC_GET_FEATURES: query the kernel for enabled features on a per-fs basis, as well as querying for which features are changeable with mounted. - BTRFS_IOC_SET_FEATURES: change features on a per-fs basis. We introduce two new masks per feature set (_SAFE_SET and _SAFE_CLEAR) that allow us to define which features are safe to change at runtime. The failure modes for BTRFS_IOC_SET_FEATURES are as follows: - Enabling a completely unsupported feature: warns and returns -ENOTSUPP - Enabling a feature that can only be done offline: warns and returns -EPERM Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: skip merge part for delayed data refsLiu Bo2014-01-281-0/+7
| | | | | | | | | | | | | | | When we have data deduplication on, we'll hang on the merge part because it needs to verify every queued delayed data refs related to this disk offset but we may have millions refs. And in the case of delayed data refs, we don't usually have too much data refs to merge. So it's safe to shut it down for data refs. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: introduce a head ref rbtreeLiu Bo2014-01-285-60/+99
| | | | | | | | | | | | | | | | | | | | | The way how we process delayed refs is 1) get a bunch of head refs, 2) pick up one head ref, 3) go one node back for any delayed ref updates. The head ref is also linked in the same rbtree as the delayed ref is, so in 1) stage, we have to walk one by one including not only head refs, but delayed refs. When we have a great number of delayed refs pending to process, this'll cost time a lot. Here we introduce a head ref specific rbtree, it only has head refs, so troubles go away. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: fix check-integrity to look at the referenced data properlyJosef Bacik2014-01-282-4/+12
| | | | | | | | | | | | We were looking at file_extent_num_bytes unconditionally when looking at referenced data bytes, but this isn't correct for compression. Fix this by checking the compression of the file extent we are and setting num_bytes to disk_num_bytes in the case of compression so that we are marking the proper bytes as referenced. This fixes check_int_data freaking out when running btrfs/004. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* Btrfs: incompatible format change to remove hole extentsJosef Bacik2014-01-286-56/+373
| | | | | | | | | | | | | | | Btrfs has always had these filler extent data items for holes in inodes. This has made somethings very easy, like logging hole punches and sending hole punches. However for large holey files these extent data items are pure overhead. So add an incompatible feature to no longer add hole extents to reduce the amount of metadata used by these sort of files. This has a few changes for logging and send obviously since they will need to detect holes and log/send the holes if there are any. I've tested this thoroughly with xfstests and it doesn't cause any issues with and without the incompat format set. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2014-01-172-2/+7
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace fixes from Eric Biederman: "This is a set of 3 regression fixes. This fixes /proc/mounts when using "ip netns add <netns>" to display the actual mount point. This fixes a regression in clone that broke lxc-attach. This fixes a regression in the permission checks for mounting /proc that made proc unmountable if binfmt_misc was in use. Oops. My apologies for sending this pull request so late. Al Viro gave interesting review comments about the d_path fix that I wanted to address in detail before I sent this pull request. Unfortunately a bad round of colds kept from addressing that in detail until today. The executive summary of the review was: Al: Is patching d_path really sufficient? The prepend_path, d_path, d_absolute_path, and __d_path family of functions is a really mess. Me: Yes, patching d_path is really sufficient. Yes, the code is mess. No it is not appropriate to rewrite all of d_path for a regression that has existed for entirely too long already, when a two line change will do" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: vfs: Fix a regression in mounting proc fork: Allow CLONE_PARENT after setns(CLONE_NEWPID) vfs: In d_path don't call d_dname on a mount point
| * vfs: Fix a regression in mounting procEric W. Biederman2013-11-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Gao feng <gaofeng@cn.fujitsu.com> reported that commit e51db73532955dc5eaba4235e62b74b460709d5b userns: Better restrictions on when proc and sysfs can be mounted caused a regression on mounting a new instance of proc in a mount namespace created with user namespace privileges, when binfmt_misc is mounted on /proc/sys/fs/binfmt_misc. This is an unintended regression caused by the absolutely bogus empty directory check in fs_fully_visible. The check fs_fully_visible replaced didn't even bother to attempt to verify proc was fully visible and hiding proc files with any kind of mount is rare. So for now fix the userspace regression by allowing directory with nlink == 1 as /proc/sys/fs/binfmt_misc has. I will have a better patch but it is not stable material, or last minute kernel material. So it will have to wait. Cc: stable@vger.kernel.org Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Gao feng <gaofeng@cn.fujitsu.com> Tested-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * vfs: In d_path don't call d_dname on a mount pointEric W. Biederman2013-11-261-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Aditya Kali (adityakali@google.com) wrote: > Commit bf056bfa80596a5d14b26b17276a56a0dcb080e5: > "proc: Fix the namespace inode permission checks." converted > the namespace files into symlinks. The same commit changed > the way namespace bind mounts appear in /proc/mounts: > $ mount --bind /proc/self/ns/ipc /mnt/ipc > Originally: > $ cat /proc/mounts | grep ipc > proc /mnt/ipc proc rw,nosuid,nodev,noexec 0 0 > > After commit bf056bfa80596a5d14b26b17276a56a0dcb080e5: > $ cat /proc/mounts | grep ipc > proc ipc:[4026531839] proc rw,nosuid,nodev,noexec 0 0 > > This breaks userspace which expects the 2nd field in > /proc/mounts to be a valid path. The symlink /proc/<pid>/ns/{ipc,mnt,net,pid,user,uts} point to dentries allocated with d_alloc_pseudo that we can mount, and that have interesting names printed out with d_dname. When these files are bind mounted /proc/mounts is not currently displaying the mount point correctly because d_dname is called instead of just displaying the path where the file is mounted. Solve this by adding an explicit check to distinguish mounted pseudo inodes and unmounted pseudo inodes. Unmounted pseudo inodes always use mount of their filesstem as the mnt_root in their path making these two cases easy to distinguish. CC: stable@vger.kernel.org Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Reported-by: Aditya Kali <adityakali@google.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
* | Merge tag 'writeback-fixes' of ↵Linus Torvalds2014-01-161-6/+9
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux Pull writeback fix from Wu Fengguang: "Fix data corruption on NFS writeback. It has been in linux-next for one month" * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: Fix data corruption on NFS
| * | writeback: Fix data corruption on NFSJan Kara2013-12-141-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 4f8ad655dbc8 "writeback: Refactor writeback_single_inode()" added a condition to skip clean inode. However this is wrong in WB_SYNC_ALL mode because there we also want to wait for outstanding writeback on possibly clean inode. This was causing occasional data corruption issues on NFS because it uses sync_inode() to make sure all outstanding writes are flushed to the server before truncating the inode and with sync_inode() returning prematurely file was sometimes extended back by an outstanding write after it was truncated. So modify the test to also check for pages under writeback in WB_SYNC_ALL mode. CC: stable@vger.kernel.org # >= 3.5 Fixes: 4f8ad655dbc82cf05d2edc11e66b78a42d38bf93 Reported-and-tested-by: Dan Duval <dan.duval@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>
* | | nilfs2: fix segctor bug that causes file system corruptionAndreas Rohner2014-01-151-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a bug in the function nilfs_segctor_collect, which results in active data being written to a segment, that is marked as clean. It is possible, that this segment is selected for a later segment construction, whereby the old data is overwritten. The problem shows itself with the following kernel log message: nilfs_sufile_do_cancel_free: segment 6533 must be clean Usually a few hours later the file system gets corrupted: NILFS: bad btree node (blocknr=8748107): level = 0, flags = 0x0, nchildren = 0 NILFS error (device sdc1): nilfs_bmap_last_key: broken bmap (inode number=114660) The issue can be reproduced with a file system that is nearly full and with the cleaner running, while some IO intensive task is running. Although it is quite hard to reproduce. This is what happens: 1. The cleaner starts the segment construction 2. nilfs_segctor_collect is called 3. sc_stage is on NILFS_ST_SUFILE and segments are freed 4. sc_stage is on NILFS_ST_DAT current segment is full 5. nilfs_segctor_extend_segments is called, which allocates a new segment 6. The new segment is one of the segments freed in step 3 7. nilfs_sufile_cancel_freev is called and produces an error message 8. Loop around and the collection starts again 9. sc_stage is on NILFS_ST_SUFILE and segments are freed including the newly allocated segment, which will contain active data and can be allocated at a later time 10. A few hours later another segment construction allocates the segment and causes file system corruption This can be prevented by simply reordering the statements. If nilfs_sufile_cancel_freev is called before nilfs_segctor_extend_segments the freed segments are marked as dirty and cannot be allocated any more. Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net> Reviewed-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Tested-by: Andreas Rohner <andreas.rohner@gmx.net> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>