path: root/fs/gfs2
Commit message (Author, Age, Files, Lines)
* GFS2: Increase i_writecount during gfs2_setattr_chown (Bob Peterson, 2014-01-25, 1 file, -1/+15)
|   commit 62e96cf81988101fe9e086b2877307b6adda5197 upstream.
|
|   This patch calls get_write_access in function gfs2_setattr_chown, which merely increases inode->i_writecount for the duration of the function. That will ensure that any file closes won't delete the inode's multi-block reservation while the function is running. It also ensures that a multi-block reservation exists when needed for quota change operations during the chown.
|
|   Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|   Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
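The bracketing described above can be sketched in userspace C. This is only a simplified model, not the GFS2 code: `struct model_inode`, `model_release()` and `model_chown()` are hypothetical names, and the rule "a close drops the reservation only when no writer is counted" is an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical userspace stand-in for the kernel's struct inode. */
struct model_inode {
    int i_writecount;    /* > 0: writers counted; < 0: writes denied */
    int has_reservation; /* 1 while a multi-block reservation exists */
};

/* Modeled on get_write_access(): refuse if writes are denied, else count a writer. */
static int model_get_write_access(struct model_inode *inode)
{
    if (inode->i_writecount < 0)
        return -ETXTBSY;
    inode->i_writecount++;
    return 0;
}

/* Modeled on put_write_access(): drop the writer counted above. */
static void model_put_write_access(struct model_inode *inode)
{
    assert(inode->i_writecount > 0);
    inode->i_writecount--;
}

/* Assumed close behaviour: the reservation is dropped only if no writer is counted. */
static void model_release(struct model_inode *inode)
{
    if (inode->i_writecount == 0)
        inode->has_reservation = 0;
}

/* Returns 1 if the reservation survived a close arriving mid-chown, or a negative errno. */
static int model_chown(struct model_inode *inode)
{
    int survived;
    int error = model_get_write_access(inode);

    if (error)
        return error;
    model_release(inode);              /* a concurrent close arrives here */
    survived = inode->has_reservation; /* still set: a writer is counted */
    model_put_write_access(inode);
    return survived;
}
```

In this model the reservation survives the close that arrives during the chown and can only be dropped by a close after the count has returned to zero.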
* GFS2: Fix incorrect invalidation for DIO/buffered I/O (Steven Whitehouse, 2014-01-09, 1 file, -0/+30)
|   commit dfd11184d894cd0a92397b25cac18831a1a6a5bc upstream.
|
|   In patch 209806aba9d540dde3db0a5ce72307f85f33468f we allowed local deferred locks to be granted against a cached exclusive lock. That opened up a corner case which this patch now fixes.
|
|   The solution to the problem is to check whether we have cached pages each time we do direct I/O and if so to unmap, flush and invalidate those pages. Since the glock state machine normally does that for us, mostly the code will be a no-op.
|
|   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|   Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* GFS2: don't hold s_umount over blkdev_put (Steven Whitehouse, 2014-01-09, 1 file, -1/+11)
|   commit dfe5b9ad83a63180f358b27d1018649a27b394a9 upstream.
|
|   This is a GFS2 version of Tejun's patch:
|   4f331f01b9c43bf001d3ffee578a97a1e0633eac
|   vfs: don't hold s_umount over close_bdev_exclusive() call
|
|   In this case it is blkdev_put itself that is the issue, and this patch uses the same solution of dropping and retaking s_umount.
|
|   Reported-by: Tejun Heo <tj@kernel.org>
|   Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
|   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|   Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* GFS2: Don't cache iopen glocks (Bob Peterson, 2013-06-03, 2 files, -1/+6)
|   This patch makes GFS2 immediately reclaim/delete all iopen glocks as soon as they're dequeued. This allows deleters to get an EXclusive lock on iopen so files are deleted properly instead of being set as unlinked.
|
|   Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: Fall back to vmalloc if kmalloc fails for dir hash tables (Bob Peterson, 2013-06-03, 1 file, -10/+33)
|   This version has one more correction: the vmalloc calls are replaced by __vmalloc calls to preserve the GFP_NOFS flag.
|
|   When GFS2's directory management code allocates buffers for a directory hash table, if it can't get the memory it needs, it currently gives a bad return code. Rather than giving an error, this patch allows it to use virtual memory rather than kernel memory for the hash table. This should make it possible for directories to function properly, even when kernel memory becomes very fragmented.
|
|   Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
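The fallback described above follows a common try-then-fall-back shape, which can be sketched in userspace C. Everything here is a stand-in chosen for illustration: `contig_alloc()` plays the role of an allocator that fails for large requests (as kmalloc may under fragmentation), plain `malloc()` plays the role of the virtually contiguous fallback, and the `contig_limit` parameter is an assumption used only to trigger the failure path.

```c
#include <assert.h>
#include <stdlib.h>

enum alloc_kind { ALLOC_NONE, ALLOC_CONTIG, ALLOC_VIRTUAL };

/* Hypothetical buffer descriptor recording which allocator succeeded. */
struct hash_table_buf {
    void *ptr;
    enum alloc_kind kind;
};

/* Stand-in for a physically contiguous allocator: fails above a size limit. */
static void *contig_alloc(size_t size, size_t limit)
{
    return size > limit ? NULL : malloc(size);
}

/* Try the contiguous allocator first; fall back instead of returning an error. */
static struct hash_table_buf alloc_hash_table(size_t size, size_t contig_limit)
{
    struct hash_table_buf buf = { NULL, ALLOC_NONE };

    assert(size > 0);
    buf.ptr = contig_alloc(size, contig_limit);
    if (buf.ptr) {
        buf.kind = ALLOC_CONTIG;
        return buf;
    }
    buf.ptr = malloc(size); /* stands in for the virtual-memory fallback */
    if (buf.ptr)
        buf.kind = ALLOC_VIRTUAL;
    return buf;
}

/* Release the buffer; the recorded kind would select the matching free routine. */
static void free_hash_table(struct hash_table_buf *buf)
{
    free(buf->ptr);
    buf->ptr = NULL;
    buf->kind = ALLOC_NONE;
}
```

Recording which allocator succeeded matters in real code because the two kinds of memory are freed by different routines; the model keeps the field only to make that point visible.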
* GFS2: Increase i_writecount during gfs2_setattr_size (Bob Peterson, 2013-06-03, 3 files, -11/+29)
|   This patch calls get_write_access in a few functions. This merely increases inode->i_writecount for the duration of the function. That will ensure that any file closes won't delete the inode's multi-block reservation while the function is running.
|
|   Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: Set log descriptor type for jdata blocks (Bob Peterson, 2013-06-03, 1 file, -1/+3)
|   This patch sets the log descriptor type according to whether the journal commit is for (journaled) data or metadata. This was recently broken when the functions to process data and metadata log ops were combined.
|
|   Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: Fix typo in gfs2_log_end_write loop (Steven Whitehouse, 2013-05-24, 1 file, -1/+1)
|   There was a missing _all in this loop iterator.
|
|   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: fix DLM depends to fix build errors (Randy Dunlap, 2013-05-24, 1 file, -1/+1)
|   Fix build errors by correcting DLM dependencies in GFS2.
|
|   Build errors happen when CONFIG_GFS2_FS_LOCKING_DLM=y and CONFIG_DLM=m:
|
|   fs/built-in.o: In function `gfs2_lock':
|   file.c:(.text+0xc7abd): undefined reference to `dlm_posix_get'
|   file.c:(.text+0xc7ad0): undefined reference to `dlm_posix_unlock'
|   file.c:(.text+0xc7ad9): undefined reference to `dlm_posix_lock'
|   fs/built-in.o: In function `gdlm_unmount':
|   lock_dlm.c:(.text+0xd6e5b): undefined reference to `dlm_release_lockspace'
|   fs/built-in.o: In function `sync_unlock':
|   lock_dlm.c:(.text+0xd6e9e): undefined reference to `dlm_unlock'
|   fs/built-in.o: In function `sync_lock':
|   lock_dlm.c:(.text+0xd6fb6): undefined reference to `dlm_lock'
|   fs/built-in.o: In function `gdlm_put_lock':
|   lock_dlm.c:(.text+0xd7238): undefined reference to `dlm_unlock'
|   fs/built-in.o: In function `gdlm_mount':
|   lock_dlm.c:(.text+0xd753e): undefined reference to `dlm_new_lockspace'
|   lock_dlm.c:(.text+0xd79d3): undefined reference to `dlm_release_lockspace'
|   fs/built-in.o: In function `gdlm_lock':
|   lock_dlm.c:(.text+0xd8179): undefined reference to `dlm_lock'
|   fs/built-in.o: In function `gdlm_cancel':
|   lock_dlm.c:(.text+0xd6b22): undefined reference to `dlm_unlock'
|
|   Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
|   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: Use single-block reservations for directories (Bob Peterson, 2013-05-24, 1 file, -2/+7)
|   This patch changes the multi-block allocation code, such that directory inodes only get a single block reserved in the bitmap. That way, the bitmaps are more tightly packed together, and there are fewer spans of free blocks for in-use block reservations. This means it takes less time to find a free span of blocks in the bitmap, which speeds things up. This increases the performance of some workloads by almost 2X. In Nate's mockup.py script (which does (1) create dir, (2) create dir in dir, (3) create file in that dir) the test executes in 23 steps rather than 43 steps, a 47% performance improvement.
|
|   Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: two minor quota fixups (Bob Peterson, 2013-05-24, 1 file, -2/+2)
|   This patch fixes two regression problems that Abhi found in the GFS2 quota code.
|
|   Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block (Linus Torvalds, 2013-05-08, 1 file, -1/+1)
|\
| |   Pull block core updates from Jens Axboe:
| |
| |    - Major bit is Kent's prep work for immutable bio vecs.
| |    - Stable candidate fix for a scheduling-while-atomic in the queue bypass operation.
| |    - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging discard bios.
| |    - Tejun's changes to convert the writeback thread pool to the generic workqueue mechanism.
| |    - Runtime PM framework, SCSI patches exist on top of these in James' tree.
| |    - A few random fixes.
| |
| |   * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
| |     relay: move remove_buf_file inside relay_close_buf
| |     partitions/efi.c: replace useless kzalloc's by kmalloc's
| |     fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
| |     block: fix max discard sectors limit
| |     blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
| |     Documentation: cfq-iosched: update documentation help for cfq tunables
| |     writeback: expose the bdi_wq workqueue
| |     writeback: replace custom worker pool implementation with unbound workqueue
| |     writeback: remove unused bdi_pending_list
| |     aoe: Fix unitialized var usage
| |     bio-integrity: Add explicit field for owner of bip_buf
| |     block: Add an explicit bio flag for bios that own their bvec
| |     block: Add bio_alloc_pages()
| |     block: Convert some code to bio_for_each_segment_all()
| |     block: Add bio_for_each_segment_all()
| |     bounce: Refactor __blk_queue_bounce to not use bi_io_vec
| |     raid1: use bio_copy_data()
| |     pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
| |     pktcdvd: use bio_copy_data()
| |     block: Add bio_copy_data()
| |     ...
| * block: Add bio_end_sector() (Kent Overstreet, 2013-03-23, 1 file, -1/+1)
| |   Just a little convenience macro - main reason to add it now is preparing for immutable bio vecs, it'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx into a struct bvec_iter.
| |
| |   Signed-off-by: Kent Overstreet <koverstreet@google.com>
| |   CC: Jens Axboe <axboe@kernel.dk>
| |   CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
| |   CC: Jiri Kosina <jkosina@suse.cz>
| |   CC: Alasdair Kergon <agk@redhat.com>
| |   CC: dm-devel@redhat.com
| |   CC: Neil Brown <neilb@suse.de>
| |   CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
| |   CC: Heiko Carstens <heiko.carstens@de.ibm.com>
| |   CC: linux-s390@vger.kernel.org
| |   CC: Chris Mason <chris.mason@fusionio.com>
| |   CC: Steven Whitehouse <swhiteho@redhat.com>
| |   Acked-by: Steven Whitehouse <swhiteho@redhat.com>
* | aio: don't include aio.h in sched.h (Kent Overstreet, 2013-05-07, 2 files, -0/+2)
| |   Faster kernel compiles by way of fewer unnecessary includes.
| |
| |   [akpm@linux-foundation.org: fix fallout]
| |   [akpm@linux-foundation.org: fix build]
| |   Signed-off-by: Kent Overstreet <koverstreet@google.com>
| |   Cc: Zach Brown <zab@redhat.com>
| |   Cc: Felipe Balbi <balbi@ti.com>
| |   Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| |   Cc: Mark Fasheh <mfasheh@suse.com>
| |   Cc: Joel Becker <jlbec@evilplan.org>
| |   Cc: Rusty Russell <rusty@rustcorp.com.au>
| |   Cc: Jens Axboe <axboe@kernel.dk>
| |   Cc: Asai Thambi S P <asamymuthupa@micron.com>
| |   Cc: Selvan Mani <smani@micron.com>
| |   Cc: Sam Bradshaw <sbradshaw@micron.com>
| |   Cc: Jeff Moyer <jmoyer@redhat.com>
| |   Cc: Al Viro <viro@zeniv.linux.org.uk>
| |   Cc: Benjamin LaHaise <bcrl@kvack.org>
| |   Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| |   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw (Linus Torvalds, 2013-04-30, 16 files, -242/+188)
|\ \
| | |   Pull GFS2 updates from Steven Whitehouse:
| | |    "There is not a whole lot of change this time - there are some further changes which are in the works, but those will be held over until next time.
| | |
| | |     Here there are some clean ups to inode creation, the addition of an origin (local or remote) indicator to glock demote requests, removal of one of the remaining GFP_NOFAIL allocations during log flushes, one minor clean up, and a one liner bug fix."
| | |
| | |   * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
| | |     GFS2: Flush work queue before clearing glock hash tables
| | |     GFS2: Add origin indicator to glock demote tracing
| | |     GFS2: Add origin indicator to glock callbacks
| | |     GFS2: replace gfs2_ail structure with gfs2_trans
| | |     GFS2: Remove vestigial parameter ip from function rs_deltree
| | |     GFS2: Use gfs2_dinode_out() in the inode create path
| | |     GFS2: Remove gfs2_refresh_inode from inode creation path
| | |     GFS2: Clean up inode creation path
| * | GFS2: Flush work queue before clearing glock hash tables (Bob Peterson, 2013-04-26, 1 file, -0/+1)
| | |   There was a timing window when a GFS2 file system was unmounted that caused GFS2 to call BUG() and panic the kernel. The call to BUG() is meant to ensure that the glock reference count, gl_ref, never gets down to zero and bounces back up again. What was happening during umount is that function gfs2_put_super was dequeuing its glocks for well-known files. In particular, we saw it on the journal glock, sd_jinode_gh. The dequeue caused delayed work to be queued for the glock state machine, to transition the lock to an "unlocked" state. While the work was still queued, gfs2_put_super called gfs2_gl_hash_clear to clear out the glock hash tables. If the timing was just so, the glock work function would drop the reference count at the time when it was being checked for zero, and that caused BUG() to be called. This patch calls flush_workqueue before clearing the glock hash tables, thereby ensuring that the delayed work is executed before the hash tables are cleared, and therefore the reference count never goes to zero until the glock is cleared.
| | |
| | |   Signed-off-by: Bob Peterson <rpeterso@redhat.com>
| | |   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
| * | GFS2: Add origin indicator to glock demote tracing (Steven Whitehouse, 2013-04-10, 2 files, -5/+8)
| | |   This adds the origin indicator to the trace point for glock demotion, so that it is possible to see where demote requests have come from. Note that requests generated from the demote_rq sysfs interface will show as remote, since they are intended to replicate exactly the effect of a demote request from a remote node. It is still possible to tell these apart by looking at the process which initiated the demote request.
| | |
| | |   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
| * | GFS2: Add origin indicator to glock callbacks (Steven Whitehouse, 2013-04-10, 3 files, -9/+9)
| | |   This patch adds a bool indicating whether the demote request was originated locally or remotely. This is then used by the iopen ->go_callback() to make 100% sure that it will only respond to remote callbacks.
| | |
| | |   Since ->evict_inode() uses GL_NOCACHE when it attempts to get an exclusive lock on the iopen lock, this may result in extra scheduling of the workqueue in case that the exclusive promotion request failed. This patch prevents that from happening.
| | |
| | |   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
| * | GFS2: replace gfs2_ail structure with gfs2_trans (Benjamin Marzinski, 2013-04-08, 7 files, -72/+94)
| | |   In order to allow transactions and log flushes to happen at the same time, gfs2 needs to move the transaction accounting and active items list code into the gfs2_trans structure. As a first step toward this, this patch removes the gfs2_ail structure, and handles the active items list in the gfs2_trans structure. This keeps gfs2 from allocating an ail structure on log flushes, and gives us a structure that can later be used to store the transaction accounting outside of the gfs2 superblock structure.
| | |
| | |   With this patch, at the end of a transaction, gfs2 will add the gfs2_trans structure to the superblock if there is not one already. This structure now has the active items fields that were previously in gfs2_ail. This is not necessary in the case where the transaction was simply used to add revokes, since these are never written outside of the journal, and thus, don't need an active items list.
| | |
| | |   Also, in order to make sure that the transaction structure is not removed while it's still in use by gfs2_trans_end, unlocking the sd_log_flush_lock has to happen slightly later in ending the transaction.
| | |
| | |   Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
| | |   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
| * | GFS2: Remove vestigial parameter ip from function rs_deltree (Bob Peterson, 2013-04-08, 4 files, -11/+11)
| | |   The functions that delete block reservations from the rgrp block reservations rbtree no longer use the ip parameter. This patch eliminates the parameter.
| | |
| | |   Signed-off-by: Bob Peterson <rpeterso@redhat.com>
| | |   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
| * | GFS2: Use gfs2_dinode_out() in the inode create path (Steven Whitehouse, 2013-04-08, 1 file, -35/+9)
| | |   Over the previous two patches relating to inode creation, the content of init_dinode() has been looking more and more like gfs2_dinode_out(). This is not an accident! This patch replaces the parts of init_dinode() which are duplicated in gfs2_dinode_out() with a call to that function.
| | |
| | |   Mostly that is straightforward, but there is one issue which needed to be resolved relating to the link count. The link count has to be set to zero in a certain error handling code path, which ends up calling iput(). This is now done specifically in that code path, allowing the link count to be set earlier and written into the on disk inode by gfs2_dinode_out() in the normal way.
| | |
| | |   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
| * | GFS2: Remove gfs2_refresh_inode from inode creation path (Steven Whitehouse, 2013-04-08, 3 files, -58/+35)
| | |   The original method for creating inodes used in GFS2 was to fill out a buffer, with all the information, and then to read that buffer into the in-core inode, using gfs2_refresh_inode().
| | |
| | |   The problem with this approach is that all the inode's fields need to be calculated ahead of time, and were stored in various variables making the code rather complicated.
| | |
| | |   The new approach is simply to allocate the in-core inode earlier and fill in as many fields as possible ahead of time. These can then be used to initialise the on disk representation.
| | |
| | |   The code has been working towards the point where it is possible to remove gfs2_refresh_inode() because all the fields are correctly initialised ahead of time. We've now reached that milestone, and have reversed the order of setting up the in core and on disk inodes.
| | |
| | |   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
| * | GFS2: Clean up inode creation path (Steven Whitehouse, 2013-04-08, 2 files, -69/+38)
| | |   This patch cleans up the inode creation code path in GFS2. After the Orlov allocator was merged, a number of potential improvements are now possible, and this is a first set of these.
| | |
| | |   The quota handling is now updated so that it matches the point in the code where the allocation takes place. This means that the one exception in gfs2_alloc_blocks relating to quota is now no longer required, and we can use the generic code everywhere.
| | |
| | |   In addition the call to figure out whether we need to allocate any extra blocks in order to add a directory entry is moved higher up gfs2_create_inode. This means that if it returns an error, we can deal with that at a stage where it is easier to handle that case. The returned status cannot change during the function since we hold an exclusive lock on the directory.
| | |
| | |   Two calls to gfs2_rindex_update have been changed to one, again at the top of gfs2_create_inode to simplify error handling.
| | |
| | |   The time stamps are also now initialised earlier in the creation process, this is gradually moving towards being able to remove the call to gfs2_refresh_inode in gfs2_inode_create once we have all the fields covered.
| | |
| | |   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* | | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial (Linus Torvalds, 2013-04-30, 2 files, -3/+4)
|\ \ \
| |/ /
|/| |
| | |   Pull trivial tree updates from Jiri Kosina:
| | |    "Usual stuff, mostly comment fixes, typo fixes, printk fixes and small code cleanups"
| | |
| | |   * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (45 commits)
| | |     mm: Convert print_symbol to %pSR
| | |     gfs2: Convert print_symbol to %pSR
| | |     m32r: Convert print_symbol to %pSR
| | |     iostats.txt: add easy-to-find description for field 6
| | |     x86 cmpxchg.h: fix wrong comment
| | |     treewide: Fix typo in printk and comments
| | |     doc: devicetree: Fix various typos
| | |     docbook: fix 8250 naming in device-drivers
| | |     pata_pdc2027x: Fix compiler warning
| | |     treewide: Fix typo in printks
| | |     mei: Fix comments in drivers/misc/mei
| | |     treewide: Fix typos in kernel messages
| | |     pm44xx: Fix comment for "CONFIG_CPU_IDLE"
| | |     doc: Fix typo "CONFIG_CGROUP_CGROUP_MEMCG_SWAP"
| | |     mmzone: correct "pags" to "pages" in comment.
| | |     kernel-parameters: remove outdated 'noresidual' parameter
| | |     Remove spurious _H suffixes from ifdef comments
| | |     sound: Remove stray pluses from Kconfig file
| | |     radio-shark: Fix printk "CONFIG_LED_CLASS"
| | |     doc: put proper reference to CONFIG_MODULE_SIG_ENFORCE
| | |     ...
| * | gfs2: Convert print_symbol to %pSR (Joe Perches, 2013-04-29, 2 files, -3/+4)
| | |   Use the new vsprintf extension to avoid any possible message interleaving.
| | |
| | |   Signed-off-by: Joe Perches <joe@perches.com>
| | |   Acked-by: Steven Whitehouse <swhiteho@redhat.com>
| | |   Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | | GFS2: Issue discards in 512b sectors (Bob Peterson, 2013-04-05, 1 file, -17/+13)
| | |   This patch changes GFS2's discard issuing code so that it calls function sb_issue_discard rather than blkdev_issue_discard. The code was calling blkdev_issue_discard and specifying the correct sector offset and sector size, but blkdev_issue_discard expects these values to be in terms of 512 byte sectors, even if the native sector size for the device is different. Calling sb_issue_discard with the BLOCK size instead ensures the correct block-to-512b-sector translation. I verified that "minlen" is specified in blocks, so comparing it to a number of blocks is correct.
| | |
| | |   Signed-off-by: Bob Peterson <rpeterso@redhat.com>
| | |   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
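The block-to-512b-sector translation mentioned above is a shift by the difference between the filesystem block size exponent and 9 (since 512 = 2^9). A minimal sketch, assuming a hypothetical helper name `blocks_to_sectors()` and that the block size is a power of two of at least 512 bytes:

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT_512 9 /* 512-byte sectors: 2^9 */

/*
 * Convert a count (or offset) expressed in filesystem blocks into 512-byte
 * sectors. blocksize_bits is log2 of the block size, e.g. 12 for 4096 bytes.
 */
static uint64_t blocks_to_sectors(uint64_t blocks, unsigned int blocksize_bits)
{
    assert(blocksize_bits >= SECTOR_SHIFT_512);
    return blocks << (blocksize_bits - SECTOR_SHIFT_512);
}
```

With 4096-byte blocks, each block covers eight 512-byte sectors, so passing block numbers straight through as sector numbers would discard the wrong range.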
* | | GFS2: Fix unlock of fcntl locks during withdrawn state (Steven Whitehouse, 2013-04-04, 1 file, -1/+4)
| | |   When withdraw occurs, we need to continue to allow unlocks of fcntl locks to occur, however these will only be local, since the node has withdrawn from the cluster. This prevents triggering a VFS level bug trap due to locks remaining when a file is closed.
| | |
| | |   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* | | GFS2: return error if malloc failed in gfs2_rs_alloc() (Wei Yongjun, 2013-04-04, 1 file, -1/+1)
| | |   The error code in gfs2_rs_alloc() is set to ENOMEM on allocation failure but is never used; instead, gfs2_rs_alloc() always returns 0. Fix it to return 'error'.
| | |
| | |   Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
| | |   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
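The bug class described above (an error variable that is assigned but never returned) can be illustrated with a small userspace model. This is not the GFS2 function: `struct model_ip`, `struct model_res`, `model_rs_alloc()` and the `simulate_failure` flag are hypothetical and exist only to exercise the failure path.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the inode and its block reservation. */
struct model_res { int placeholder; };
struct model_ip  { struct model_res *res; };

/* Allocate the reservation if missing; report -ENOMEM when allocation fails. */
static int model_rs_alloc(struct model_ip *ip, int simulate_failure)
{
    int error = 0;

    assert(ip != NULL);
    if (ip->res)
        return 0; /* already allocated */

    ip->res = simulate_failure ? NULL : calloc(1, sizeof(*ip->res));
    if (!ip->res)
        error = -ENOMEM;

    return error; /* the buggy shape was an unconditional "return 0" here */
}
```

A caller that trusts a zero return would otherwise go on to dereference a NULL reservation.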
* | | GFS2: use memchr_inv (Akinobu Mita, 2013-04-04, 1 file, -6/+2)
| | |   Use memchr_inv to verify that the specified memory range is cleared.
| | |
| | |   Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
| | |   Cc: Steven Whitehouse <swhiteho@redhat.com>
| | |   Cc: cluster-devel@redhat.com
| | |   Cc: Christine Caulfield <ccaulfie@redhat.com>
| | |   Cc: David Teigland <teigland@redhat.com>
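For readers unfamiliar with memchr_inv: it returns a pointer to the first byte that does not match a given value, or NULL if every byte matches. A naive userspace approximation is sketched below; `model_memchr_inv()` and `range_is_cleared()` are illustrative names, and the byte-at-a-time loop is a simplification rather than the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Return the first byte in [start, start + bytes) that differs from c, or NULL. */
static const void *model_memchr_inv(const void *start, int c, size_t bytes)
{
    const unsigned char *p = start;
    size_t i;

    assert(start != NULL || bytes == 0);
    for (i = 0; i < bytes; i++)
        if (p[i] != (unsigned char)c)
            return p + i;
    return NULL;
}

/* A range counts as cleared when no non-zero byte is found in it. */
static int range_is_cleared(const void *buf, size_t len)
{
    return model_memchr_inv(buf, 0, len) == NULL;
}
```

This replaces an open-coded loop over the buffer with a single call whose intent is clear from its name.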
* | | GFS2: use kmalloc for lvb bitmap (David Teigland, 2013-04-04, 2 files, -13/+19)
| |/
|/|
| |   The temp lvb bitmap was on the stack, which could be an alignment problem for __set_bit_le. Use kmalloc for it instead.
| |
| |   Signed-off-by: David Teigland <teigland@redhat.com>
| |   Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* | fs: Limit sys_mount to only request filesystem modules. (Eric W. Biederman, 2013-03-03, 1 file, -1/+3)
|/
|   Modify the request_module to prefix the file system type with "fs-" and add aliases to all of the filesystems that can be built as modules to match.
|
|   A common practice is to build all of the kernel code and leave code that is not commonly needed as modules, with the result that many users are exposed to any bug anywhere in the kernel.
|
|   Looking for filesystems with a fs- prefix limits the pool of possible modules that can be loaded by mount to just filesystems, trivially making things safer with no real cost.
|
|   Using aliases means user space can control the policy of which filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf with blacklist and alias directives. This allows simple, safe, well understood work-arounds to known problematic software.
|
|   This also addresses a rare but unfortunate problem where the filesystem name is not the same as its module name and module auto-loading would not work. While writing this patch I saw a handful of such cases. The most significant being autofs that lives in the module autofs4.
|
|   This is relevant to user namespaces because we can reach the request module in get_fs_type() without having any special permissions, and people get uncomfortable when a user specified string (in this case the filesystem type) goes all of the way to request_module.
|
|   After having looked at this issue I don't think there is any particular reason to perform any filtering or permission checks beyond making it clear in the module request that we want a filesystem module. The common pattern in the kernel is to call request_module() without regard to the user's permissions. In general all a filesystem module does once loaded is call register_filesystem() and go to sleep. Which means there is not much attack surface exposed by loading a filesystem module unless the filesystem is mounted.
|
|   In a user namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT, which most filesystems do not set today.
|
|   Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
|   Acked-by: Kees Cook <keescook@chromium.org>
|   Reported-by: Kees Cook <keescook@google.com>
|   Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
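The "fs-" prefixing described above can be modeled as plain string handling. The sketch below is a userspace illustration only: `fs_module_name()` and `alias_matches_request()` are hypothetical helpers, and real alias resolution is done by the module loading tools rather than by a string comparison like this.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the module name that a mount of the given filesystem type would request. */
static int fs_module_name(char *out, size_t outlen, const char *fstype)
{
    assert(out != NULL && fstype != NULL);
    return snprintf(out, outlen, "fs-%s", fstype);
}

/* Check whether a module alias would satisfy the request for a filesystem type. */
static int alias_matches_request(const char *alias, const char *fstype)
{
    char wanted[64];
    int n = fs_module_name(wanted, sizeof(wanted), fstype);

    if (n < 0 || (size_t)n >= sizeof(wanted))
        return 0; /* name did not fit; treat as no match */
    return strcmp(alias, wanted) == 0;
}
```

Under this scheme a module that only carries the alias "gfs2" is not reachable through mount; it has to declare the "fs-gfs2" alias, which is what keeps non-filesystem modules out of the pool.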
* Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs (Linus Torvalds, 2013-02-26, 4 files, -21/+20)
|\
| |   Pull vfs pile (part one) from Al Viro:
| |    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc.
| |
| |     The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes.
| |
| |     Misc patches by various people sent this cycle *and* ocfs2 fixes from several cycles ago that should've been upstream right then.
| |
| |     PS: the next vfs pile will be xattr stuff."
| |
| |   * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
| |     saner proc_get_inode() calling conventions
| |     proc: avoid extra pde_put() in proc_fill_super()
| |     fs: change return values from -EACCES to -EPERM
| |     fs/exec.c: make bprm_mm_init() static
| |     ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
| |     ocfs2: fix possible use-after-free with AIO
| |     ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
| |     get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
| |     target: writev() on single-element vector is pointless
| |     export kernel_write(), convert open-coded instances
| |     fs: encode_fh: return FILEID_INVALID if invalid fid_type
| |     kill f_vfsmnt
| |     vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
| |     nfsd: handle vfs_getattr errors in acl protocol
| |     switch vfs_getattr() to struct path
| |     default SET_PERSONALITY() in linux/elf.h
| |     ceph: prepopulate inodes only when request is aborted
| |     d_hash_and_lookup(): export, switch open-coded instances
| |     9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
| |     9p: split dropping the acls from v9fs_set_create_acl()
| |     ...
| * fs: change return values from -EACCES to -EPERM (Zhao Hongjiang, 2013-02-26, 1 file, -9/+9)
| |   According to SUSv3:
| |
| |   [EACCES] Permission denied. An attempt was made to access a file in a way forbidden by its file access permissions.
| |
| |   [EPERM] Operation not permitted. An attempt was made to perform an operation limited to processes with appropriate privileges or to the owner of a file or other resource.
| |
| |   So -EPERM should be returned if capability checks fail.
| |
| |   Strictly speaking this is an API change since the error code user sees is altered.
| |
| |   Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
| |   Acked-by: Jan Kara <jack@suse.cz>
| |   Acked-by: Steven Whitehouse <swhiteho@redhat.com>
| |   Acked-by: Ian Kent <raven@themaw.net>
| |   Cc: Al Viro <viro@zeniv.linux.org.uk>
| |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| |   Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
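The distinction drawn above between the two error codes can be shown with a toy decision function. The function and its two boolean inputs are hypothetical; real code would consult capability checks and file mode bits rather than take flags as parameters.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Toy model: an operation that needs a privilege (capability) and, separately,
 * permission from the file's access mode.
 */
static int model_privileged_op(bool has_capability, bool mode_allows)
{
    if (!has_capability)
        return -EPERM;  /* operation limited to privileged processes */
    if (!mode_allows)
        return -EACCES; /* forbidden by the file access permissions */
    return 0;
}

/* Usage: a failed privilege check is reported as EPERM, not EACCES. */
static void model_privileged_op_demo(void)
{
    assert(model_privileged_op(false, true) == -EPERM);
    assert(model_privileged_op(true, false) == -EACCES);
}
```

Because user space can observe which code comes back, swapping one for the other is visible to callers, which is why the commit message flags it as an API change.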
| * fs: encode_fh: return FILEID_INVALID if invalid fid_type (Namjae Jeon, 2013-02-26, 1 file, -2/+2)
| |   This patch is a follow up on below patch:
| |
| |   [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type
| |   commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63
| |
| |   Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
| |   Signed-off-by: Vivek Trivedi <t.vivek@samsung.com>
| |   Acked-by: Steven Whitehouse <swhiteho@redhat.com>
| |   Acked-by: Sage Weil <sage@inktank.com>
| |   Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * new helper: file_inode(file) (Al Viro, 2013-02-22, 2 files, -10/+9)
| |   Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace (Linus Torvalds, 2013-02-25, 11 files, -118/+104)
|\ \
| | |   Pull user namespace and namespace infrastructure changes from Eric W Biederman:
| | |    "This set of changes starts with a few small enhancements to the user namespace: reboot support, allowing more arbitrary mappings, and support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the user namespace root.
| | |
| | |     I do my best to document that if you care about limiting your unprivileged users that when you have the user namespace support enabled you will need to enable memory control groups.
| | |
| | |     There is a minor bug fix to prevent overflowing the stack if someone creates way too many user namespaces.
| | |
| | |     The bulk of the changes are a continuation of the kuid/kgid push down work through the filesystems. These changes make using uids and gids typesafe which ensures that these filesystems are safe to use when multiple user namespaces are in use. The filesystems converted for 3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The changes for these filesystems were a little more involved so I split the changes into smaller hopefully obviously correct changes.
| | |
| | |     XFS is the only filesystem that remains. I was hoping I could get that in this release so that user namespace support would be enabled with an allyesconfig or an allmodconfig but it looks like the xfs changes need another couple of days before they are ready."
| | |
| | |   * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
| | |     cifs: Enable building with user namespaces enabled.
| | |     cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
| | |     cifs: Convert struct cifs_sb_info to use kuids and kgids
| | |     cifs: Modify struct smb_vol to use kuids and kgids
| | |     cifs: Convert struct cifsFileInfo to use a kuid
| | |     cifs: Convert struct cifs_fattr to use kuid and kgids
| | |     cifs: Convert struct tcon_link to use a kuid.
| | |     cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
| | |     cifs: Convert from a kuid before printing current_fsuid
| | |     cifs: Use kuids and kgids SID to uid/gid mapping
| | |     cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
| | |     cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
| | |     cifs: Override unmappable incoming uids and gids
| | |     nfsd: Enable building with user namespaces enabled.
| | |     nfsd: Properly compare and initialize kuids and kgids
| | |     nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
| | |     nfsd: Modify nfsd4_cb_sec to use kuids and kgids
| | |     nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
| | |     nfsd: Convert nfsxdr to use kuids and kgids
| | |     nfsd: Convert nfs3xdr to use kuids and kgids
| | |     ...
| * | gfs2: Convert uids and gids between dinodes and vfs inodes.Eric W. Biederman2013-02-133-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When reading dinodes from the disk convert uids and gids into kuids and kgids to store in vfs data structures. When writing to dinodes to the disk convert kuids and kgids in the in memory structures into plain uids and gids. For now all on disk data structures are assumed to be stored in the initial user namespace. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
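The conversion described above can be sketched in userspace C. This is only an illustration, not the gfs2 code: `kuid_t` is a stand-in for the kernel type, `make_kuid_init` and `from_kuid_init` are hypothetical stand-ins for `make_kuid()` and `from_kuid()` applied to the initial user namespace (where the mapping is assumed to be the identity), and the on-disk field is assumed to be a 32-bit big-endian integer.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's kuid_t wrapper (illustrative only). */
typedef struct { uint32_t val; } kuid_t;

/* Assumption: the on-disk uid field is a 32-bit big-endian integer. */
uint32_t load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

void store_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/* Hypothetical stand-ins: in the initial namespace the mapping is the identity. */
kuid_t make_kuid_init(uint32_t uid)
{
	kuid_t k = { uid };
	return k;
}

uint32_t from_kuid_init(kuid_t k)
{
	return k.val;
}

/* Reading a dinode: plain on-disk uid -> kuid for the in-memory inode. */
kuid_t dinode_read_uid(const unsigned char *disk_field)
{
	return make_kuid_init(load_be32(disk_field));
}

/* Writing a dinode: in-memory kuid -> plain on-disk uid. */
void dinode_write_uid(unsigned char *disk_field, kuid_t uid)
{
	store_be32(disk_field, from_kuid_init(uid));
}
```

Keeping the conversion at the read and write boundaries means the rest of the code only ever handles the wrapped type.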
| * | gfs2: Use uid_eq and gid_eq where appropriateEric W. Biederman2013-02-133-11/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Where kuid_t values are compared use uid_eq and where kgid_t values are compared use gid_eq. This is unfortunately necessary because of the type safety that keeps someone from accidentally mixing kuids and kgids with other types. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
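A minimal userspace sketch of why the helper functions are needed. The struct wrappers below only imitate the kernel's `kuid_t` and `kgid_t`, and `owner_changed` is a hypothetical helper invented for this illustration; it is not taken from the gfs2 sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's kuid_t/kgid_t wrappers (illustrative only). */
typedef struct { uint32_t val; } kuid_t;
typedef struct { uint32_t val; } kgid_t;

/* Same idea as the kernel's uid_eq()/gid_eq(): compare the wrapped values. */
bool uid_eq(kuid_t left, kuid_t right)
{
	return left.val == right.val;
}

bool gid_eq(kgid_t left, kgid_t right)
{
	return left.val == right.val;
}

/* Hypothetical helper: does the requested owner or group differ from the current one? */
bool owner_changed(kuid_t cur_uid, kgid_t cur_gid, kuid_t new_uid, kgid_t new_gid)
{
	/*
	 * "cur_uid == new_uid" would not compile, because C has no == for structs.
	 * "uid_eq(cur_uid, new_gid)" would not compile either, because the types differ.
	 */
	return !uid_eq(cur_uid, new_uid) || !gid_eq(cur_gid, new_gid);
}
```

Because the wrappers are distinct struct types, mixing a uid with a gid or with a plain integer becomes a compile error rather than a silent bug.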
| * | gfs2: Use kuid_t and kgid_t types where appropriate.Eric W. Biederman2013-02-133-9/+10
| | | | | | | | | | | | | | | Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | gfs2: Remove the QUOTA_USER and QUOTA_GROUP definesEric W. Biederman2013-02-131-20/+5
| | | Remove the QUOTA_USER and QUOTA_GROUP defines. Remove the last vestigial users of QUOTA_USER and QUOTA_GROUP. Now that struct kqid is used throughout the gfs2 quota code there is no need to use QUOTA_USER and QUOTA_GROUP, and the defines are just extraneous and confusing. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | gfs2: Store qd_id in struct gfs2_quota_data as a struct kqidEric W. Biederman2013-02-132-46/+26
| | | - Change qd_id in struct gfs2_quota_data to struct kqid. - Remove the now unnecessary QDF_USER bit field in qd_flags. - Propagate this change through the code, generally making things simpler along the way. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | gfs2: Convert gfs2_quota_refresh to take a kqidEric W. Biederman2013-02-133-5/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - In quota_refresh_user_store convert the user supplied uid into a kqid and pass it to gfs2_quota_refresh. - In quota_refresh_group_store convert the user supplied gid into a kqid and pass it to gfs2_quota_refresh. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | gfs2: Modify qdsb_get to take a struct kqidEric W. Biederman2013-02-131-6/+7
| | | | | | | | | | | | | | | Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | gfs2: Modify struct gfs2_quota_change_host to use struct kqidEric W. Biederman2013-02-131-3/+5
| | | | | | | | | | | | | | | Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | gfs2: Introduce qd2indexEric W. Biederman2013-02-131-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | Both qd_alloc and qd2offset perform the exact same computation to get an index from a gfs2_quota_data. Make life a little simpler and factor out this index computation. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
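The factoring described above might look roughly like the sketch below. All names carry a `_sketch` suffix because they are illustrative, and the layout (user and group entries interleaved, user at even indexes and group at odd ones) is an assumption made for this example rather than something stated in the commit message.

```c
#include <stdint.h>

/* Illustrative quota types; the kernel has its own USRQUOTA/GRPQUOTA definitions. */
enum quota_type_sketch { USRQUOTA_SKETCH = 0, GRPQUOTA_SKETCH = 1 };

/* Stand-in for struct kqid: a quota type plus a numeric id. */
struct kqid_sketch {
	enum quota_type_sketch type;
	uint32_t id;
};

/*
 * The single shared index computation.
 * Assumption: user ids map to even slots and group ids to odd slots.
 */
uint64_t qd2index_sketch(struct kqid_sketch qid)
{
	return 2 * (uint64_t)qid.id + (qid.type == USRQUOTA_SKETCH ? 0 : 1);
}

/* A caller that previously open-coded the index now reuses the helper. */
uint64_t qd2offset_sketch(struct kqid_sketch qid, uint64_t record_size)
{
	return qd2index_sketch(qid) * record_size;
}
```

With the computation in one place, any later change to the index layout only has to be made once.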
| * | gfs2: Report quotas in the caller's user namespace.Eric W. Biederman2013-02-131-1/+1
| | | When a quota is queried return the uid or the gid mapped into the caller's user namespace. In addition perform the munged version of the mapping so that a value that does not map is reported as the overflowuid or the overflowgid instead of -1. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
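The difference between the plain and the munged mapping can be illustrated with a small userspace sketch. The single-range map, the function names, and the constant names are all hypothetical; 65534 is used here on the assumption that the overflow uid is left at its usual default.

```c
#include <stdint.h>

/* Assumed default overflow id; the real value is configurable in the kernel. */
#define OVERFLOW_ID_SKETCH 65534u
/* "No mapping" marker, i.e. -1 converted to an unsigned 32-bit id. */
#define INVALID_ID_SKETCH ((uint32_t)-1)

/* Hypothetical single-extent id map for one user namespace. */
struct id_range_sketch {
	uint32_t ns_first;     /* first id as seen inside the namespace */
	uint32_t kernel_first; /* first id as seen by the kernel */
	uint32_t count;        /* number of ids in the range */
};

/* Plain mapping: returns INVALID_ID_SKETCH when the id has no mapping. */
uint32_t map_to_ns_sketch(const struct id_range_sketch *r, uint32_t kernel_id)
{
	if (kernel_id >= r->kernel_first && kernel_id - r->kernel_first < r->count)
		return r->ns_first + (kernel_id - r->kernel_first);
	return INVALID_ID_SKETCH;
}

/* Munged mapping: unmapped ids are reported as the overflow id instead of -1. */
uint32_t map_to_ns_munged_sketch(const struct id_range_sketch *r, uint32_t kernel_id)
{
	uint32_t id = map_to_ns_sketch(r, kernel_id);

	return id == INVALID_ID_SKETCH ? OVERFLOW_ID_SKETCH : id;
}
```

Reporting the overflow id gives userspace a displayable owner for quotas whose id has no mapping in the caller's namespace.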
| * | gfs2: Split NO_QUOTA_CHANGE into NO_UID_QUOTA_CHANGE and NO_GID_QUOTA_CHANGEEric W. Biederman2013-02-137-14/+15
| | | Split NO_QUOTA_CHANGE into NO_UID_QUOTA_CHANGE and NO_GID_QUOTA_CHANGE so the constants may be well typed. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | gfs2: Remove improper checks in gfs2_set_dqblk.Eric W. Biederman2013-02-131-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In set_dqblk it is an error to look at fdq->d_id or fdq->d_flags. Userspace quota applications do not set these fields when calling quotactl(Q_XSETQLIM,...), and the kernel does not set those fields when quota_setquota calls set_dqblk. gfs2 never looks at fdq->d_id or fdq->d_flags after checking to see if they match the id and type supplied to set_dqblk. No other linux filesystem in set_dqblk looks at either fdq->d_id or fdq->d_flags. Therefore remove these bogus checks from gfs2 and allow normal quota setting applications to work. Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
* | | mm: only enforce stable page writes if the backing device requires itDarrick J. Wong2013-02-211-1/+1
| | | Create a helper function to check if a backing device requires stable page writes and, if so, perform the necessary wait. Then, make it so that all points in the memory manager that handle making pages writable use the helper function. This should provide stable page write support to most filesystems, while eliminating unnecessary waiting for devices that don't require the feature.
| | |
| | | Before this patchset, all filesystems would block, regardless of whether or not it was necessary. ext3 would wait, but still generate occasional checksum errors. The network filesystems were left to do their own thing, so they'd wait too.
| | |
| | | After this patchset, all the disk filesystems except ext3 and btrfs will wait only if the hardware requires it. ext3 (if necessary) snapshots pages instead of blocking, and btrfs provides its own bdi so the mm will never wait. Network filesystems haven't been touched, so either they provide their own stable page guarantees or they don't block at all. The blocking behavior is back to what it was before 3.0 if you don't have a disk requiring stable page writes.
| | |
| | | Here's the result of using dbench to test latency on ext2:
| | |
| | | 3.8.0-rc3:
| | |  Operation      Count    AvgLat    MaxLat
| | |  ----------------------------------------
| | |  WriteX        109347     0.028    59.817
| | |  ReadX         347180     0.004     3.391
| | |  Flush          15514    29.828   287.283
| | |
| | | Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms
| | |
| | | 3.8.0-rc3 + patches:
| | |  WriteX        105556     0.029     4.273
| | |  ReadX         335004     0.005     4.112
| | |  Flush          14982    30.540   298.634
| | |
| | | Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms
| | |
| | | As you can see, the maximum write latency drops considerably with this patch enabled. 
The other filesystems (ext3/ext4/xfs/btrfs) behave similarly, but see the cover letter for those results. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
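The helper described in the commit above can be pictured with the following userspace sketch. It is not the mm code: every type and function name here carries a `_sketch` suffix and is hypothetical, and the "wait" is modelled as a counter so the behaviour can be observed without real I/O.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a backing device and its stable-page requirement. */
struct backing_dev_sketch {
	bool requires_stable_pages;
};

/* Hypothetical stand-in for a page that may be under writeback. */
struct page_sketch {
	const struct backing_dev_sketch *bdi;
	bool under_writeback;
	int waits; /* how many times a caller actually had to wait */
};

/* Models blocking until writeback of the page has finished. */
void wait_on_writeback_sketch(struct page_sketch *page)
{
	if (page->under_writeback) {
		page->waits++;
		page->under_writeback = false;
	}
}

/*
 * The helper: wait only when the backing device needs page contents to
 * stay unchanged during writeback; otherwise return immediately.
 */
void wait_for_stable_page_sketch(struct page_sketch *page)
{
	if (!page->bdi->requires_stable_pages)
		return;
	wait_on_writeback_sketch(page);
}
```

Routing every "make this page writable" path through one helper is what lets devices without the requirement skip the wait entirely.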
* | | GFS2: Reinstate withdraw ack systemSteven Whitehouse2013-02-134-1/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | This patch reinstates the ack system which withdraw should be using. It appears to have been accidentally forgotten when the lock module was merged into GFS2, due to two different sysfs files having the same name. Reported-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>