path: root/fs
Commit message (Author, Age, Files, Lines)
* GFS2: Fix spectator umount issue (Steven Whitehouse, 2010-09-29, 2 files, -7/+7)
| | | | | | | | | | The tests further down the recovery function relating to unlocking the journal need to be updated to match the initial test. Also, a test in the umount code which was surplus to requirements has been removed. Umounting spectator mounts now works correctly, as expected. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: Fix compiler warning from previous patch (Steven Whitehouse, 2010-09-28, 1 file, -1/+1)
| | | | | | | This shouldn't really be required, but gcc can't tell that "al" is only accessed when initialised. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: reserve more blocks for transactions (Benjamin Marzinski, 2010-09-28, 7 files, -7/+20)
| | | | | | | | | | | | | Some of the functions in GFS2 were not reserving space in the transaction for the resource group header and the resource group's bitblocks that get added when you do allocation. GFS2 now makes sure to reserve space for the resource group header and either all the bitblocks in the resource group, or one for each block that it may allocate, whichever is smaller, using the new gfs2_rg_blocks() inline function. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
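The reservation rule in this message can be sketched in plain C. This is only an illustration under stated assumptions, not the GFS2 source: `struct rg_sketch`, its `bitblocks` field and `rg_blocks_sketch()` are hypothetical stand-ins for the real resource group structure and the `gfs2_rg_blocks()` inline.

```c
/* Hypothetical stand-in for a resource group: only the number of
 * bitmap blocks ("bitblocks") matters for this sketch. */
struct rg_sketch {
	unsigned int bitblocks;
};

/*
 * Blocks to reserve in the transaction: one for the resource group
 * header, plus the smaller of (all bitblocks in the resource group)
 * and (one bitblock per block that may be allocated).
 */
static unsigned int rg_blocks_sketch(const struct rg_sketch *rg,
				     unsigned int requested)
{
	unsigned int bitmap = requested < rg->bitblocks ? requested
							: rg->bitblocks;

	return 1 + bitmap;
}
```

With 8 bitblocks, a request for 3 blocks reserves 4 blocks, while a request for 100 blocks is capped at 9.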
* GFS2: Fix journal check for spectator mounts (Steven Whitehouse, 2010-09-27, 1 file, -1/+2)
| | | | | | | | | When checking journals for spectator mounts, we cannot rely on the journal being locked, whatever its jid might be. This patch ensures that we always get the journal locks when checking journals for a spectator mount. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: Remove upgrade mount option (Steven Whitehouse, 2010-09-24, 3 files, -62/+3)
| | | | | | | | | This option has never done anything useful. Also at the same time this cleans up the sb checks which are done at mount time. The debug option will be accepted, but ignored in future. Since it didn't do anything, there didn't seem much point in retaining it. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: Remove localcaching mount option (Steven Whitehouse, 2010-09-23, 4 files, -7/+2)
| | | | | | | | | | | | | | | This option defaulted to on for lock_nolock mounts and off otherwise. The only function was to avoid the revalidation of dentries. In the cluster case, that is entirely pointless and liable to cause coherency problems. The patch changes the revalidation to depend upon whether the fs is a local or cluster fs (i.e. it follows the existing default behaviour). I very much doubt anybody ever used this option as there is no reason to. Even so we will continue to accept it on the mount command line, but ignore it. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: Remove ignore_local_fs mount argument (Steven Whitehouse, 2010-09-23, 2 files, -5/+1)
| | | | | | | | This has been a no-op for a very long time now. I'm pretty sure nobody uses it, but just in case we'll still accept it on the command line, but ignore it. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: Make . and .. qstrs constant (Steven Whitehouse, 2010-09-20, 5 files, -41/+33)
| | | | | | | | | Rather than calculating the qstrs for . and .. each time we need them, it's better to keep a constant version of these and just refer to them when required. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Reviewed-by: Christoph Hellwig <hch@infradead.org>
* GFS2: Use new workqueue scheme (Steven Whitehouse, 2010-09-20, 2 files, -3/+5)
| | | | | | | | | | | | | | | | | | | | | | | | | | The recovery workqueue can be freezable since we want it to finish what it is doing if the system is to be frozen (although why you'd want to freeze a cluster node is beyond me since it will result in it being ejected from the cluster). It does still make sense for single node GFS2 filesystems though. The glock workqueue will benefit from being able to run more work items concurrently. A test running postmark shows improved performance and multi-threaded workloads are likely to benefit even more. It needs to be high priority because the latency directly affects the latency of filesystem glock operations. The delete workqueue is similar to the recovery workqueue in that it must not get blocked by memory allocations, and may run for a long time. Potentially other GFS2 threads might also be converted to workqueues, but I'll leave that for a later patch. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Tejun Heo <tj@kernel.org>
* GFS2: Update handling of DLM return codes to match reality (Steven Whitehouse, 2010-09-20, 1 file, -2/+2)
| | | | | | | | | | GFS2's idea of which return codes it needs to handle was based upon those listed in dlm.h. Those didn't cover all the possible codes and listed some which never happen. This updates GFS2 to handle all the codes which can actually be returned from the DLM under various circumstances. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: Don't enforce min hold time when two demotes occur in rapid succession (Steven Whitehouse, 2010-09-20, 3 files, -5/+14)
| | | | | | | | | | | | | | | | | | | | | Due to the design of the VFS, it is quite usual for operations on GFS2 to consist of a lookup (requiring a shared lock) followed by an operation requiring an exclusive lock. If a remote node has cached an exclusive lock, then it will receive two demote events in rapid succession firstly for a shared lock and then to unlocked. The existing min hold time code was triggering in this case, even if the node was otherwise idle since the state change time was being updated by the initial demote. This patch introduces logic to skip the min hold timer in the case that a "double demote" of this kind has occurred. The min hold timer will still be used in all other cases. A new glock flag is introduced which is used to keep track of whether there have been any newly queued holders since the last glock state change. The min hold time is only applied if the flag is set. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Tested-by: Abhijith Das <adas@redhat.com>
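The "double demote" logic described here can be pictured with a small sketch. Everything in it is an assumption made for illustration: `struct glock_sketch`, the `queued_since_change` flag and `demote_delay()` are hypothetical names, times are plain tick counts, and counter wraparound is ignored.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a glock for this sketch only. */
struct glock_sketch {
	bool queued_since_change;	/* a holder was queued since the last state change */
	unsigned long state_changed;	/* tick of the last state change */
	unsigned long min_hold;		/* minimum hold time, in ticks */
};

/*
 * Number of ticks a demote request should be delayed.
 * If no holder was queued since the last state change (the second half
 * of a "double demote"), the min hold time is skipped entirely.
 */
static unsigned long demote_delay(const struct glock_sketch *gl,
				  unsigned long now)
{
	unsigned long holdtime;

	if (!gl->queued_since_change)
		return 0;

	holdtime = gl->state_changed + gl->min_hold;
	return now < holdtime ? holdtime - now : 0;
}
```

The design point is that the delay depends on whether the lock was actually used since the last state change, not only on how recently the state changed.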
* GFS2: Fix whitespace in previous patch (Steven Whitehouse, 2010-09-20, 1 file, -1/+1)
| | | | | | Removes the offending space Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: fallocate support (Benjamin Marzinski, 2010-09-20, 6 files, -2/+272)
| | | | | | | | | | | | | | This patch adds support for fallocate to gfs2. Since gfs2 does not support uninitialized data blocks, it must write out zeros to all the blocks. However, since it does not need to lock any pages to read from, gfs2 can write out the zero blocks much more efficiently. On a moderately full filesystem, fallocate works around 5 times faster on average. The fallocate call also allows gfs2 to add blocks to the file without changing the filesize, which will make it possible for gfs2 to preallocate space for the rindex file, so that gfs2 can grow a completely full filesystem. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: Add a bug trap in allocation code (Steven Whitehouse, 2010-09-20, 1 file, -1/+9)
| | | | | | | | | | This adds a check to ensure that if we reach the block allocator that we don't try and proceed if there is no alloc structure hanging off the inode. This should only happen if there is a bug in GFS2. The error return code is distinctive in order that it will be easily spotted. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: No longer experimental (Steven Whitehouse, 2010-09-20, 1 file, -1/+1)
| | | | | | | I think the time has arrived to remove the experimental tag from GFS2. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: Remove i_disksize (Steven Whitehouse, 2010-09-20, 13 files, -58/+60)
| | | | | | | | | With the update of the truncate code, ip->i_disksize and inode->i_size are merely copies of each other. This means we can remove ip->i_disksize and use inode->i_size exclusively reducing the size of a GFS2 inode by 8 bytes. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* GFS2: New truncate sequence (Steven Whitehouse, 2010-09-20, 4 files, -168/+135)
| | | | | | | | | | | This updates GFS2's truncate code to use the new truncate sequence correctly. This is a stepping stone to being able to remove ip->i_disksize in favour of using i_size everywhere now that the two sizes are always identical. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Christoph Hellwig <hch@lst.de>
* Coda: mount hangs because of missed REQ_WRITE rename (Jan Harkes, 2010-09-19, 1 file, -2/+2)
| | | | | | | | | | | | | | Coda's REQ_* defines were renamed to avoid clashes with the block layer (commit 4aeefdc69f7b: "coda: fixup clash with block layer REQ_* defines"). However one was missed and response messages are no longer matched with requests and waiting threads are no longer woken up. This patch fixes this. Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu> [ Also fixed up whitespace while at it -Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* GFS2: gfs2_logd should be using interruptible waits (Steven Whitehouse, 2010-09-17, 1 file, -1/+1)
| | | | | | | Looks like this crept in, in a recent update. Reported-by: Krzysztof Urbaniak <urban@bash.org.pl> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 (Linus Torvalds, 2010-09-16, 1 file, -3/+3)
|\ | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: cifs: fix potential double put of TCP session reference
| * cifs: fix potential double put of TCP session reference (Jeff Layton, 2010-09-14, 1 file, -3/+3)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cifs_get_smb_ses must be called on a server pointer on which it holds an active reference. It first does a search for an existing SMB session. If it finds one, it'll put the server reference and then try to ensure that the negprot is done, etc. If it encounters an error at that point then it'll return an error. There's a potential problem here though. When cifs_get_smb_ses returns an error, the caller will also put the TCP server reference leading to a double-put. Fix this by having cifs_get_smb_ses only put the server reference if it found an existing session that it could use and isn't returning an error. Cc: stable@kernel.org Reviewed-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
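The reference-counting rule in this fix can be illustrated with a toy model. This is a sketch under stated assumptions rather than the CIFS code: `struct server_sketch`, `struct session_sketch`, `server_put()` and `get_session_sketch()` are hypothetical, and a plain integer stands in for the real reference count.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the TCP server and SMB session objects. */
struct server_sketch {
	int refs;
};

struct session_sketch {
	struct server_sketch *server;
	int usable;	/* nonzero if the existing session can be reused */
};

static void server_put(struct server_sketch *s)
{
	s->refs--;
}

/*
 * The caller holds one reference on `server`.
 * - Existing, usable session: drop the caller's extra reference here,
 *   because the session already holds its own. Return 0.
 * - Existing session that fails: return -1 WITHOUT putting, since the
 *   caller puts the reference on error (putting here too would be the
 *   double put).
 * - No existing session: a new session would keep the caller's
 *   reference. Return 0.
 */
static int get_session_sketch(struct server_sketch *server,
			      struct session_sketch *existing)
{
	if (existing != NULL) {
		if (!existing->usable)
			return -1;
		server_put(server);
	}
	return 0;
}
```

On the error path the count is left untouched, so the caller's single put brings it back into balance.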
* | Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 (Linus Torvalds, 2010-09-14, 5 files, -5/+11)
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  SUNRPC: Fix the NFSv4 and RPCSEC_GSS Kconfig dependencies
  statfs() gives ESTALE error
  NFS: Fix a typo in nfs_sockaddr_match_ipaddr6
  sunrpc: increase MAX_HASHTABLE_BITS to 14
  gss:spkm3 miss returning error to caller when import security context
  gss:krb5 miss returning error to caller when import security context
  Remove incorrect do_vfs_lock message
  SUNRPC: cleanup state-machine ordering
  SUNRPC: Fix a race in rpc_info_open
  SUNRPC: Fix race corrupting rpc upcall
  Fix null dereference in call_allocate
| * | SUNRPC: Fix the NFSv4 and RPCSEC_GSS Kconfig dependencies (Trond Myklebust, 2010-09-12, 2 files, -0/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The NFSv4 client's callback server calls svc_gss_principal(), which is defined in the auth_rpcgss.ko The NFSv4 server has the same dependency, and in addition calls svcauth_gss_flavor(), gss_mech_get_by_pseudoflavor(), gss_pseudoflavor_to_service() and gss_mech_put() from the same module. The module auth_rpcgss itself has no dependencies aside from sunrpc, so we only need to select RPCSEC_GSS. Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
| * | statfs() gives ESTALE error (Menyhart Zoltan, 2010-09-12, 1 file, -0/+8)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An NFS client executes a statfs("file", &buff) call. "file" exists or existed; the client has read or written it, but has already closed it. user_path(pathname, &path) looks up "file" successfully in the directory cache and restarts the aging timer of the directory entry, even if "file" has already been removed from the server, because the lookupcache=positive option I use keeps the entries valid for a while. nfs_statfs() returns ESTALE if "file" has already been removed from the server. If the user application repeats the statfs("file", &buff) call, we are stuck: "file" remains young forever in the directory cache. Signed-off-by: Zoltan Menyhart <Zoltan.Menyhart@bull.net> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org
| * | NFS: Fix a typo in nfs_sockaddr_match_ipaddr6 (Trond Myklebust, 2010-09-12, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | Reported-by: Ben Greear <greearb@candelatech.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@kernel.org
| * | Remove incorrect do_vfs_lock message (Fabio Olive Leite, 2010-09-12, 1 file, -4/+0)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The do_vfs_lock function on fs/nfs/file.c is only called if NLM is not being used, via the -onolock mount option. Therefore it cannot really be "out of sync with lock manager" when the local locking function called returns an error, as there will be no corresponding call to the NLM. For details, simply check the if/else on do_setlk and do_unlk on fs/nfs/file.c. Signed-Off-By: Fabio Olive Leite <fleite@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* | | aio: check for multiplication overflow in do_io_submit (Jeff Moyer, 2010-09-14, 1 file, -0/+3)
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tavis Ormandy pointed out that do_io_submit does not do proper bounds checking on the passed-in iocb array:

        if (unlikely(nr < 0))
                return -EINVAL;

        if (unlikely(!access_ok(VERIFY_READ, iocbpp, (nr*sizeof(iocbpp)))))
                return -EFAULT;                      ^^^^^^^^^^^^^^^^^^

The attached patch checks for overflow, and if it is detected, the number of iocbs submitted is scaled down to a number that will fit in the long. This is an ok thing to do, as sys_io_submit is documented as returning the number of iocbs submitted, so callers should handle a return value of less than the 'nr' argument passed in.

Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
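One way to picture the overflow check described in this message is the clamp below. It is a userspace sketch under stated assumptions, not the kernel patch itself: `clamp_iocb_count()` is a hypothetical helper, and `elem_size` stands in for the size of one entry in the iocb pointer array.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper: limit nr so that nr * elem_size fits in a long.
 * Returns the (possibly reduced) count, or -1 when nr is negative.
 * Assumes 0 < elem_size <= LONG_MAX.
 */
static long clamp_iocb_count(long nr, size_t elem_size)
{
	long max = LONG_MAX / (long)elem_size;

	if (nr < 0)
		return -1;
	if (nr > max)
		nr = max;	/* scale down instead of overflowing */
	return nr;
}
```

Scaling the count down rather than failing matches the message's point that callers must already cope with fewer submissions than requested.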
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 (Linus Torvalds, 2010-09-13, 13 files, -592/+166)
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: prevent possible memory corruption in cifs_demultiplex_thread
  cifs: eliminate some more premature cifsd exits
  cifs: prevent cifsd from exiting prematurely
  [CIFS] ntlmv2/ntlmssp remove-unused-function CalcNTLMv2_partial_mac_key
  cifs: eliminate redundant xdev check in cifs_rename
  Revert "[CIFS] Fix ntlmv2 auth with ntlmssp"
  Revert "missing changes during ntlmv2/ntlmssp auth and sign"
  Revert "Eliminate sparse warning - bad constant expression"
  Revert "[CIFS] Eliminate unused variable warning"
| * | cifs: prevent possible memory corruption in cifs_demultiplex_thread (Jeff Layton, 2010-09-08, 3 files, -11/+17)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cifs_demultiplex_thread sets the addr.sockAddr.sin_port without any regard for the socket family. While it may be that the error in question here never occurs on an IPv6 socket, it's probably best to be safe and set the port properly if it ever does. Break the port setting code out of cifs_fill_sockaddr and into a new function, and call that from cifs_demultiplex_thread. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
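Setting a port with regard to the socket family, as this message describes, might look like the sketch below. It uses the standard userspace socket structures rather than the CIFS internals, and `set_port_sketch()` is a hypothetical name, not the helper the patch adds.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: store `port` (host byte order) in the field that
 * matches the address family. Returns 0 on success, or -1 for a family
 * this sketch does not handle.
 */
static int set_port_sketch(struct sockaddr *addr, unsigned short port)
{
	switch (addr->sa_family) {
	case AF_INET:
		((struct sockaddr_in *)addr)->sin_port = htons(port);
		return 0;
	case AF_INET6:
		((struct sockaddr_in6 *)addr)->sin6_port = htons(port);
		return 0;
	default:
		return -1;
	}
}
```

Writing through `sin_port` unconditionally would touch the wrong bytes of an IPv6 address structure, which is the hazard the message points at.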
| * | cifs: eliminate some more premature cifsd exits (Jeff Layton, 2010-09-08, 1 file, -29/+12)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the tcpStatus is still CifsNew, the main cifs_demultiplex_loop can break out prematurely in some cases. This is wrong as we will almost always have other structures with pointers to the TCP_Server_Info. If the main loop breaks under any condition other than tcpStatus == CifsExiting, then it'll face a use-after-free situation. I don't see any reason to treat a CifsNew tcpStatus differently than CifsGood. I believe we'll still want to attempt to reconnect in either case. What should happen in those situations is that the MIDs get marked as MID_RETRY_NEEDED. This will make CIFSSMBNegotiate return -EAGAIN, and then the caller can retry the whole thing on a newly reconnected socket. If that fails again in the same way, the caller of cifs_get_smb_ses should tear down the TCP_Server_Info struct. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
| * | cifs: prevent cifsd from exiting prematurely (Jeff Layton, 2010-09-08, 1 file, -9/+7)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When cifs_demultiplex_thread exits, it does a number of cleanup tasks including freeing the TCP_Server_Info struct. Much of the existing code in cifs assumes that when there is a cifsSesInfo struct, it holds a reference to a valid TCP_Server_Info struct. We can never allow cifsd to exit when a cifsSesInfo struct is still holding a reference to the server. The server pointers will then point to freed memory. This patch eliminates a couple of questionable conditions where it does this. The idea here is to make an -EINTR return from kernel_recvmsg behave the same way as -ERESTARTSYS or -EAGAIN. If the task was signalled from cifs_put_tcp_session, then tcpStatus will be CifsExiting, and the kernel_recvmsg call will return quickly. There's also another condition where this can occur -- if the tcpStatus is still in CifsNew, then it will also exit if the server closes the socket prematurely. I think we'll probably also need to fix that situation, but that requires a bit more consideration. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
| * | [CIFS] ntlmv2/ntlmssp remove-unused-function CalcNTLMv2_partial_mac_key (Steve French, 2010-09-08, 2 files, -59/+0)
| | | | | | | | | | | | | | | | | | | | | | | | This function is not used, so remove the definition and declaration. Reviewed-by: Jeff Layton <jlayton@samba.org> Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
| * | cifs: eliminate redundant xdev check in cifs_rename (Jeff Layton, 2010-09-08, 1 file, -21/+9)
| | | | | | | | | | | | | | | | | | | | | | | | | | | The VFS always checks that the source and target of a rename are on the same vfsmount, and hence have the same superblock. So, this check is redundant. Remove it and simplify the error handling. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
| * | Revert "[CIFS] Fix ntlmv2 auth with ntlmssp" (Steve French, 2010-09-08, 11 files, -452/+172)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 9fbc590860e75785bdaf8b83e48fabfe4d4f7d58. The change to kernel crypto and the fixes to the ntlmv2 and ntlmssp series introduced a regression. Deferring this patch series to 2.6.37 after Shirish fixes it. Signed-off-by: Steve French <sfrench@us.ibm.com> Acked-by: Jeff Layton <jlayton@redhat.com> CC: Shirish Pargaonkar <shirishp@us.ibm.com>
| * | Revert "missing changes during ntlmv2/ntlmssp auth and sign" (Steve French, 2010-09-08, 2 files, -10/+5)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 3ec6bbcdb4e85403f2c5958876ca9492afdf4031. The change to kernel crypto and the fixes to the ntlmv2 and ntlmssp series introduced a regression. Deferring this patch series to 2.6.37 after Shirish fixes it. Signed-off-by: Steve French <sfrench@us.ibm.com> Acked-by: Jeff Layton <jlayton@redhat.com> CC: Shirish Pargaonkar <shirishp@us.ibm.com>
| * | Revert "Eliminate sparse warning - bad constant expression" (Steve French, 2010-09-08, 2 files, -128/+72)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 2d20ca835867d93ead6ce61780d883a4b128106d. The change to kernel crypto and the fixes to the ntlmv2 and ntlmssp series introduced a regression. Deferring this patch series to 2.6.37 after Shirish fixes it. Signed-off-by: Steve French <sfrench@us.ibm.com> Acked-by: Jeff Layton <jlayton@redhat.com> CC: Shirish Pargaonkar <shirishp@us.ibm.com>
| * | Revert "[CIFS] Eliminate unused variable warning" (Steve French, 2010-09-08, 1 file, -2/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The change to kernel crypto and the fixes to the ntlmv2 and ntlmssp series introduced a regression. Deferring this patch series to 2.6.37 after Shirish fixes it. This reverts commit c89e5198b26a869ce2842bad8519264f3394dee9. Signed-off-by: Steve French <sfrench@us.ibm.com> Acked-by: Jeff Layton <jlayton@redhat.com> CC: Shirish Pargaonkar <shirishp@us.ibm.com>
* | | fs/9p: Don't use dotl version of mknod for dotu inode operations (Aneesh Kumar K.V, 2010-09-13, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | We should not use the dotl version for the dotu inode operations. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
* | | fs/9p: Use the correct dentry operations (Aneesh Kumar K.V, 2010-09-13, 1 file, -1/+4)
| | | | | | | | | | | | | | | | | | | | | We should use the cached dentry operation only if caching mode is enabled Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
* | | 9p: Check for NULL fid in v9fs_dir_release() (jvrao, 2010-09-13, 1 file, -2/+4)
| | | | | | | | | | | | | | | | | | | | | | | | A NULL fid should be handled in cases where we end up calling v9fs_dir_release() before we even instantiate the fid in filp. Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
* | | fs/9p: Fix error handling in v9fs_get_sb (Aneesh Kumar K.V, 2010-09-13, 1 file, -6/+14)
| | | | | | | | | | | | | | | | | | | | | This was introduced by 7cadb63d58a932041afa3f957d5cbb6ce69dcee5 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
* | | fs/9p, net/9p: memory leak fixes (Latchesar Ionkov, 2010-09-13, 1 file, -0/+2)
| |/ |/| | | | | | | | | | | Four memory leak fixes in the 9P code. Signed-off-by: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
* | Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs (Linus Torvalds, 2010-09-10, 2 files, -1/+4)
|\ \ | | | | | | | | | | | | | | | * 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: log IO completion workqueue is a high priority queue
  xfs: prevent reading uninitialized stack memory
| * | xfs: log IO completion workqueue is a high priority queue (Dave Chinner, 2010-09-10, 1 file, -1/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The workqueue implementation in 2.6.36-rcX has changed, resulting in the workqueues no longer having dedicated threads for work processing. This has caused severe livelocks under heavy parallel create workloads because the log IO completions have been getting held up behind metadata IO completions. Hence log commits would stall, memory allocation would stall because pages could not be cleaned, and lock contention on the AIL during inode IO completion processing was being seen to slow everything down even further. By making the log IO completion workqueue a high priority workqueue, they are queued ahead of all data/metadata IO completions and processed before the data/metadata completions. Hence the log never gets stalled, and operations needed to clean memory can continue as quickly as possible. This avoids the livelock conditions and allows the system to keep running under heavy load as per normal. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
| * | xfs: prevent reading uninitialized stack memory (Dan Rosenberg, 2010-09-10, 1 file, -0/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The XFS_IOC_FSGETXATTR ioctl allows unprivileged users to read 12 bytes of uninitialized stack memory, because the fsxattr struct declared on the stack in xfs_ioc_fsgetxattr() does not alter (or zero) the 12-byte fsx_pad member before copying it back to the user. This patch takes care of it. Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
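The class of leak described here is usually closed by zeroing the whole structure before filling it in. The sketch below shows that pattern in userspace C under stated assumptions: `struct fsxattr_sketch` is a hypothetical, cut-down layout (only the 12-byte pad is taken from the message), and `memcpy()` stands in for the copy back to user space.

```c
#include <string.h>

/* Hypothetical, reduced layout; only the 12-byte pad mirrors the message. */
struct fsxattr_sketch {
	unsigned int xflags;
	unsigned int extsize;
	unsigned char pad[12];
};

/*
 * Fill a stack copy and hand it to the caller. The memset() guarantees
 * that the pad bytes (and any compiler padding) never carry stale stack
 * contents out of this function.
 */
static void fill_fsxattr_sketch(struct fsxattr_sketch *out,
				unsigned int xflags, unsigned int extsize)
{
	struct fsxattr_sketch fa;

	memset(&fa, 0, sizeof(fa));
	fa.xflags = xflags;
	fa.extsize = extsize;
	memcpy(out, &fa, sizeof(fa));	/* stands in for the copy to user space */
}
```

Zeroing the whole struct, rather than only the named pad member, also covers padding the compiler may insert between fields.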
* | | execve: make responsive to SIGKILL with large arguments (Roland McGrath, 2010-09-10, 1 file, -0/+7)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An execve with a very large total of argument/environment strings can take a really long time in the execve system call. It runs uninterruptibly to count and copy all the strings. This change makes it abort the exec quickly if sent a SIGKILL. Note that this is the conservative change, to interrupt only for SIGKILL, by using fatal_signal_pending(). It would be perfectly correct semantics to let any signal interrupt the string-copying in execve, i.e. use signal_pending() instead of fatal_signal_pending(). We'll save that change for later, since it could have user-visible consequences, such as having a timer set too quickly make it so that an execve can never complete, though it always happened to work before. Signed-off-by: Roland McGrath <roland@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
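The idea of aborting a long string copy when a fatal signal is pending can be modelled in userspace as follows. This is a sketch, not the kernel's copy_strings(): the callback type `fatal_pending_fn`, the two sample predicates and `copy_strings_sketch()` are all hypothetical, with the callback standing in for fatal_signal_pending().

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback standing in for fatal_signal_pending(). */
typedef bool (*fatal_pending_fn)(void);

/* Sample predicates for trying the sketch out. */
static bool never_pending(void)  { return false; }
static bool always_pending(void) { return true; }

/*
 * Copy `count` strings back to back into `dst`, including their NUL
 * terminators. Returns the number of bytes copied, or -1 if a fatal
 * signal is pending or `dst` is too small.
 */
static long copy_strings_sketch(const char *const *strs, size_t count,
				char *dst, size_t cap,
				fatal_pending_fn fatal_pending)
{
	size_t used = 0;

	for (size_t i = 0; i < count; i++) {
		size_t len = strlen(strs[i]) + 1;

		if (fatal_pending())
			return -1;	/* give up promptly instead of finishing the copy */
		if (len > cap - used)
			return -1;	/* out of space */
		memcpy(dst + used, strs[i], len);
		used += len;
	}
	return (long)used;
}
```

Checking once per string keeps the cost of the test negligible while bounding how long a killed process keeps copying.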
* | | execve: improve interactivity with large arguments (Roland McGrath, 2010-09-10, 1 file, -0/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds a preemption point during the copying of the argument and environment strings for execve, in copy_strings(). There is already a preemption point in the count() loop, so this doesn't add any new points in the abstract sense. When the total argument+environment strings are very large, the time spent copying them can be much more than a normal user time slice. So this change improves the interactivity of the rest of the system when one process is doing an execve with very large arguments. Signed-off-by: Roland McGrath <roland@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | setup_arg_pages: diagnose excessive argument size (Roland McGrath, 2010-09-10, 1 file, -0/+5)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The CONFIG_STACK_GROWSDOWN variant of setup_arg_pages() does not check the size of the argument/environment area on the stack. When it is unworkably large, shift_arg_pages() hits its BUG_ON. This is exploitable with a very large RLIMIT_STACK limit, to create a crash pretty easily. Check that the initial stack is not too large to make it possible to map in any executable. We're not checking that the actual executable (or interpreter, for binfmt_elf) will fit. So those mappings might clobber part of the initial stack mapping. But that is just userland lossage that userland made happen, not a kernel problem. Signed-off-by: Roland McGrath <roland@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
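A size check of the kind this message describes could be sketched as below. The exact condition used by the kernel is not given in the message, so this is an assumption: `check_initial_stack()` and its parameters are hypothetical, and the rule modelled is simply that the stack must leave some room between the lowest mappable address and the stack top.

```c
/*
 * Hypothetical check: the initial stack occupies `stack_size` bytes
 * ending at `stack_top`; `min_addr` is the lowest address that may be
 * mapped. Returns 0 if room remains below the stack, -1 otherwise.
 */
static int check_initial_stack(unsigned long stack_top,
			       unsigned long stack_size,
			       unsigned long min_addr)
{
	if (stack_top < min_addr)
		return -1;
	if (stack_size >= stack_top - min_addr)
		return -1;	/* stack would leave no room for an executable */
	return 0;
}
```

Failing early with an error is the point of the change: an oversized argument area is diagnosed instead of reaching a BUG_ON later.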
* | | Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block (Linus Torvalds, 2010-09-10, 2 files, -3/+3)
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: Range check cpu in blk_cpu_to_group
  scatterlist: prevent invalid free when alloc fails
  writeback: Fix lost wake-up shutting down writeback thread
  writeback: do not lose wakeup events when forking bdi threads
  cciss: fix reporting of max queue depth since init
  block: switch s390 tape_block and mg_disk to elevator_change()
  block: add function call to switch the IO scheduler from a driver
  fs/bio-integrity.c: return -ENOMEM on kmalloc failure
  bio-integrity.c: remove dependency on __GFP_NOFAIL
  BLOCK: fix bio.bi_rw handling
  block: put dev->kobj in blk_register_queue fail path
  cciss: handle allocation failure
  cfq-iosched: Documentation help for new tunables
  cfq-iosched: blktrace print per slice sector stats
  cfq-iosched: Implement tunable group_idle
  cfq-iosched: Do group share accounting in IOPS when slice_idle=0
  cfq-iosched: Do not idle if slice_idle=0
  cciss: disable doorbell reset on reset_devices
  blkio: Fix return code for mkdir calls
| * | | writeback: Fix lost wake-up shutting down writeback thread (J. Bruce Fields, 2010-08-28, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Setting the task state here may cause us to miss the wake up from kthread_stop(), so we need to recheck kthread_should_stop() or risk sleeping forever in the following schedule().

Symptom was an indefinite hang on an NFSv4 mount. (NFSv4 may create multiple mounts in a temporary namespace while traversing the mount path, and since the temporary namespace is immediately destroyed, it may end up destroying a mount very soon after it was created, possibly making this race more likely.)

INFO: task mount.nfs4:4314 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount.nfs4 D 0000000000000000 2880 4314 4313 0x00000000
 ffff88001ed6da28 0000000000000046 ffff88001ed6dfd8 ffff88001ed6dfd8
 ffff88001ed6c000 ffff88001ed6c000 ffff88001ed6c000 ffff88001e5003a0
 ffff88001ed6dfd8 ffff88001e5003a8 ffff88001ed6c000 ffff88001ed6dfd8
Call Trace:
 [<ffffffff8196090d>] schedule_timeout+0x1cd/0x2e0
 [<ffffffff8106a31c>] ? mark_held_locks+0x6c/0xa0
 [<ffffffff819639a0>] ? _raw_spin_unlock_irq+0x30/0x60
 [<ffffffff8106a5fd>] ? trace_hardirqs_on_caller+0x14d/0x190
 [<ffffffff819671fe>] ? sub_preempt_count+0xe/0xd0
 [<ffffffff8195fc80>] wait_for_common+0x120/0x190
 [<ffffffff81033c70>] ? default_wake_function+0x0/0x20
 [<ffffffff8195fdcd>] wait_for_completion+0x1d/0x20
 [<ffffffff810595fa>] kthread_stop+0x4a/0x150
 [<ffffffff81061a60>] ? thaw_process+0x70/0x80
 [<ffffffff810cc68a>] bdi_unregister+0x10a/0x1a0
 [<ffffffff81229dc9>] nfs_put_super+0x19/0x20
 [<ffffffff810ee8c4>] generic_shutdown_super+0x54/0xe0
 [<ffffffff810ee9b6>] kill_anon_super+0x16/0x60
 [<ffffffff8122d3b9>] nfs4_kill_super+0x39/0x90
 [<ffffffff810eda45>] deactivate_locked_super+0x45/0x60
 [<ffffffff810edfb9>] deactivate_super+0x49/0x70
 [<ffffffff81108294>] mntput_no_expire+0x84/0xe0
 [<ffffffff811084ef>] release_mounts+0x9f/0xc0
 [<ffffffff81108575>] put_mnt_ns+0x65/0x80
 [<ffffffff8122cc56>] nfs_follow_remote_path+0x1e6/0x420
 [<ffffffff8122cfbf>] nfs4_try_mount+0x6f/0xd0
 [<ffffffff8122d0c2>] nfs4_get_sb+0xa2/0x360
 [<ffffffff810edcb8>] vfs_kern_mount+0x88/0x1f0
 [<ffffffff810ede92>] do_kern_mount+0x52/0x130
 [<ffffffff81963d9a>] ? _lock_kernel+0x6a/0x170
 [<ffffffff81108e9e>] do_mount+0x26e/0x7f0
 [<ffffffff81106b3a>] ? copy_mount_options+0xea/0x190
 [<ffffffff811094b8>] sys_mount+0x98/0xf0
 [<ffffffff810024d8>] system_call_fastpath+0x16/0x1b
1 lock held by mount.nfs4/4314:
 #0: (&type->s_umount_key#24){+.+...}, at: [<ffffffff810edfb1>] deactivate_super+0x41/0x70

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Acked-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>