| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We want to move commiting pages to pages list instead.
Otherwise it causes pnfs small writes crash like:
[34560.037692] BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
[34560.038557] IP: [<ffffffffa05423d6>] nfs_init_commit+0x26/0x130 [nfs]
[34560.039400] PGD 69f5a067 PUD 69f59067 PMD 0
[34560.040207] Oops: 0000 [#1] SMP
[34560.041014] Modules linked in: nfsv3(OE) nfs_layout_flexfiles(OE) nfsv4(OE) nfs(OE) fscache(E) rpcsec_gss_krb5(E) xt_addrtype(E) xt_conntrack(E) ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) iptable_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) nf_nat_ipv4(E) iptable_filter(E) ip_tables(E) x_tables(E) nf_nat(E) nf_conntrack(E) bridge(E) stp(E) llc(E) dm_thin_pool(E) dm_persistent_data(E) dm_bio_prison(E) dm_bufio(E) ppdev(E) vmw_balloon(E) coretemp(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) glue_helper(E) lrw(E) gf128mul(E) ablk_helper(E) cryptd(E) psmouse(E) serio_raw(E) vmw_vmci(E) i2c_piix4(E) shpchp(E) parport_pc(E) parport(E) mac_hid(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) xfs(E) libcrc32c(E) hid_generic(E) usbhid(E) hid(E) e1000(E) mptspi(E)
[34560.045106] mptscsih(E) mptbase(E) vmwgfx(E) drm_kms_helper(E) ttm(E) drm(E) autofs4(E) [last unloaded: fscache]
[34560.045897] CPU: 0 PID: 130543 Comm: bash Tainted: G OE 4.2.0-rc5-dp-00057-gf993a93 #11
[34560.046699] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
[34560.047525] task: ffff880031b0a980 ti: ffff880045fec000 task.ti: ffff880045fec000
[34560.048264] RIP: 0010:[<ffffffffa05423d6>] [<ffffffffa05423d6>] nfs_init_commit+0x26/0x130 [nfs]
[34560.049000] RSP: 0018:ffff880045fefc18 EFLAGS: 00010246
[34560.049717] RAX: 0000000000000000 RBX: ffff8800208fbc80 RCX: ffff880045fefd50
[34560.050396] RDX: ffff880031c19ec0 RSI: ffff880045fefc88 RDI: ffff8800208fbc80
[34560.051041] RBP: ffff880045fefc28 R08: ffff8800208fbe68 R09: ffff880045fefc88
[34560.051666] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880045fefc78
[34560.052247] R13: ffff880045fefc88 R14: ffff880045fefa90 R15: ffff880045fefd50
[34560.052825] FS: 00007fa02d58c740(0000) GS:ffff88006d600000(0000) knlGS:0000000000000000
[34560.053410] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[34560.053992] CR2: 0000000000000068 CR3: 000000003b37a000 CR4: 00000000001406f0
[34560.054615] Stack:
[34560.055200] ffff8800208fbc80 ffff8800208fbc80 ffff880045fefcc8 ffffffffa05c1a5b
[34560.055800] ffff880045fefcc8 ffff880045fefd50 0000000045fefcb8 ffff880045fefd40
[34560.056418] ffff8800420608e0 ffffffffa04f3910 0000000100000001 ffff880045fefd50
[34560.057013] Call Trace:
[34560.057672] [<ffffffffa05c1a5b>] pnfs_generic_commit_pagelist+0x1cb/0x300 [nfsv4]
[34560.058277] [<ffffffffa04f3910>] ? ff_layout_commit_pagelist+0x20/0x20 [nfs_layout_flexfiles]
[34560.058907] [<ffffffffa04f3905>] ff_layout_commit_pagelist+0x15/0x20 [nfs_layout_flexfiles]
[34560.059557] [<ffffffffa0543fc1>] nfs_generic_commit_list+0xb1/0xf0 [nfs]
[34560.060214] [<ffffffffa0543e47>] ? nfs_scan_commit+0x37/0xa0 [nfs]
[34560.060825] [<ffffffffa0544081>] nfs_commit_inode+0x81/0x150 [nfs]
[34560.061432] [<ffffffffa05443ae>] nfs_wb_all+0x1ae/0x400 [nfs]
[34560.062035] [<ffffffffa05380ad>] nfs_getattr+0x33d/0x510 [nfs]
[34560.062630] [<ffffffff8122499c>] vfs_getattr_nosec+0x2c/0x40
[34560.063223] [<ffffffff81224a66>] vfs_getattr+0x26/0x30
[34560.063818] [<ffffffff81224b35>] vfs_fstatat+0x65/0xa0
[34560.064413] [<ffffffff81224f3f>] SYSC_newstat+0x1f/0x40
[34560.065016] [<ffffffff8102b176>] ? do_audit_syscall_entry+0x66/0x70
[34560.065626] [<ffffffff8102c773>] ? syscall_trace_enter_phase1+0x113/0x170
[34560.066245] [<ffffffff81003017>] ? trace_hardirqs_on_thunk+0x17/0x19
[34560.066868] [<ffffffff812251ae>] SyS_newstat+0xe/0x10
[34560.067533] [<ffffffff817a5df2>] entry_SYSCALL_64_fastpath+0x16/0x7a
[34560.068173] Code: 0f 1f 44 00 00 0f 1f 44 00 00 55 4c 8d 87 e8 01 00 00 48 89 e5 53 48 89 fb 48 83 ec 08 4c 8b 0e 49 8b 41 18 4c 39 ce 48 8b 40 40 <4c> 8b 50 68 74 24 48 8b 87 e8 01 00 00 48 8b 7e 08 4d 89 41 08
[34560.069609] RIP [<ffffffffa05423d6>] nfs_init_commit+0x26/0x130 [nfs]
[34560.070295] RSP <ffff880045fefc18>
[34560.071008] CR2: 0000000000000068
[34560.073207] ---[ end trace f85f873260977406 ]---
[fixes 27571297a7e(pNFS: Tighten up locking around DS commit buckets)]
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Unlike the previous attempt, this takes into account the fact that
we may be calling it from the recovery thread itself. Detect this
by looking at what kind of open we're doing, and checking the state
of the NFS_DELEGATION_NEED_RECLAIM if it turns out we're doing a
reboot reclaim-type open.
Cc: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
|
|
| |
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Otherwise we break fstest case tests/read_write/mctime.t
Does files layout need the same fix as well?
Cc: stable@vger.kernel.org # v4.0+
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
|
|
|
|
|
|
|
| |
If layout is marked by NFS_LAYOUT_RETURN_BEFORE_CLOSE, we should always
send LAYOUTRETURN before close, and we don't need to do ROC drain if we
do send LAYOUTRETURN.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit 4e379d36c050b0117b5d10048be63a44f5036115.
This commit opens up a race between the recovery code and the open code.
Reported-by: Olga Kornievskaia <aglo@umich.edu>
Cc: stable@vger.kernel # v4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
|
|
|
|
|
|
| |
If we have an OPEN_DOWNGRADE and CLOSE race with one another, we want
to ensure that the layout is forgotten by the client, so that we
start afresh with a new layoutget.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The helper pnfs_roc() has already verified that we have no delegations,
and no further open files, hence no outstanding I/O and it has marked
all the return-on-close lsegs as being invalid.
Furthermore, it sets the NFS_LAYOUT_RETURN bit, thus serialising the
close/delegreturn with all future layoutget calls on this inode.
The checks in pnfs_roc_drain() for valid layout segments are therefore
redundant: those cannot exist until another layoutget completes.
The other check for whether or not NFS_LAYOUT_RETURN is set, actually
causes a hang, since we already know that we hold that flag.
To fix, we therefore strip out all the functionality in pnfs_roc_drain()
except the retrieval of the barrier state, and then rename the function
accordingly.
Reported-by: Christoph Hellwig <hch@infradead.org>
Fixes: 5c4a79fb2b1c ("Don't prevent layoutgets when doing return-on-close")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
|
|
|
|
|
| |
generic_file_write_iter() will already do an fsync on our behalf
if the file descriptor is O_SYNC or the file is marked as IS_SYNC.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Chuck reports seeing cases where a GETATTR that happens to race
with an asynchronous WRITE is overriding the file size, despite
the attribute barrier being set by the writeback code.
The culprit turns out to be the check in nfs_ctime_need_update(),
which sees that the ctime is newer than the cached ctime, and
assumes that it is safe to override the attribute barrier.
This patch removes that override, and ensures that attribute
barriers are always respected.
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: a08a8cd375db9 ("NFS: Add attribute update barriers to NFS writebacks")
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
* layoutfixes:
NFSv4.1/pnfs: Remove redundant wakeup in pnfs_send_layoutreturn()
NFSv4.1/pnfs: Remove redundant check in pnfs_layoutgets_blocked()
NFSv4.1/pnfs: Remove redundant lo->plh_block_lgets in layoutreturn
NFSv4.1/pnfs: Don't prevent layoutgets when doing return-on-close
NFSv4.1/pnfs: Fix serialisation of layout return and layoutget
NFSv4.1/pnfs: Remove redundant checks in pnfs_layoutgets_blocked()
pNFS: Tighten up locking around DS commit buckets
|
| |
| |
| |
| |
| |
| |
| | |
pnfs_clear_layoutreturn_waitbit() should already be calling
rpc_wake_up(&NFS_SERVER(ino)->roc_rpcwaitq) for us.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| |
| |
| |
| |
| |
| | |
layoutget now should already be serialised w.r.t. layout returns
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| |
| |
| |
| |
| |
| |
| | |
The NFS_LAYOUT_RETURN bit already suffices to ensure that layoutget
is blocked.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| |
| |
| |
| |
| |
| |
| | |
If there is an outstanding return-on-close, then we just want new
layoutget requests to wait rather than fail.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| |
| |
| |
| |
| |
| |
| | |
We should always test for outstanding layout returns, whether or not
pnfs_should_retry_layoutget() is true.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| |
| |
| |
| |
| |
| |
| |
| | |
If there are no valid layout segments, then we should already have
checked in pnfs_update_layout() whether or not this is the first
layoutget.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
I'm not aware of any bugreports around this issue, but the locking
around the pnfs_commit_bucket is inconsistent at best. This patch
tightens it up by ensuring that the 'bucket->committing' list is always
changed atomically w.r.t. the 'bucket->clseg' layout segment tracking.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | | |
* bugfixes:
SUNRPC: Fix a thinko in xs_connect()
NFSv4.1/pNFS: Fix borken function _same_data_server_addrs_locked()
NFS: nfs_set_pgio_error sometimes misses errors
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
- Switch back to using list_for_each_entry(). Fixes an incorrect test
for list NULL termination.
- Do not assume that lists are sorted.
- Finally, consider an existing entry to match if it consists of a subset
of the addresses in the new entry.
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We should ensure that we always set the pgio_header's error field
if a READ or WRITE RPC call returns an error. The current code depends
on 'hdr->good_bytes' always being initialised to a large value, which
is not always done correctly by callers.
When this happens, applications may end up missing important errors.
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
NFS: NFS over RDMA Client Side Changes
These patches improve both client performance and scalability, most notably
by increasing the maixmum allowed rsize and wsize and by increasing the number
of RDMA "credits". There are also several bugfixes, such as correcting how
WRITE compounds are encoded and fixing large NFS symlink operations.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Repair how rpcrdma_marshal_req() chooses which RDMA message type
to use for large non-WRITE operations so that it picks RDMA_NOMSG
in the correct situations, and sets up the marshaling logic to
SEND only the RPC/RDMA header.
Large NFSv2 SYMLINK requests now use RDMA_NOMSG calls. The Linux NFS
server XDR decoder for NFSv2 SYMLINK does not handle having the
pathname argument arrive in a separate buffer. The decoder could be
fixed, but this is simpler and RDMA_NOMSG can be used in a variety
of other situations.
Ensure that the Linux client continues to use "RDMA_MSG + read
list" when sending large NFSv3 SYMLINK requests, which is more
efficient than using RDMA_NOMSG.
Large NFSv4 CREATE(NF4LNK) requests are changed to use "RDMA_MSG +
read list" just like NFSv3 (see Section 5 of RFC 5667). Before,
these did not work at all.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
And call nfs_file_clear_open_context() directly. This makes it obvious
that nfs_file_release() will always return 0.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
All nfs_write_inode() does is pass its arguments to
nfs_commit_unstable_pages(). Let's cut out the middle man and have
nfs_write_pages() do the work directly.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
All these functions do is call nfs41_ping_server() without adding
anything. Let's remove them and give nfs41_ping_server() a better name
instead.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The idmap_init() and idmap_quit() functions only exist to call the
_keyring() version. Let's just call the keyring() functions directly.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
They already exist and do the exact same thing. Let's save ourselves
several lines of code!
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
nfs_readdir_xdr_to_array() uses both a cache array and an array of
pages, so I rename these functions to make it clearer how the code
works. nfs_readdir_large_page() becomes nfs_readdir_alloc_pages()
because this function has absolutely nothing to do with setting up a
large page. nfs_readdir_free_pagearray() becomes
nfs_readdir_free_pages() to stay consistent with the new alloc_pages()
function.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This variable is initialized to NULL and is never modified before being
passed to nfs_readdir_free_large_page(). But that's okay, because
nfs_readdir_free_large_page() only seems to exist as a way of calling
nfs_readdir_free_pagearray() without this parameter. Let's simplify by
removing pages_ptr and nfs_readdir_free_pagearray().
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We already know that pg_lseg is NULL here.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | | |
| | | |
| | | |
| | | |
| | | | |
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We generally want to read and write to a block device that's used by
the pNFS block layout client (and even if it's read only the server
has no way of telling us). Add FMODE_WRITE to the mode argument
so that we don't incorrectly tell the block driver that we want a
read-only open.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Instead of overwriting kernel memory reject too long signatures.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We need to replace the __be32 with a void pointer to do proper arithmentics
on the virtual addresses so that we can get the right page pointers.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We need to include the first u32 for the number of entries. Add a helper
for the calculation instead of opencoding it so that it's in one place.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| |_|/
|/| |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
---Steps to Reproduce--
<nfs-server>
# cat /etc/exports
/nfs/referal *(rw,insecure,no_subtree_check,no_root_squash,crossmnt)
/nfs/old *(ro,insecure,subtree_check,root_squash,crossmnt)
<nfs-client>
# mount -t nfs nfs-server:/nfs/ /mnt/
# ll /mnt/*/
<nfs-server>
# cat /etc/exports
/nfs/referal *(rw,insecure,no_subtree_check,no_root_squash,crossmnt,refer=/nfs/old/@nfs-server)
/nfs/old *(ro,insecure,subtree_check,root_squash,crossmnt)
# service nfs restart
<nfs-client>
# ll /mnt/*/ --->>>>> oops here
[ 5123.102925] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 5123.103363] IP: [<ffffffffa03ed38b>] nfs4_proc_get_locations+0x9b/0x120 [nfsv4]
[ 5123.103752] PGD 587b9067 PUD 3cbf5067 PMD 0
[ 5123.104131] Oops: 0000 [#1]
[ 5123.104529] Modules linked in: nfsv4(OE) nfs(OE) fscache(E) nfsd(OE) xfs libcrc32c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ppdev vmw_balloon parport_pc parport i2c_piix4 shpchp auth_rpcgss nfs_acl vmw_vmci lockd grace sunrpc vmwgfx drm_kms_helper ttm drm mptspi serio_raw scsi_transport_spi e1000 mptscsih mptbase ata_generic pata_acpi [last unloaded: nfsd]
[ 5123.105887] CPU: 0 PID: 15853 Comm: ::1-manager Tainted: G OE 4.2.0-rc6+ #214
[ 5123.106358] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
[ 5123.106860] task: ffff88007620f300 ti: ffff88005877c000 task.ti: ffff88005877c000
[ 5123.107363] RIP: 0010:[<ffffffffa03ed38b>] [<ffffffffa03ed38b>] nfs4_proc_get_locations+0x9b/0x120 [nfsv4]
[ 5123.107909] RSP: 0018:ffff88005877fdb8 EFLAGS: 00010246
[ 5123.108435] RAX: ffff880053f3bc00 RBX: ffff88006ce6c908 RCX: ffff880053a0d240
[ 5123.108968] RDX: ffffea0000e6d940 RSI: ffff8800399a0000 RDI: ffff88006ce6c908
[ 5123.109503] RBP: ffff88005877fe28 R08: ffffffff81c708a0 R09: 0000000000000000
[ 5123.110045] R10: 00000000000001a2 R11: ffff88003ba7f5c8 R12: ffff880054c55800
[ 5123.110618] R13: 0000000000000000 R14: ffff880053a0d240 R15: ffff880053a0d240
[ 5123.111169] FS: 0000000000000000(0000) GS:ffffffff81c27000(0000) knlGS:0000000000000000
[ 5123.111726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5123.112286] CR2: 0000000000000000 CR3: 0000000054cac000 CR4: 00000000001406f0
[ 5123.112888] Stack:
[ 5123.113458] ffffea0000e6d940 ffff8800399a0000 00000000000167d0 0000000000000000
[ 5123.114049] 0000000000000000 0000000000000000 0000000000000000 00000000a7ec82c6
[ 5123.114662] ffff88005877fe18 ffffea0000e6d940 ffff8800399a0000 ffff880054c55800
[ 5123.115264] Call Trace:
[ 5123.115868] [<ffffffffa03fb44b>] nfs4_try_migration+0xbb/0x220 [nfsv4]
[ 5123.116487] [<ffffffffa03fcb3b>] nfs4_run_state_manager+0x4ab/0x7b0 [nfsv4]
[ 5123.117104] [<ffffffffa03fc690>] ? nfs4_do_reclaim+0x510/0x510 [nfsv4]
[ 5123.117813] [<ffffffff810a4527>] kthread+0xd7/0xf0
[ 5123.118456] [<ffffffff810a4450>] ? kthread_worker_fn+0x160/0x160
[ 5123.119108] [<ffffffff816d9cdf>] ret_from_fork+0x3f/0x70
[ 5123.119723] [<ffffffff810a4450>] ? kthread_worker_fn+0x160/0x160
[ 5123.120329] Code: 4c 8b 6a 58 74 17 eb 52 48 8d 55 a8 89 c6 4c 89 e7 e8 4a b5 ff ff 8b 45 b0 85 c0 74 1c 4c 89 f9 48 8b 55 90 48 8b 75 98 48 89 df <41> ff 55 00 3d e8 d8 ff ff 41 89 c6 74 cf 48 8b 4d c8 65 48 33
[ 5123.121643] RIP [<ffffffffa03ed38b>] nfs4_proc_get_locations+0x9b/0x120 [nfsv4]
[ 5123.122308] RSP <ffff88005877fdb8>
[ 5123.122942] CR2: 0000000000000000
Fixes: ec011fe847 ("NFS: Introduce a vector of migration recovery ops")
Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The xprt created by svc_create_xprt have be added to serv->sv_permsocks.
So putting the xprt directly is useless.
Otherwise, there is a more svc_xprt_put after the xprt be freed.
v2, same as v1.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| |/
|/|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Commit 1d3d4437ea "vmscan: per-node deferred work" have made
register_shrinker can return an intergater error.
If register_shrinker() fail, the later unregister_shrinker() will
cause a NULL pointer access.
v2, same as v1.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
It is unusual to combine the open flags O_RDONLY and O_EXCL, but
it appears that libre-office does just that.
[pid 3250] stat("/home/USER/.config", {st_mode=S_IFDIR|0700, st_size=8192, ...}) = 0
[pid 3250] open("/home/USER/.config/libreoffice/4-suse/user/extensions/buildid", O_RDONLY|O_EXCL <unfinished ...>
NFSv4 takes O_EXCL as a sign that a setattr command should be sent,
probably to reset the timestamps.
When it was an O_RDONLY open, the SETATTR command does not
identify any actual attributes to change.
If no delegation was provided to the open, the SETATTR uses the
all-zeros stateid and the request is accepted (at least by the
Linux NFS server - no harm, no foul).
If a read-delegation was provided, this is used in the SETATTR
request, and a Netapp filer will justifiably claim
NFS4ERR_BAD_STATEID, which the Linux client takes as a sign
to retry - indefinitely.
So only treat O_EXCL specially if O_CREAT was also given.
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| |
| |
| |
| |
| |
| | |
Prevent a potential deadlock.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Turned out I misinterpreted the spec...
Cc: Tom Haynes <thomas.haynes@primarydata.com>
Reported-by: Jean Spector <jean@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|/
|
|
|
|
|
|
|
|
|
| |
pnfs_layout_mark_request_commit() needs to ensure that it adds the
request to the commit list atomically with all the other updates
in order to prevent corruption to buckets[ds_commit_idx].wlseg
due to races with pnfs_generic_clear_request_commit().
Fixes: 338d00cfef07d ("pnfs: Refactor the *_layout_mark_request_commit...")
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS fix from Al Viro:
"Spurious ENOTDIR fix"
This should fix the problems reported by Dominique Martinet and Hugh
Dickins.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
link_path_walk(): be careful when failing with ENOTDIR
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
In RCU mode we might end up with dentry evicted just we check
that it's a directory. In such case we should return ECHILD
rather than ENOTDIR, so that pathwalk would be retries in non-RCU
mode.
Breakage had been introduced in commit b18825a - prior to that
we were looking at nd->inode, which had been fetched before
verifying that ->d_seq was still valid. That form of check
would only be satisfied if at some point the pathname prefix
would indeed have resolved to a non-directory. The fix consists
of checking ->d_seq after we'd run into a non-directory dentry,
and failing with ECHILD in case of mismatch.
Note that all branches since 3.12 have that problem...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"Filipe fixed up a hard to trigger ENOSPC regression from our merge
window pull, and we have a few other smaller fixes"
* 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix quick exhaustion of the system array in the superblock
btrfs: its btrfs_err() instead of btrfs_error()
btrfs: Avoid NULL pointer dereference of free_extent_buffer when read_tree_block() fail
btrfs: Fix lockdep warning of btrfs_run_delayed_iputs()
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when
finishing block group creation"), introduced in 4.2-rc1, the following
test was failing due to exhaustion of the system array in the superblock:
#!/bin/bash
truncate -s 100T big.img
mkfs.btrfs big.img
mount -o loop big.img /mnt/loop
num=5
sz=10T
for ((i = 0; i < $num; i++)); do
echo fallocate $i $sz
fallocate -l $sz /mnt/loop/testfile$i
done
btrfs filesystem sync /mnt/loop
for ((i = 0; i < $num; i++)); do
echo rm $i
rm /mnt/loop/testfile$i
btrfs filesystem sync /mnt/loop
done
umount /mnt/loop
This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
allocation of system block groups. This happened because the test creates
a large number of data block groups per transaction and when committing
the transaction we start the writeout of the block group caches for all
the new new (dirty) block groups, which results in pre-allocating space
for each block group's free space cache using the same transaction handle.
That in turn often leads to creation of more block groups, and all get
attached to the new_bgs list of the same transaction handle to the point
of getting a list with over 1500 elements, and creation of new block groups
leads to the need of reserving space in the chunk block reserve and often
creating a new system block group too.
So that made us quickly exhaust the chunk block reserve/system space info,
because as of the commit mentioned before, we do reserve space for each
new block group in the chunk block reserve, unlike before where we would
not and would at most allocate one new system block group and therefore
would only ensure that there was enough space in the system space info to
allocate 1 new block group even if we ended up allocating thousands of
new block groups using the same transaction handle. That worked most of
the time because the computed required space at check_system_chunk() is
very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and
that all nodes/leafs in a path will be COWed and split) and since the
updates to the chunk tree all happen at btrfs_create_pending_block_groups
it is unlikely that a path needs to be COWed more than once (unless
writepages() for the btree inode is called by mm in between) and that
compensated for the need of creating any new nodes/leads in the chunk
tree.
So fix this by ensuring we don't accumulate a too large list of new block
groups in a transaction's handles new_bgs list, inserting/updating the
chunk tree for all accumulated new block groups and releasing the unused
space from the chunk block reserve whenever the list becomes sufficiently
large. This is a generic solution even though the problem currently can
only happen when starting the writeout of the free space caches for all
dirty block groups (btrfs_start_dirty_block_groups()).
Reported-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
sorry I indented to use btrfs_err() and I have no idea
how btrfs_error() got there.
infact I was thinking about these kind of oversights
since these two func are too closely named.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
read_tree_block() fail
When read_tree_block() failed, we can see following dmesg:
[ 134.371389] BUG: unable to handle kernel NULL pointer dereference at 0000000000000063
[ 134.372236] IP: [<ffffffff813a4a51>] free_extent_buffer+0x21/0x90
[ 134.372236] PGD 0
[ 134.372236] Oops: 0000 [#1] SMP
[ 134.372236] Modules linked in:
[ 134.372236] CPU: 0 PID: 2289 Comm: mount Not tainted 4.2.0-rc1_HEAD_c65b99f046843d2455aa231747b5a07a999a9f3d_+ #115
[ 134.372236] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
[ 134.372236] task: ffff88003b6e1a00 ti: ffff880011e60000 task.ti: ffff880011e60000
[ 134.372236] RIP: 0010:[<ffffffff813a4a51>] [<ffffffff813a4a51>] free_extent_buffer+0x21/0x90
...
[ 134.372236] Call Trace:
[ 134.372236] [<ffffffff81379aa1>] free_root_extent_buffers+0x91/0xb0
[ 134.372236] [<ffffffff81379c3d>] free_root_pointers+0x17d/0x190
[ 134.372236] [<ffffffff813801b0>] open_ctree+0x1ca0/0x25b0
[ 134.372236] [<ffffffff8144d017>] ? disk_name+0x97/0xb0
[ 134.372236] [<ffffffff813558aa>] btrfs_mount+0x8fa/0xab0
...
Reason:
read_tree_block() changed to return error number on fail,
and this value(not NULL) is set to tree_root->node, then subsequent
code will run to:
free_root_pointers()
->free_root_extent_buffers()
->free_extent_buffer()
->atomic_read((extent_buffer *)(-E_XXX)->refs);
and trigger above error.
Fix:
Set tree_root->node to NULL on fail to make error_handle code
happy.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Liu Bo <bo.li.liu@oracle.com> reported a lockdep warning of
delayed_iput_sem in xfstests generic/241:
[ 2061.345955] =============================================
[ 2061.346027] [ INFO: possible recursive locking detected ]
[ 2061.346027] 4.1.0+ #268 Tainted: G W
[ 2061.346027] ---------------------------------------------
[ 2061.346027] btrfs-cleaner/3045 is trying to acquire lock:
[ 2061.346027] (&fs_info->delayed_iput_sem){++++..}, at:
[<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
[ 2061.346027] but task is already holding lock:
[ 2061.346027] (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffff814063ab>] btrfs_run_delayed_iputs+0x6b/0x100
[ 2061.346027] other info that might help us debug this:
[ 2061.346027] Possible unsafe locking scenario:
[ 2061.346027] CPU0
[ 2061.346027] ----
[ 2061.346027] lock(&fs_info->delayed_iput_sem);
[ 2061.346027] lock(&fs_info->delayed_iput_sem);
[ 2061.346027]
*** DEADLOCK ***
It is rarely happened, about 1/400 in my test env.
The reason is recursion of btrfs_run_delayed_iputs():
cleaner_kthread
-> btrfs_run_delayed_iputs() *1
-> get delayed_iput_sem lock *2
-> iput()
-> ...
-> btrfs_commit_transaction()
-> btrfs_run_delayed_iputs() *1
-> get delayed_iput_sem lock (dead lock) *2
*1: recursion of btrfs_run_delayed_iputs()
*2: warning of lockdep about delayed_iput_sem
When fs is in high stress, new iputs may added into fs_info->delayed_iputs
list when btrfs_run_delayed_iputs() is running, which cause
second btrfs_run_delayed_iputs() run into down_read(&fs_info->delayed_iput_sem)
again, and cause above lockdep warning.
Actually, it will not cause real problem because both locks are read lock,
but to avoid lockdep warning, we can do a fix.
Fix:
Don't do btrfs_run_delayed_iputs() in btrfs_commit_transaction() for
cleaner_kthread thread to break above recursion path.
cleaner_kthread is calling btrfs_run_delayed_iputs() explicitly in code,
and don't need to call btrfs_run_delayed_iputs() again in
btrfs_commit_transaction(), it also give us a bonus to avoid stack overflow.
Test:
No above lockdep warning after patch in 1200 generic/241 tests.
Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
|