| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| | |
into for-linus2
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
commit fa1aa143ac4a ("selinux: extended permissions for ioctls")
introduced a bug into the handling of conditional rules, skipping the
processing entirely when the caller does not provide an extended
permissions (xperms) structure. Access checks from userspace using
/sys/fs/selinux/access do not include such a structure since that
interface does not presently expose extended permission information.
As a result, conditional rules were being ignored entirely on userspace
access requests, producing denials when access was allowed by
conditional rules in the policy. Fix the bug by only skipping
computation of extended permissions in this situation, not the entire
conditional rules processing.
Reported-by: Laurent Bigonville <bigon@debian.org>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
[PM: fixed long lines in patch description]
Cc: stable@vger.kernel.org # 4.3
Signed-off-by: Paul Moore <pmoore@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If a user key gets negatively instantiated, an error code is cached in the
payload area. A negatively instantiated key may be then be positively
instantiated by updating it with valid data. However, the ->update key
type method must be aware that the error code may be there.
The following may be used to trigger the bug in the user key type:
keyctl request2 user user "" @u
keyctl add user user "a" @u
which manifests itself as:
BUG: unable to handle kernel paging request at 00000000ffffff8a
IP: [<ffffffff810a376f>] __call_rcu.constprop.76+0x1f/0x280 kernel/rcu/tree.c:3046
PGD 7cc30067 PUD 0
Oops: 0002 [#1] SMP
Modules linked in:
CPU: 3 PID: 2644 Comm: a.out Not tainted 4.3.0+ #49
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88003ddea700 ti: ffff88003dd88000 task.ti: ffff88003dd88000
RIP: 0010:[<ffffffff810a376f>] [<ffffffff810a376f>] __call_rcu.constprop.76+0x1f/0x280
[<ffffffff810a376f>] __call_rcu.constprop.76+0x1f/0x280 kernel/rcu/tree.c:3046
RSP: 0018:ffff88003dd8bdb0 EFLAGS: 00010246
RAX: 00000000ffffff82 RBX: 0000000000000000 RCX: 0000000000000001
RDX: ffffffff81e3fe40 RSI: 0000000000000000 RDI: 00000000ffffff82
RBP: ffff88003dd8bde0 R08: ffff88007d2d2da0 R09: 0000000000000000
R10: 0000000000000000 R11: ffff88003e8073c0 R12: 00000000ffffff82
R13: ffff88003dd8be68 R14: ffff88007d027600 R15: ffff88003ddea700
FS: 0000000000b92880(0063) GS:ffff88007fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000ffffff8a CR3: 000000007cc5f000 CR4: 00000000000006e0
Stack:
ffff88003dd8bdf0 ffffffff81160a8a 0000000000000000 00000000ffffff82
ffff88003dd8be68 ffff88007d027600 ffff88003dd8bdf0 ffffffff810a39e5
ffff88003dd8be20 ffffffff812a31ab ffff88007d027600 ffff88007d027620
Call Trace:
[<ffffffff810a39e5>] kfree_call_rcu+0x15/0x20 kernel/rcu/tree.c:3136
[<ffffffff812a31ab>] user_update+0x8b/0xb0 security/keys/user_defined.c:129
[< inline >] __key_update security/keys/key.c:730
[<ffffffff8129e5c1>] key_create_or_update+0x291/0x440 security/keys/key.c:908
[< inline >] SYSC_add_key security/keys/keyctl.c:125
[<ffffffff8129fc21>] SyS_add_key+0x101/0x1e0 security/keys/keyctl.c:60
[<ffffffff8185f617>] entry_SYSCALL_64_fastpath+0x12/0x6a arch/x86/entry/entry_64.S:185
Note the error code (-ENOKEY) in EDX.
A similar bug can be tripped by:
keyctl request2 trusted user "" @u
keyctl add trusted user "a" @u
This should also affect encrypted keys - but that has to be correctly
parameterised or it will fail with EINVAL before getting to the bit that
will crashes.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
"Two fixes for 4.4-rc1's DM ioctl changes that introduced the potential
for infinite recursion on ioctl (with DM multipath).
And four stable fixes:
- A DM thin-provisioning fix to restore 'error_if_no_space' setting
when a thin-pool is made writable again (after having been out of
space).
- A DM thin-provisioning fix to properly advertise discard support
for thin volumes that are stacked on a thin-pool whose underlying
data device doesn't support discards.
- A DM ioctl fix to allow ctrl-c to break out of an ioctl retry loop
when DM multipath is configured to 'queue_if_no_path'.
- A DM crypt fix for a possible hang on dm-crypt device removal"
* tag 'dm-4.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm thin: fix regression in advertised discard limits
dm crypt: fix a possible hang due to race condition on exit
dm mpath: fix infinite recursion in ioctl when no paths and !queue_if_no_path
dm: do not reuse dm_blk_ioctl block_device input as local variable
dm: fix ioctl retry termination with signal
dm thin: restore requested 'error_if_no_space' setting on OODS to WRITE transition
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When establishing a thin device's discard limits we cannot rely on the
underlying thin-pool device's discard capabilities (which are inherited
from the thin-pool's underlying data device) given that DM thin devices
must provide discard support even when the thin-pool's underlying data
device doesn't support discards.
Users were exposed to this thin device discard limits regression if
their thin-pool's underlying data device does _not_ support discards.
This regression caused all upper-layers that called the
blkdev_issue_discard() interface to not be able to issue discards to
thin devices (because discard_granularity was 0). This regression
wasn't caught earlier because the device-mapper-test-suite's extensive
'thin-provisioning' discard tests are only ever performed against
thin-pool's with data devices that support discards.
Fix is to have thin_io_hints() test the pool's 'discard_enabled' feature
rather than inferring whether or not a thin device's discard support
should be enabled by looking at the thin-pool's discard_granularity.
Fixes: 216076705 ("dm thin: disable discard support for thin devices if pool's is disabled")
Reported-by: Mike Gerber <mike@sprachgewalt.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.1+
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
A kernel thread executes __set_current_state(TASK_INTERRUPTIBLE),
__add_wait_queue, spin_unlock_irq and then tests kthread_should_stop().
It is possible that the processor reorders memory accesses so that
kthread_should_stop() is executed before __set_current_state(). If such
reordering happens, there is a possible race on thread termination:
CPU 0:
calls kthread_should_stop()
it tests KTHREAD_SHOULD_STOP bit, returns false
CPU 1:
calls kthread_stop(cc->write_thread)
sets the KTHREAD_SHOULD_STOP bit
calls wake_up_process on the kernel thread, that sets the thread
state to TASK_RUNNING
CPU 0:
sets __set_current_state(TASK_INTERRUPTIBLE)
spin_unlock_irq(&cc->write_thread_wait.lock)
schedule() - and the process is stuck and never terminates, because the
state is TASK_INTERRUPTIBLE and wake_up_process on CPU 1 already
terminated
Fix this race condition by using a new flag DM_CRYPT_EXIT_THREAD to
signal that the kernel thread should exit. The flag is set and tested
while holding cc->write_thread_wait.lock, so there is no possibility of
racy access to the flag.
Also, remove the unnecessary set_task_state(current, TASK_RUNNING)
following the schedule() call. When the process was woken up, its state
was already set to TASK_RUNNING. Other kernel code also doesn't set the
state to TASK_RUNNING following schedule() (for example,
do_wait_for_common in completion.c doesn't do it).
Fixes: dc2676210c42 ("dm crypt: offload writes to thread")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
In multipath_prepare_ioctl(),
- pgpath is a path selected from available paths
- m->queue_io is true if we cannot send a request immediately to
paths, either because:
* there is no available path
* the path group needs activation (pg_init)
- pg_init is not started
- pg_init is still running
- m->queue_if_no_path is true if the device is configured to queue
I/O if there are no available paths
If !pgpath && !m->queue_if_no_path, the handler should return -EIO.
However in the course of refactoring the condition check has broken
and returns success in that case. Since bdev points to the dm device
itself, dm_blk_ioctl() calls __blk_dev_driver_ioctl() for itself and
recurses until crash.
You could reproduce the problem like this:
# dmsetup create mp --table '0 1024 multipath 0 0 0 0'
# sg_inq /dev/mapper/mp
<crash>
[ 172.648615] BUG: unable to handle kernel paging request at fffffffc81b10268
[ 172.662843] PGD 19dd067 PUD 0
[ 172.666269] Thread overran stack, or stack corrupted
[ 172.671808] Oops: 0000 [#1] SMP
...
Fix the condition check with some clarifications.
Fixes: e56f81e0b01e ("dm: refactor ioctl handling")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
(Ab)using the @bdev passed to dm_blk_ioctl() opens the potential for
targets' .prepare_ioctl to fail if they go on to check the bdev for
!NULL.
Fixes: e56f81e0b01e ("dm: refactor ioctl handling")
Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
dm-mpath retries ioctl, when no path is readily available and the device
is configured to queue I/O in such a case. If you want to stop the retry
before multipathd decides to turn off queueing mode, you could send
signal for the process to exit from the loop.
However the check of fatal signal has not carried along when commit
6c182cd88d17 ("dm mpath: fix ioctl deadlock when no paths") moved the
loop from dm-mpath to dm core. As a result, we can't terminate such
a process in the retry loop.
Easy reproducer of the situation is:
# dmsetup create mp --table '0 1024 multipath 0 0 0 0'
# dmsetup message mp 0 'queue_if_no_path'
# sg_inq /dev/mapper/mp
then you should be able to terminate sg_inq by pressing Ctrl+C.
Fixes: 6c182cd88d17 ("dm mpath: fix ioctl deadlock when no paths")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
transition
A thin-pool that is in out-of-data-space (OODS) mode may transition back
to write mode -- without the admin adding more space to the thin-pool --
if/when blocks are released (either by deleting thin devices or
discarding provisioned blocks).
But as part of the thin-pool's earlier transition to out-of-data-space
mode the thin-pool may have set the 'error_if_no_space' flag to true if
the no_space_timeout expires without more space having been made
available. That implementation detail, of changing the pool's
error_if_no_space setting, needs to be reset back to the default that
the user specified when the thin-pool's table was loaded.
Otherwise we'll drop the user requested behaviour on the floor when this
out-of-data-space to write mode transition occurs.
Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Fixes: 2c43fd26e4 ("dm thin: fix missing out-of-data-space to write mode transition if blocks are released")
Cc: stable@vger.kernel.org
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
I got a crash during a "perf top" session that was caused by a race in
__task_pid_nr_ns() :
pid_nr_ns() was inlined, but apparently compiler chose to read
task->pids[type].pid twice, and the pid->level dereference crashed
because we got a NULL pointer at the second read :
if (pid && ns->level <= pid->level) { // CRASH
Just use RCU API properly to solve this race, and not worry about "perf
top" crashing hosts :(
get_task_pid() can benefit from same fix.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Pull block layer fixes from Jens Axboe:
"A round of fixes/updates for the current series.
This looks a little bigger than it is, but that's mainly because we
pushed the lightnvm enabled null_blk change out of the merge window so
it could be updated a bit. The rest of the volume is also mostly
lightnvm. In particular:
- Lightnvm. Various fixes, additions, updates from Matias and
Javier, as well as from Wenwei Tao.
- NVMe:
- Fix for potential arithmetic overflow from Keith.
- Also from Keith, ensure that we reap pending completions from
a completion queue before deleting it. Fixes kernel crashes
when resetting a device with IO pending.
- Various little lightnvm related tweaks from Matias.
- Fixup flushes to go through the IO scheduler, for the cases where a
flush is not required. Fixes a case in CFQ where we would be
idling and not see this request, hence not break the idling. From
Jan Kara.
- Use list_{first,prev,next} in elevator.c for cleaner code. From
Gelian Tang.
- Fix for a warning trigger on btrfs and raid on single queue blk-mq
devices, where we would flush plug callbacks with preemption
disabled. From me.
- A mac partition validation fix from Kees Cook.
- Two merge fixes from Ming, marked stable. A third part is adding a
new warning so we'll notice this quicker in the future, if we screw
up the accounting.
- Cleanup of thread name/creation in mtip32xx from Rasmus Villemoes"
* 'for-linus' of git://git.kernel.dk/linux-block: (32 commits)
blk-merge: warn if figured out segment number is bigger than nr_phys_segments
blk-merge: fix blk_bio_segment_split
block: fix segment split
blk-mq: fix calling unplug callbacks with preempt disabled
mac: validate mac_partition is within sector
mtip32xx: use formatting capability of kthread_create_on_node
NVMe: reap completion entries when deleting queue
lightnvm: add free and bad lun info to show luns
lightnvm: keep track of block counts
nvme: lightnvm: use admin queues for admin cmds
lightnvm: missing free on init error
lightnvm: wrong return value and redundant free
null_blk: do not del gendisk with lightnvm
null_blk: use device addressing mode
null_blk: use ppa_cache pool
NVMe: Fix possible arithmetic overflow for max segments
blk-flush: Queue through IO scheduler when flush not required
null_blk: register as a LightNVM device
elevator: use list_{first,prev,next}_entry
lightnvm: cleanup queue before target removal
...
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We had seen lots of reports of this kind issue, so add one
warnning in blk-merge, then it can be triggered easily and
avoid to depend on warning/bug from drivers.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Commit bdced438acd83a(block: setup bi_phys_segments after
splitting) introduces function of computing bio->bi_phys_segments
during bio splitting.
Unfortunately both bio->bi_seg_front_size and bio->bi_seg_back_size
arn't computed, so too many physical segments may be obtained
for one request since both the two are used to check if one segment
across two bios can be possible.
This patch fixes the issue by computing the two variables in
blk_bio_segment_split().
Fixes: bdced438acd83a(block: setup bi_phys_segments after splitting)
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Mark Salter <msalter@redhat.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Mark Salter <msalter@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Inside blk_bio_segment_split(), previous bvec pointer(bvprvp)
always points to the iterator local variable, which is obviously
wrong, so fix it by pointing to the local variable of 'bvprv'.
Fixes: 5014c311baa2b(block: fix bogus compiler warnings in blk-merge.c)
Cc: stable@kernel.org #4.3
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Mark Salter <msalter@redhat.com>
Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Mark Salter <msalter@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Liu reported that running certain parts of xfstests threw the
following error:
BUG: sleeping function called from invalid context at mm/page_alloc.c:3190
in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u16:0
3 locks held by kworker/u16:0/6:
#0: ("writeback"){++++.+}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730
#1: ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730
#2: (&type->s_umount_key#44){+++++.}, at: [<ffffffff811e6805>] trylock_super+0x25/0x60
CPU: 5 PID: 6 Comm: kworker/u16:0 Tainted: G OE 4.3.0+ #3
Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
Workqueue: writeback wb_workfn (flush-btrfs-108)
ffffffff81a3abab ffff88042e282ba8 ffffffff8130191b ffffffff81a3abab
0000000000000c76 ffff88042e282ba8 ffff88042e27c180 ffff88042e282bd8
ffffffff8108ed95 ffff880400000004 0000000000000000 0000000000000c76
Call Trace:
[<ffffffff8130191b>] dump_stack+0x4f/0x74
[<ffffffff8108ed95>] ___might_sleep+0x185/0x240
[<ffffffff8108eea2>] __might_sleep+0x52/0x90
[<ffffffff811817e8>] __alloc_pages_nodemask+0x268/0x410
[<ffffffff8109a43c>] ? sched_clock_local+0x1c/0x90
[<ffffffff8109a6d1>] ? local_clock+0x21/0x40
[<ffffffff810b9eb0>] ? __lock_release+0x420/0x510
[<ffffffff810b534c>] ? __lock_acquired+0x16c/0x3c0
[<ffffffff811ca265>] alloc_pages_current+0xc5/0x210
[<ffffffffa0577105>] ? rbio_is_full+0x55/0x70 [btrfs]
[<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
[<ffffffff81666d50>] ? _raw_spin_unlock_irqrestore+0x40/0x60
[<ffffffffa0578c0a>] full_stripe_write+0x5a/0xc0 [btrfs]
[<ffffffffa0578ca9>] __raid56_parity_write+0x39/0x60 [btrfs]
[<ffffffffa0578deb>] run_plug+0x11b/0x140 [btrfs]
[<ffffffffa0578e33>] btrfs_raid_unplug+0x23/0x70 [btrfs]
[<ffffffff812d36c2>] blk_flush_plug_list+0x82/0x1f0
[<ffffffff812e0349>] blk_sq_make_request+0x1f9/0x740
[<ffffffff812ceba2>] ? generic_make_request_checks+0x222/0x7c0
[<ffffffff812cf264>] ? blk_queue_enter+0x124/0x310
[<ffffffff812cf1d2>] ? blk_queue_enter+0x92/0x310
[<ffffffff812d0ae2>] generic_make_request+0x172/0x2c0
[<ffffffff812d0ad4>] ? generic_make_request+0x164/0x2c0
[<ffffffff812d0ca0>] submit_bio+0x70/0x140
[<ffffffffa0577b29>] ? rbio_add_io_page+0x99/0x150 [btrfs]
[<ffffffffa0578a89>] finish_rmw+0x4d9/0x600 [btrfs]
[<ffffffffa0578c4c>] full_stripe_write+0x9c/0xc0 [btrfs]
[<ffffffffa057ab7f>] raid56_parity_write+0xef/0x160 [btrfs]
[<ffffffffa052bd83>] btrfs_map_bio+0xe3/0x2d0 [btrfs]
[<ffffffffa04fbd6d>] btrfs_submit_bio_hook+0x8d/0x1d0 [btrfs]
[<ffffffffa05173c4>] submit_one_bio+0x74/0xb0 [btrfs]
[<ffffffffa0517f55>] submit_extent_page+0xe5/0x1c0 [btrfs]
[<ffffffffa0519b18>] __extent_writepage_io+0x408/0x4c0 [btrfs]
[<ffffffffa05179c0>] ? alloc_dummy_extent_buffer+0x140/0x140 [btrfs]
[<ffffffffa051dc88>] __extent_writepage+0x218/0x3a0 [btrfs]
[<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
[<ffffffffa051e2c9>] extent_write_cache_pages.clone.0+0x2f9/0x400 [btrfs]
[<ffffffffa051e422>] extent_writepages+0x52/0x70 [btrfs]
[<ffffffffa05001f0>] ? btrfs_set_inode_index+0x70/0x70 [btrfs]
[<ffffffffa04fcc17>] btrfs_writepages+0x27/0x30 [btrfs]
[<ffffffff81184df3>] do_writepages+0x23/0x40
[<ffffffff81212229>] __writeback_single_inode+0x89/0x4d0
[<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480
[<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480
[<ffffffff8121295f>] ? writeback_sb_inodes+0x15f/0x480
[<ffffffff81212ad2>] writeback_sb_inodes+0x2d2/0x480
[<ffffffff810b1397>] ? down_read_trylock+0x57/0x60
[<ffffffff811e6805>] ? trylock_super+0x25/0x60
[<ffffffff810d629f>] ? rcu_read_lock_sched_held+0x4f/0x90
[<ffffffff81212d0c>] __writeback_inodes_wb+0x8c/0xc0
[<ffffffff812130b5>] wb_writeback+0x2b5/0x500
[<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0
[<ffffffff810660a8>] ? __local_bh_enable_ip+0x68/0xc0
[<ffffffff81213362>] ? wb_do_writeback+0x62/0x310
[<ffffffff812133c1>] wb_do_writeback+0xc1/0x310
[<ffffffff8107c3d9>] ? set_worker_desc+0x79/0x90
[<ffffffff81213842>] wb_workfn+0x92/0x330
[<ffffffff8107f133>] process_one_work+0x223/0x730
[<ffffffff8107f083>] ? process_one_work+0x173/0x730
[<ffffffff8108035f>] ? worker_thread+0x18f/0x430
[<ffffffff810802ed>] worker_thread+0x11d/0x430
[<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0
[<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0
[<ffffffff810858df>] kthread+0xef/0x110
[<ffffffff8108f74e>] ? schedule_tail+0x1e/0xd0
[<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70
[<ffffffff816673bf>] ret_from_fork+0x3f/0x70
[<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70
The issue is that we've got the software context pinned while
calling blk_flush_plug_list(), which flushes callbacks that
are allowed to sleep. btrfs and raid has such callbacks.
Flip the checks around a bit, so we can enable preempt a bit
earlier and flush plugs without having preempt disabled.
This only affects blk-mq driven devices, and only those that
register a single queue.
Reported-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Liu Bo <bo.li.liu@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If md->signature == MAC_DRIVER_MAGIC and md->block_size == 1023, a single
512 byte sector would be read (secsize / 512). However the partition
structure would be located past the end of the buffer (secsize % 512).
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
kthread_create_on_node takes format+args, so there's no need to do the
pretty-printing in advance. Moreover, "mtip_svc_thd_99" (including its
'\0') only just fits in 16 bytes, so if index could ever go above 99
we'd have a stack buffer overflow.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Make sure that there are no unprocesssed entries on a completion
queue before deleting it, and check for validity of the CQ
door bell before writing completions to it.
This fixes problems with doing a sysfs reset of the device while
it's handling IO.
Tested-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Add free block, used block, and bad block information to the show debug
interface. This information is used to debug how targets track blocks.
Also, change debug function name to make it more generic.
Signed-off-by: Javier Gonzalez <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Maintain number of in use blocks, free blocks, and bad blocks in a per
lun basis. This allows the upper layers to get information about the
state of each lun.
Also, account for blocks reserved to the device on the free block count.
nr_free_blocks matches now the actual number of blocks on the free list
when the device is booted.
Signed-off-by: Javier Gonzalez <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
According to the Open-Channel SSD Specification, the NVMe-NVM admin
commands use vendor specific opcodes of NVMe, so use the NVMe admin
queue to dispatch these commands.
Signed-off-by: Wenwei Tao <ww.tao0320@gmail.com>
Updated by me to include set bad block table as well and also use
the admin queue for l2p len calculation.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If either max_phys_sect is out of bound, the nvm_dev structure is not
freed.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The return value should be non-zero under error conditions.
Remove nvme_free(dev) to avoid free dev more than once.
Signed-off-by: Wenwei Tao <ww.tao0320@gmail.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The gendisk structure has not been initialized when using lightnvm.
Make sure to not delete it upon exit. Also make sure that we use the
appropriate disk_name at unregistration.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The linear addressing mode was removed in 7386af2. Make null_blk instead
expose the ppa format geometry and support the generic addressing mode.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Instead of using a page pool, we can save memory by only allocating room
for 64 entries for the ppa command. Introduce a ppa_cache to allocate only
the required memory for the ppa list.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Reported-by: Paul Grabinar <paul.grabinar@ranbarg.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Currently blk_insert_flush() just adds flush request to q->queue_head
when flush is not required. That completely bypasses IO scheduler so
e.g. CFQ can be idling waiting for new request to arrive and will idle
through the whole window unnecessarily. Luckily this only happens in
rare cases as usually checks in generic_make_request_checks() clear
FLUSH and FUA flags early if they are not needed.
When no flushing is actually required, we can easily fix the problem by
properly queueing the request through the IO scheduler. Ideally IO
scheduler should be also made aware of requests queued via
blk_flush_queue_rq(). However inserting flush request through IO
scheduler can have unwanted side-effects since due to flush batching
delaying the flush request in IO scheduler will delay all flush requests
possibly coming from other processes. So we keep adding the request
directly to q->queue_head.
Signed-off-by: Jan Kara <jack@suse.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Add support for registering as a LightNVM device. This allows us to
evaluate the performance of the LightNVM subsystem.
In /drivers/Makefile, LightNVM is moved above block device drivers
to make sure that the LightNVM media managers have been initialized
before drivers under /drivers/block are initialized.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Fix by Jens Axboe to remove unneeded slab cache and the following
memory leak.
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
To make the intention clearer, use list_{first,prev,next}_entry
instead of list_entry.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This prevents outstanding IOs to be sent for completion to target after
the target has been removed. The flow is now: stop new IOs > cleanup
queue > remove target.
Signed-off-by: Javier Gonzalez <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The specification was updated the remove the double word just after
number of configuration groups and capabilities. Update the identify
structure to reflect it.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The ppa format was not copied from the NVMe specific ppa format to the
lightnvm specific ppa format. This led to the ppa format not being
communicated to the layers above.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The linear and device specific address modes can be replaced with a
simple offset and bit length conversion that is generic across all
devices.
This both simplifies the specification and removes the special case for
qemu nvme, that previously relied on the linear address mapping.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Both the nvm_register and nvm_init does a kfree(dev) on error. Make sure
to only free it once.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We register with nvm_devices when there registration can still fail.
Move the final registration at the end of the nvm_register function
to make sure we are fully registered when added to the nvm_devices list.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Only NAND flash with SLC and MLC is supported. Make sure to not try to
initialize TLC memory or other non-volatile memory types.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The nvm_id, nvm_id_group and nvm_addr_format data structures contain
reserved attributes. They are unused by media managers and targets.
Remove them.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The mccap field is required for I/O command option support. It defines the
following flash access modes:
* SLC mode
* Erase/Program Suspension
* Scramble On/Off
* Encryption
It is slotted in between mpos and cpar, changing the offset for
cpar as well.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
A single 8 bit and 16 bit reserve field were inserted in the
specification to align fields appropriately. Reflect this in the
identify group structure.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The specification was changed to reflect a multi-value bad block table.
Instead of bit-based bad block table, the bad block table now allows
eight bad block categories. Currently four are defined:
* Factory bad blocks
* Grown bad blocks
* Device-side reserved blocks
* Host-side reserved blocks
The factory and grown bad blocks are the regular bad blocks. The
reserved blocks are either for internal use or external use. In
particular, the device-side reserved blocks allows the host to
bootstrap from a limited number of flash blocks. Reducing the flash
blocks to scan upon super block initialization.
Support for both get bad block table and set bad block table is added.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The max_phys_sect variable is defined as a char. We do a boundary check
to maximally allow 256 physical page descriptors per command. As we are
not indexing from zero. This expression is always false. Bump the
max_phys_sect to an unsigned int to support the range check.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
| |/ /
| | |
| | |
| | |
| | | |
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest fixes from Shuah Khan:
"This update consists of one minor documentation fix and a fix to an
existing test"
* tag 'linux-kselftest-4.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/seccomp: Get page size from sysconf
tools:testing/selftests: fix typo in futex/README
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The commit fd88d16c58c2 ("selftests/seccomp: Be more precise with
syscall arguments.") use PAGE_SIZE directly which lead to build
failure on arm64.
Replace it with generic interface(sysconf(_SC_PAGESIZE)) to fix this
failure.
Build and test successful on x86_64 and arm64.
Signed-off-by: Bamvor Jian Zhang <bamvor.zhangjian@linaro.org>
Acked-by: Kees Cook <keescook@chromium.org>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
|
| |/ /
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Correct typo in tools/testing/selftests/futex/README.
Signed-off-by: Wei Jiangang <weijg.fnst@cn.fujitsu.com>
Acked-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
|
| | | |
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Merge slub bulk allocator updates from Andrew Morton:
"This missed the merge window because I was waiting for some repairs to
come in. Nothing actually uses the bulk allocator yet and the changes
to other code paths are pretty small. And the net guys are waiting
for this so they can start merging the client code"
More comments from Jesper Dangaard Brouer:
"The kmem_cache_alloc_bulk() call, in mm/slub.c, were included in
previous kernel. The present version contains a bug. Vladimir
Davydov noticed it contained a bug, when kernel is compiled with
CONFIG_MEMCG_KMEM (see commit 03ec0ed57ffc: "slub: fix kmem cgroup
bug in kmem_cache_alloc_bulk"). Plus the mem cgroup counterpart in
kmem_cache_free_bulk() were missing (see commit 033745189b1b "slub:
add missing kmem cgroup support to kmem_cache_free_bulk").
I don't consider the fix stable-material because there are no in-tree
users of the API.
But with known bugs (for memcg) I cannot start using the API in the
net-tree"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
slab/slub: adjust kmem_cache_alloc_bulk API
slub: add missing kmem cgroup support to kmem_cache_free_bulk
slub: fix kmem cgroup bug in kmem_cache_alloc_bulk
slub: optimize bulk slowpath free by detached freelist
slub: support for bulk free with SLUB freelists
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Adjust kmem_cache_alloc_bulk API before we have any real users.
Adjust API to return type 'int' instead of previously type 'bool'. This
is done to allow future extension of the bulk alloc API.
A future extension could be to allow SLUB to stop at a page boundary, when
specified by a flag, and then return the number of objects.
The advantage of this approach, would make it easier to make bulk alloc
run without local IRQs disabled. With an approach of cmpxchg "stealing"
the entire c->freelist or page->freelist. To avoid overshooting we would
stop processing at a slab-page boundary. Else we always end up returning
some objects at the cost of another cmpxchg.
To keep compatible with future users of this API linking against an older
kernel when using the new flag, we need to return the number of allocated
objects with this API change.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|