summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
...
* xfs: Properly retry failed inode items in case of error during buffer writebackCarlos Maiolino2017-09-206-5/+108
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit d3a304b6292168b83b45d624784f973fdc1ca674 upstream. When a buffer has been failed during writeback, the inode items into it are kept flush locked, and are never resubmitted due the flush lock, so, if any buffer fails to be written, the items in AIL are never written to disk and never unlocked. This causes unmount operation to hang due these items flush locked in AIL, but this also causes the items in AIL to never be written back, even when the IO device comes back to normal. I've been testing this patch with a DM-thin device, creating a filesystem larger than the real device. When writing enough data to fill the DM-thin device, XFS receives ENOSPC errors from the device, and keep spinning on xfsaild (when 'retry forever' configuration is set). At this point, the filesystem can not be unmounted because of the flush locked items in AIL, but worse, the items in AIL are never retried at all (once xfs_inode_item_push() will skip the items that are flush locked), even if the underlying DM-thin device is expanded to the proper size. This patch fixes both cases, retrying any item that has been failed previously, using the infra-structure provided by the previous patch. Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: Add infrastructure needed for error propagation during buffer IO failureCarlos Maiolino2017-09-202-3/+36
| | | | | | | | | | | | | | | | | | | | | | commit 0b80ae6ed13169bd3a244e71169f2cc020b0c57a upstream. With the current code, XFS never re-submit a failed buffer for IO, because the failed item in the buffer is kept in the flush locked state forever. To be able to resubmit an log item for IO, we need a way to mark an item as failed, if, for any reason the buffer which the item belonged to failed during writeback. Add a new log item callback to be used after an IO completion failure and make the needed clean ups. Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: remove xfs_trans_ail_delete_bulkChristoph Hellwig2017-09-203-60/+55
| | | | | | | | | | | | | | | | commit 27af1bbf524459962d1477a38ac6e0b7f79aaecc upstream. xfs_iflush_done uses an on-stack variable length array to pass the log items to be deleted to xfs_trans_ail_delete_bulk. On-stack VLAs are a nasty gcc extension that can lead to unbounded stack allocations, but fortunately we can easily avoid them by simply open coding xfs_trans_ail_delete_bulk in xfs_iflush_done, which is the only caller of it except for the single-item xfs_trans_ail_delete. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: toggle readonly state around xfs_log_mount_finishEric Sandeen2017-09-201-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | commit 6f4a1eefdd0ad4561543270a7fceadabcca075dd upstream. When we do log recovery on a readonly mount, unlinked inode processing does not happen due to the readonly checks in xfs_inactive(), which are trying to prevent any I/O on a readonly mount. This is misguided - we do I/O on readonly mounts all the time, for consistency; for example, log recovery. So do the same RDONLY flag twiddling around xfs_log_mount_finish() as we do around xfs_log_mount(), for the same reason. This all cries out for a big rework but for now this is a simple fix to an obvious problem. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: write unmount record for ro mountsEric Sandeen2017-09-201-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 757a69ef6cf2bf839bd4088e5609ddddd663b0c4 upstream. There are dueling comments in the xfs code about intent for log writes when unmounting a readonly filesystem. In xfs_mountfs, we see the intent: /* * Now the log is fully replayed, we can transition to full read-only * mode for read-only mounts. This will sync all the metadata and clean * the log so that the recovery we just performed does not have to be * replayed again on the next mount. */ and it calls xfs_quiesce_attr(), but by the time we get to xfs_log_unmount_write(), it returns early for a RDONLY mount: * Don't write out unmount record on read-only mounts. Because of this, sequential ro mounts of a filesystem with a dirty log will replay the log each time, which seems odd. Fix this by writing an unmount record even for RO mounts, as long as norecovery wasn't specified (don't write a clean log record if a dirty log may still be there!) and the log device is writable. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* iomap: fix integer truncation issues in the zeroing and dirtying helpersChristoph Hellwig2017-09-201-2/+2
| | | | | | | | | | | | | | | | | | commit e28ae8e428fefe2facd72cea9f29906ecb9c861d upstream. Fix the min_t calls in the zeroing and dirtying helpers to perform the comparisms on 64-bit types, which prevents them from incorrectly being truncated, and larger zeroing operations being stuck in a never ending loop. Special thanks to Markus Stockhausen for spotting the bug. Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: don't leak quotacheck dquots when cow recoveryDarrick J. Wong2017-09-201-0/+2
| | | | | | | | | | | | commit 77aff8c76425c8f49b50d0b9009915066739e7d2 upstream. If we fail a mount on account of cow recovery errors, it's possible that a previous quotacheck left some dquots in memory. The bailout clause of xfs_mountfs forgets to purge these, and so we leak them. Fix that. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: clear MS_ACTIVE after finishing log recoveryDarrick J. Wong2017-09-202-10/+11
| | | | | | | | | | | | | | | | | | | | | | | | commit 8204f8ddaafafcae074746fcf2a05a45e6827603 upstream. Way back when we established inode block-map redo log items, it was discovered that we needed to prevent the VFS from evicting inodes during log recovery because any given inode might be have bmap redo items to replay even if the inode has no link count and is ultimately deleted, and any eviction of an unlinked inode causes the inode to be truncated and freed too early. To make this possible, we set MS_ACTIVE so that inodes would not be torn down immediately upon release. Unfortunately, this also results in the quota inodes not being released at all if a later part of the mount process should fail, because we never reclaim the inodes. So, set MS_ACTIVE right before we do the last part of log recovery and clear it immediately after we finish the log recovery so that everything will be torn down properly if we abort the mount. Fixes: 17c12bcd30 ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: fix inobt inode allocation search optimizationOmar Sandoval2017-09-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c44245b3d5435f533ca8346ece65918f84c057f9 upstream. When we try to allocate a free inode by searching the inobt, we try to find the inode nearest the parent inode by searching chunks both left and right of the chunk containing the parent. As an optimization, we cache the leftmost and rightmost records that we previously searched; if we do another allocation with the same parent inode, we'll pick up the search where it last left off. There's a bug in the case where we found a free inode to the left of the parent's chunk: we need to update the cached left and right records, but because we already reassigned the right record to point to the left, we end up assigning the left record to both the cached left and right records. This isn't a correctness problem strictly, but it can result in the next allocation rechecking chunks unnecessarily or allocating inodes further away from the parent than it needs to. Fix it by swapping the record pointer after we update the cached left and right records. Fixes: bd169565993b ("xfs: speed up free inode search") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: Fix per-inode DAX flag inheritanceLukas Czerner2017-09-201-5/+7
| | | | | | | | | | | | | | | | | | | | | | commit 56bdf855e676f1f2ed7033f288f57dfd315725ba upstream. According to the commit that implemented per-inode DAX flag: commit 58f88ca2df72 ("xfs: introduce per-inode DAX enablement") the flag is supposed to act as "inherit flag". Currently this only works in the situations where parent directory already has a flag in di_flags set, otherwise inheritance does not work. This is because setting the XFS_DIFLAG2_DAX flag is done in a wrong branch designated for di_flags, not di_flags2. Fix this by moving the code to branch designated for setting di_flags2, which does test for flags in di_flags2. Fixes: 58f88ca2df72 ("xfs: introduce per-inode DAX enablement") Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: fix multi-AG deadlock in xfs_bunmapiChristoph Hellwig2017-09-201-0/+12
| | | | | | | | | | | | | | commit 5b094d6dac0451ad89b1dc088395c7b399b7e9e8 upstream. Just like in the allocator we must avoid touching multiple AGs out of order when freeing blocks, as freeing still locks the AGF and can cause the same AB-BA deadlocks as in the allocation path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: fix quotacheck dquot id overflow infinite loopBrian Foster2017-09-201-0/+3
| | | | | | | | | | | | | | | | | commit cfaf2d034360166e569a4929dd83ae9698bed856 upstream. If a dquot has an id of U32_MAX, the next lookup index increment overflows the uint32_t back to 0. This starts the lookup sequence over from the beginning, repeats indefinitely and results in a livelock. Update xfs_qm_dquot_walk() to explicitly check for the lookup overflow and exit the loop. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: check _alloc_read_agf buffer pointer before usingDarrick J. Wong2017-09-202-0/+6
| | | | | | | | | | | | | commit 10479e2dea83d4c421ad05dfc55d918aa8dfc0cd upstream. In some circumstances, _alloc_read_agf can return an error code of zero but also a null AGF buffer pointer. Check for this and jump out. Fixes-coverity-id: 1415250 Fixes-coverity-id: 1415320 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: set firstfsb to NULLFSBLOCK before feeding it to _bmapi_writeDarrick J. Wong2017-09-202-1/+10
| | | | | | | | | | | | | | commit 4c1a67bd3606540b9b42caff34a1d5cd94b1cf65 upstream. We must initialize the firstfsb parameter to _bmapi_write so that it doesn't incorrectly treat stack garbage as a restriction on which AGs it can search for free space. Fixes-coverity-id: 1402025 Fixes-coverity-id: 1415167 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: check _btree_check_block valueDarrick J. Wong2017-09-201-2/+4
| | | | | | | | | | | | | | commit 1e86eabe73b73c82e1110c746ed3ec6d5e1c0a0d upstream. Check the _btree_check_block return value for the firstrec and lastrec functions, since we have the ability to signal that the repositioning did not succeed. Fixes-coverity-id: 114067 Fixes-coverity-id: 114068 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: don't crash on unexpected holes in dir/attr btreesDarrick J. Wong2017-09-204-5/+5
| | | | | | | | | | | | | | | | | | | commit cd87d867920155911d0d2e6485b769d853547750 upstream. In quite a few places we call xfs_da_read_buf with a mappedbno that we don't control, then assume that the function passes back either an error code or a buffer pointer. Unfortunately, if mappedbno == -2 and bno maps to a hole, we get a return code of zero and a NULL buffer, which means that we crash if we actually try to use that buffer pointer. This happens immediately when we set the buffer type for transaction context. Therefore, check that we have no error code and a non-NULL bp before trying to use bp. This patch is a follow-up to an incomplete fix in 96a3aefb8ffde231 ("xfs: don't crash if reading a directory results in an unexpected hole"). Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: free cowblocks and retry on buffered write ENOSPCBrian Foster2017-09-201-0/+1
| | | | | | | | | | | | | | | | | commit cf2cb7845d6e101cb17bd62f8aa08cd514fc8988 upstream. XFS runs an eofblocks reclaim scan before returning an ENOSPC error to userspace for buffered writes. This facilitates aggressive speculative preallocation without causing user visible side effects such as premature ENOSPC. Run a cowblocks scan in the same situation to reclaim lingering COW fork preallocation throughout the filesystem. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: free uncommitted transactions during log recoveryBrian Foster2017-09-201-1/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 39775431f82f890f4aaa08860a30883d081bffc7 upstream. Log recovery allocates in-core transaction and member item data structures on-demand as it processes the on-disk log. Transactions are allocated on first encounter on-disk and stored in a hash table structure where they are easily accessible for subsequent lookups. Transaction items are also allocated on demand and are attached to the associated transactions. When a commit record is encountered in the log, the transaction is committed to the fs and the in-core structures are freed. If a filesystem crashes or shuts down before all in-core log buffers are flushed to the log, however, not all transactions may have commit records in the log. As expected, the modifications in such an incomplete transaction are not replayed to the fs. The in-core data structures for the partial transaction are never freed, however, resulting in a memory leak. Update xlog_do_recovery_pass() to first correctly initialize the hash table array so empty lists can be distinguished from populated lists on function exit. Update xlog_recover_free_trans() to always remove the transaction from the list prior to freeing the associated memory. Finally, walk the hash table of transaction lists as the last step before it goes out of scope and free any transactions that may remain on the lists. This prevents a memory leak of partial transactions in the log. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: don't allow bmap on rt filesDarrick J. Wong2017-09-201-2/+5
| | | | | | | | | | | | | | | | | commit 61d819e7bcb7f33da710bf3f5dcb2bcf1e48203c upstream. bmap returns a dumb LBA address but not the block device that goes with that LBA. Swapfiles don't care about this and will blindly assume that the data volume is the correct blockdev, which is totally bogus for files on the rt subvolume. This results in the swap code doing IOs to arbitrary locations on the data device(!) if the passed in mapping is a realtime file, so just turn off bmap for rt files. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: remove bli from AIL before release on transaction abortBrian Foster2017-09-201-9/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3d4b4a3e30ae7a949c31e1e10268a3da4723d290 upstream. When a buffer is modified, logged and committed, it ultimately ends up sitting on the AIL with a dirty bli waiting for metadata writeback. If another transaction locks and invalidates the buffer (freeing an inode chunk, for example) in the meantime, the bli is flagged as stale, the dirty state is cleared and the bli remains in the AIL. If a shutdown occurs before the transaction that has invalidated the buffer is committed, the transaction is ultimately aborted. The log items are flagged as such and ->iop_unlock() handles the aborted items. Because the bli is clean (due to the invalidation), ->iop_unlock() unconditionally releases it. The log item may still reside in the AIL, however, which means the I/O completion handler may still run and attempt to access it. This results in assert failure due to the release of the bli while still present in the AIL and a subsequent NULL dereference and panic in the buffer I/O completion handling. This can be reproduced by running generic/388 in repetition. To avoid this problem, update xfs_buf_item_unlock() to first check whether the bli is aborted and if so, remove it from the AIL before it is released. This ensures that the bli is no longer accessed during the shutdown sequence after it has been freed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: release bli from transaction properly on fs shutdownBrian Foster2017-09-201-7/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 79e641ce29cfae5b8fc55fb77ac62d11d2d849c0 upstream. If a filesystem shutdown occurs with a buffer log item in the CIL and a log force occurs, the ->iop_unpin() handler is generally expected to tear down the bli properly. This entails freeing the bli memory and releasing the associated hold on the buffer so it can be released and the filesystem unmounted. If this sequence occurs while ->bli_refcount is elevated (i.e., another transaction is open and attempting to modify the buffer), however, ->iop_unpin() may not be responsible for releasing the bli. Instead, the transaction may release the final ->bli_refcount reference and thus xfs_trans_brelse() is responsible for tearing down the bli. While xfs_trans_brelse() does drop the reference count, it only attempts to release the bli if it is clean (i.e., not in the CIL/AIL). If the filesystem is shutdown and the bli is sitting dirty in the CIL as noted above, this ends up skipping the last opportunity to release the bli. In turn, this leaves the hold on the buffer and causes an unmount hang. This can be reproduced by running generic/388 in repetition. Update xfs_trans_brelse() to handle this shutdown corner case correctly. If the final bli reference is dropped and the filesystem is shutdown, remove the bli from the AIL (if necessary) and release the bli to drop the buffer hold and ensure an unmount does not hang. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: try to avoid blowing out the transaction reservation when bunmaping a ↵Darrick J. Wong2017-09-207-24/+71
| | | | | | | | | | | | | | | | | | | | | | | | | | shared extent commit e1a4e37cc7b665b6804fba812aca2f4d7402c249 upstream. In a pathological scenario where we are trying to bunmapi a single extent in which every other block is shared, it's possible that trying to unmap the entire large extent in a single transaction can generate so many EFIs that we overflow the transaction reservation. Therefore, use a heuristic to guess at the number of blocks we can safely unmap from a reflink file's data fork in an single transaction. This should prevent problems such as the log head slamming into the tail and ASSERTs that trigger because we've exceeded the transaction reservation. Note that since bunmapi can fail to unmap the entire range, we must also teach the deferred unmap code to roll into a new transaction whenever we get low on reservation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [hch: random edits, all bugs are my fault] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: push buffer of flush locked dquot to avoid quotacheck deadlockBrian Foster2017-09-204-1/+89
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7912e7fef2aebe577f0b46d3cba261f2783c5695 upstream. Reclaim during quotacheck can lead to deadlocks on the dquot flush lock: - Quotacheck populates a local delwri queue with the physical dquot buffers. - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and dirties all of the dquots. - Reclaim kicks in and attempts to flush a dquot whose buffer is already queud on the quotacheck queue. The flush succeeds but queueing to the reclaim delwri queue fails as the backing buffer is already queued. The flush unlock is now deferred to I/O completion of the buffer from the quotacheck queue. - The dqadjust bulkstat continues and dirties the recently flushed dquot once again. - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires the flush lock to update the backing buffers with the in-core recalculated values. It deadlocks on the redirtied dquot as the flush lock was already acquired by reclaim, but the buffer resides on the local delwri queue which isn't submitted until the end of quotacheck. This is reproduced by running quotacheck on a filesystem with a couple million inodes in low memory (512MB-1GB) situations. This is a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write buffer lists"), which removed a trylock and buffer I/O submission from the quotacheck dquot flush sequence. Quotacheck first resets and collects the physical dquot buffers in a delwri queue. Then, it traverses the filesystem inodes via bulkstat, updates the in-core dquots, flushes the corrected dquots to the backing buffers and finally submits the delwri queue for I/O. Since the backing buffers are queued across the entire quotacheck operation, dquot reclaim cannot possibly complete a dquot flush before quotacheck completes. Therefore, quotacheck must submit the buffer for I/O in order to cycle the flush lock and flush the dirty in-core dquot to the buffer. Add a delwri queue buffer push mechanism to submit an individual buffer for I/O without losing the delwri queue status and use it from quotacheck to avoid the deadlock. This restores quotacheck behavior to as before the regression was introduced. Reported-by: Martin Svec <martin.svec@zoner.cz> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: fix spurious spin_is_locked() assert failures on non-smp kernelsBrian Foster2017-09-202-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | commit 95989c46d2a156365867b1d795fdefce71bce378 upstream. The 0-day kernel test robot reports assertion failures on !CONFIG_SMP kernels due to failed spin_is_locked() checks. As it turns out, spin_is_locked() is hardcoded to return zero on !CONFIG_SMP kernels and so this function cannot be relied on to verify spinlock state in this configuration. To avoid this problem, replace the associated asserts with lockdep variants that do the right thing regardless of kernel configuration. Drop the one assert that checks for an unlocked lock as there is no suitable lockdep variant for that case. This moves the spinlock checks from XFS debug code to lockdep, but generally provides the same level of protection. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* xfs: Move handling of missing page into one place in ↵Jan Kara2017-09-201-30/+8
| | | | | | | | | | | | | | | | | xfs_find_get_desired_pgoff() commit a54fba8f5a0dc36161cacdf2aa90f007f702ec1a upstream. Currently several places in xfs_find_get_desired_pgoff() handle the case of a missing page. Make them all handled in one place after the loop has terminated. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/switch_to/64: Rewrite FS/GS switching yet again to fix AMD CPUsAndy Lutomirski2017-09-201-105/+122
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e137a4d8f4dd2e277e355495b6b2cb241a8693c3 upstream. Switching FS and GS is a mess, and the current code is still subtly wrong: it assumes that "Loading a nonzero value into FS sets the index and base", which is false on AMD CPUs if the value being loaded is 1, 2, or 3. (The current code came from commit 3e2b68d752c9 ("x86/asm, sched/x86: Rewrite the FS and GS context switch code"), which made it better but didn't fully fix it.) Rewrite it to be much simpler and more obviously correct. This should fix it fully on AMD CPUs and shouldn't adversely affect performance. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Chang Seok <chang.seok.bae@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumpsAndy Lutomirski2017-09-201-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit 9584d98bed7a7a904d0702ad06bbcc94703cb5b4 upstream. In ELF_COPY_CORE_REGS, we're copying from the current task, so accessing thread.fsbase and thread.gsbase makes no sense. Just read the values from the CPU registers. In practice, the old code would have been correct most of the time simply because thread.fsbase and thread.gsbase usually matched the CPU registers. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Chang Seok <chang.seok.bae@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* x86/fsgsbase/64: Fully initialize FS and GS state in start_thread_commonAndy Lutomirski2017-09-201-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 767d035d838f4fd6b5a5bbd7a3f6d293b7f65a49 upstream. execve used to leak FSBASE and GSBASE on AMD CPUs. Fix it. The security impact of this bug is small but not quite zero -- it could weaken ASLR when a privileged task execs a less privileged program, but only if program changed bitness across the exec, or the child binary was highly unusual or actively malicious. A child program that was compromised after the exec would not have access to the leaked base. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Chang Seok <chang.seok.bae@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* f2fs: check hot_data for roll-forward recoveryJaegeuk Kim2017-09-201-1/+1
| | | | | | | | | | | | commit 125c9fb1ccb53eb2ea9380df40f3c743f3fb2fed upstream. We need to check HOT_DATA to truncate any previous data block when doing roll-forward recovery. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* f2fs: let fill_super handle roll-forward errorsJaegeuk Kim2017-09-201-2/+0
| | | | | | | | | | | | | | | commit afd2b4da40b3b567ef8d8e6881479345a2312a03 upstream. If we set CP_ERROR_FLAG in roll-forward error, f2fs is no longer to proceed any IOs due to f2fs_cp_error(). But, for example, if some stale data is involved on roll-forward process, we're able to get -ENOENT, getting fs stuck. If we get any error, let fill_super set SBI_NEED_FSCK and try to recover back to stable point. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ip_tunnel: fix setting ttl and tos value in collect_md modeHaishuang Yan2017-09-201-2/+2
| | | | | | | | | | | | | | [ Upstream commit 0f693f1995cf002432b70f43ce73f79bf8d0b6c9 ] ttl and tos variables are declared and assigned, but are not used in iptunnel_xmit() function. Fixes: cfc7381b3002 ("ip_tunnel: add collect_md mode to IPIP tunnel") Cc: Alexei Starovoitov <ast@fb.com> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sctp: fix missing wake ups in some situationsMarcelo Ricardo Leitner2017-09-201-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 7906b00f5cd1cd484fced7fcda892176e3202c8a ] Commit fb586f25300f ("sctp: delay calls to sk_data_ready() as much as possible") minimized the number of wake ups that are triggered in case the association receives a packet with multiple data chunks on it and/or when io_events are enabled and then commit 0970f5b36659 ("sctp: signal sk_data_ready earlier on data chunks reception") moved the wake up to as soon as possible. It thus relies on the state machine running later to clean the flag that the event was already generated. The issue is that there are 2 call paths that calls sctp_ulpq_tail_event() outside of the state machine, causing the flag to linger and possibly omitting a needed wake up in the sequence. One of the call paths is when enabling SCTP_SENDER_DRY_EVENTS via setsockopt(SCTP_EVENTS), as noticed by Harald Welte. The other is when partial reliability triggers removal of chunks from the send queue when the application calls sendmsg(). This commit fixes it by not setting the flag in case the socket is not owned by the user, as it won't be cleaned later. This works for user-initiated calls and also for rx path processing. Fixes: fb586f25300f ("sctp: delay calls to sk_data_ready() as much as possible") Reported-by: Harald Welte <laforge@gnumonks.org> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ipv6: fix typo in fib6_net_exit()Eric Dumazet2017-09-201-1/+1
| | | | | | | | | | | [ Upstream commit 32a805baf0fb70b6dbedefcd7249ac7f580f9e3b ] IPv6 FIB should use FIB6_TABLE_HASHSZ, not FIB_TABLE_HASHSZ. Fixes: ba1cc08d9488 ("ipv6: fix memory leak with multiple tables during netns destruction") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ipv6: fix memory leak with multiple tables during netns destructionSabrina Dubroca2017-09-201-6/+19
| | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit ba1cc08d9488c94cb8d94f545305688b72a2a300 ] fib6_net_exit only frees the main and local tables. If another table was created with fib6_alloc_table, we leak it when the netns is destroyed. Fix this in the same way ip_fib_net_exit cleans up tables, by walking through the whole hashtable of fib6_table's. We can get rid of the special cases for local and main, since they're also part of the hashtable. Reproducer: ip netns add x ip -net x -6 rule add from 6003:1::/64 table 100 ip netns del x Reported-by: Jianlin Shi <jishi@redhat.com> Fixes: 58f09b78b730 ("[NETNS][IPV6] ip6_fib - make it per network namespace") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ip6_gre: update mtu properly in ip6gre_errXin Long2017-09-201-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 5c25f30c93fdc5bf25e62101aeaae7a4f9b421b3 ] Now when probessing ICMPV6_PKT_TOOBIG, ip6gre_err only subtracts the offset of gre header from mtu info. The expected mtu of gre device should also subtract gre header. Otherwise, the next packets still can't be sent out. Jianlin found this issue when using the topo: client(ip6gre)<---->(nic1)route(nic2)<----->(ip6gre)server and reducing nic2's mtu, then both tcp and sctp's performance with big size data became 0. This patch is to fix it by also subtracting grehdr (tun->tun_hlen) from mtu info when updating gre device's mtu in ip6gre_err(). It also needs to subtract ETH_HLEN if gre dev'type is ARPHRD_ETHER. Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* vhost_net: correctly check tx avail during rx busy pollingJason Wang2017-09-201-1/+6
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit 8b949bef9172ca69d918e93509a4ecb03d0355e0 ] We check tx avail through vhost_enable_notify() in the past which is wrong since it only checks whether or not guest has filled more available buffer since last avail idx synchronization which was just done by vhost_vq_avail_empty() before. What we really want is checking pending buffers in the avail ring. Fix this by calling vhost_vq_avail_empty() instead. This issue could be noticed by doing netperf TCP_RR benchmark as client from guest (but not host). With this fix, TCP_RR from guest to localhost restores from 1375.91 trans per sec to 55235.28 trans per sec on my laptop (Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz). Fixes: 030881372460 ("vhost_net: basic polling support") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* gianfar: Fix Tx flow control deactivationClaudiu Manoil2017-09-201-1/+1
| | | | | | | | | | | | | | | | | | | [ Upstream commit 5d621672bc1a1e5090c1ac5432a18c79e0e13e03 ] The wrong register is checked for the Tx flow control bit, it should have been maccfg1 not maccfg2. This went unnoticed for so long probably because the impact is hardly visible, not to mention the tangled code from adjust_link(). First, link flow control (i.e. handling of Rx/Tx link level pause frames) is disabled by default (needs to be enabled via 'ethtool -A'). Secondly, maccfg2 always returns 0 for tx_flow_oldval (except for a few old boards), which results in Tx flow control remaining always on once activated. Fixes: 45b679c9a3ccd9e34f28e6ec677b812a860eb8eb ("gianfar: Implement PAUSE frame generation support") Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Revert "net: fix percpu memory leaks"Jesper Dangaard Brouer2017-09-205-41/+13
| | | | | | | | | | | | | | | | | [ Upstream commit 5a63643e583b6a9789d7a225ae076fb4e603991c ] This reverts commit 1d6119baf0610f813eb9d9580eb4fd16de5b4ceb. After reverting commit 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting") then here is no need for this fix-up patch. As percpu_counter is no longer used, it cannot memory leak it any-longer. Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting") Fixes: 1d6119baf061 ("net: fix percpu memory leaks") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Revert "net: use lib/percpu_counter API for fragmentation mem accounting"Jesper Dangaard Brouer2017-09-202-30/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit fb452a1aa3fd4034d7999e309c5466ff2d7005aa ] This reverts commit 6d7b857d541ecd1d9bd997c97242d4ef94b19de2. There is a bug in fragmentation codes use of the percpu_counter API, that can cause issues on systems with many CPUs. The frag_mem_limit() just reads the global counter (fbc->count), without considering other CPUs can have upto batch size (130K) that haven't been subtracted yet. Due to the 3MBytes lower thresh limit, this become dangerous at >=24 CPUs (3*1024*1024/130000=24). The correct API usage would be to use __percpu_counter_compare() which does the right thing, and takes into account the number of (online) CPUs and batch size, to account for this and call __percpu_counter_sum() when needed. We choose to revert the use of the lib/percpu_counter API for frag memory accounting for several reasons: 1) On systems with CPUs > 24, the heavier fully locked __percpu_counter_sum() is always invoked, which will be more expensive than the atomic_t that is reverted to. Given systems with more than 24 CPUs are becoming common this doesn't seem like a good option. To mitigate this, the batch size could be decreased and thresh be increased. 2) The add_frag_mem_limit+sub_frag_mem_limit pairs happen on the RX CPU, before SKBs are pushed into sockets on remote CPUs. Given NICs can only hash on L2 part of the IP-header, the NIC-RXq's will likely be limited. Thus, a fair chance that atomic add+dec happen on the same CPU. Revert note that commit 1d6119baf061 ("net: fix percpu memory leaks") removed init_frag_mem_limit() and instead use inet_frags_init_net(). After this revert, inet_frags_uninit_net() becomes empty. Fixes: 6d7b857d541e ("net: use lib/percpu_counter API for fragmentation mem accounting") Fixes: 1d6119baf061 ("net: fix percpu memory leaks") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* bridge: switchdev: Clear forward mark when transmitting packetIdo Schimmel2017-09-201-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 79e99bdd60b484af9afe0147e85a13e66d5c1cdb ] Commit 6bc506b4fb06 ("bridge: switchdev: Add forward mark support for stacked devices") added the 'offload_fwd_mark' bit to the skb in order to allow drivers to indicate to the bridge driver that they already forwarded the packet in L2. In case the bit is set, before transmitting the packet from each port, the port's mark is compared with the mark stored in the skb's control block. If both marks are equal, we know the packet arrived from a switch device that already forwarded the packet and it's not re-transmitted. However, if the packet is transmitted from the bridge device itself (e.g., br0), we should clear the 'offload_fwd_mark' bit as the mark stored in the skb's control block isn't valid. This scenario can happen in rare cases where a packet was trapped during L3 forwarding and forwarded by the kernel to a bridge device. Fixes: 6bc506b4fb06 ("bridge: switchdev: Add forward mark support for stacked devices") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reported-by: Yotam Gigi <yotamg@mellanox.com> Tested-by: Yotam Gigi <yotamg@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* mlxsw: spectrum: Forbid linking to devices that have uppersIdo Schimmel2017-09-203-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 25cc72a33835ed8a6f53180a822cadab855852ac ] The mlxsw driver relies on NETDEV_CHANGEUPPER events to configure the device in case a port is enslaved to a master netdev such as bridge or bond. Since the driver ignores events unrelated to its ports and their uppers, it's possible to engineer situations in which the device's data path differs from the kernel's. One example to such a situation is when a port is enslaved to a bond that is already enslaved to a bridge. When the bond was enslaved the driver ignored the event - as the bond wasn't one of its uppers - and therefore a bridge port instance isn't created in the device. Until such configurations are supported forbid them by checking that the upper device doesn't have uppers of its own. Fixes: 0d65fc13042f ("mlxsw: spectrum: Implement LAG port join/leave") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reported-by: Nogah Frankel <nogahf@mellanox.com> Tested-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* tcp: initialize rcv_mss to TCP_MIN_MSS instead of 0Wei Wang2017-09-201-0/+4
| | | | | | | | | | | | | | | | | | [ Upstream commit 499350a5a6e7512d9ed369ed63a4244b6536f4f8 ] When tcp_disconnect() is called, inet_csk_delack_init() sets icsk->icsk_ack.rcv_mss to 0. This could potentially cause tcp_recvmsg() => tcp_cleanup_rbuf() => __tcp_select_window() call path to have division by 0 issue. So this patch initializes rcv_mss to TCP_MIN_MSS instead of 0. Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Revert "net: phy: Correctly process PHY_HALTED in phy_stop_machine()"Florian Fainelli2017-09-201-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit ebc8254aeae34226d0bc8fda309fd9790d4dccfe ] This reverts commit 7ad813f208533cebfcc32d3d7474dc1677d1b09a ("net: phy: Correctly process PHY_HALTED in phy_stop_machine()") because it is creating the possibility for a NULL pointer dereference. David Daney provide the following call trace and diagram of events: When ndo_stop() is called we call: phy_disconnect() +---> phy_stop_interrupts() implies: phydev->irq = PHY_POLL; +---> phy_stop_machine() | +---> phy_state_machine() | +----> queue_delayed_work(): Work queued. +--->phy_detach() implies: phydev->attached_dev = NULL; Now at a later time the queued work does: phy_state_machine() +---->netif_carrier_off(phydev->attached_dev): Oh no! It is NULL: CPU 12 Unable to handle kernel paging request at virtual address 0000000000000048, epc == ffffffff80de37ec, ra == ffffffff80c7c Oops[#1]: CPU: 12 PID: 1502 Comm: kworker/12:1 Not tainted 4.9.43-Cavium-Octeon+ #1 Workqueue: events_power_efficient phy_state_machine task: 80000004021ed100 task.stack: 8000000409d70000 $ 0 : 0000000000000000 ffffffff84720060 0000000000000048 0000000000000004 $ 4 : 0000000000000000 0000000000000001 0000000000000004 0000000000000000 $ 8 : 0000000000000000 0000000000000000 00000000ffff98f3 0000000000000000 $12 : 8000000409d73fe0 0000000000009c00 ffffffff846547c8 000000000000af3b $16 : 80000004096bab68 80000004096babd0 0000000000000000 80000004096ba800 $20 : 0000000000000000 0000000000000000 ffffffff81090000 0000000000000008 $24 : 0000000000000061 ffffffff808637b0 $28 : 8000000409d70000 8000000409d73cf0 80000000271bd300 ffffffff80c7804c Hi : 000000000000002a Lo : 000000000000003f epc : ffffffff80de37ec netif_carrier_off+0xc/0x58 ra : ffffffff80c7804c phy_state_machine+0x48c/0x4f8 Status: 14009ce3 KX SX UX KERNEL EXL IE Cause : 00800008 (ExcCode 02) BadVA : 0000000000000048 PrId : 000d9501 (Cavium Octeon III) Modules linked in: Process kworker/12:1 (pid: 1502, threadinfo=8000000409d70000, task=80000004021ed100, tls=0000000000000000) Stack : 8000000409a54000 80000004096bab68 80000000271bd300 80000000271c1e00 0000000000000000 ffffffff808a1708 8000000409a54000 80000000271bd300 80000000271bd320 8000000409a54030 ffffffff80ff0f00 0000000000000001 ffffffff81090000 ffffffff808a1ac0 8000000402182080 ffffffff84650000 8000000402182080 ffffffff84650000 ffffffff80ff0000 8000000409a54000 ffffffff808a1970 0000000000000000 80000004099e8000 8000000402099240 0000000000000000 ffffffff808a8598 0000000000000000 8000000408eeeb00 8000000409a54000 00000000810a1d00 0000000000000000 8000000409d73de8 8000000409d73de8 0000000000000088 000000000c009c00 8000000409d73e08 8000000409d73e08 8000000402182080 ffffffff808a84d0 8000000402182080 ... Call Trace: [<ffffffff80de37ec>] netif_carrier_off+0xc/0x58 [<ffffffff80c7804c>] phy_state_machine+0x48c/0x4f8 [<ffffffff808a1708>] process_one_work+0x158/0x368 [<ffffffff808a1ac0>] worker_thread+0x150/0x4c0 [<ffffffff808a8598>] kthread+0xc8/0xe0 [<ffffffff808617f0>] ret_from_kernel_thread+0x14/0x1c The original motivation for this change originated from Marc Gonzales indicating that his network driver did not have its adjust_link callback executing with phydev->link = 0 while he was expecting it. PHYLIB has never made any such guarantees ever because phy_stop() merely just tells the workqueue to move into PHY_HALTED state which will happen asynchronously. Reported-by: Geert Uytterhoeven <geert+renesas@glider.be> Reported-by: David Daney <ddaney.cavm@gmail.com> Fixes: 7ad813f20853 ("net: phy: Correctly process PHY_HALTED in phy_stop_machine()") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* kcm: do not attach PF_KCM sockets to avoid deadlockEric Dumazet2017-09-201-0/+4
| | | | | | | | | | | | | | | | | | [ Upstream commit 351050ecd6523374b370341cc29fe61e2201556b ] syzkaller had no problem to trigger a deadlock, attaching a KCM socket to another one (or itself). (original syzkaller report was a very confusing lockdep splat during a sendmsg()) It seems KCM claims to only support TCP, but no enforcement is done, so we might need to add additional checks. Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Tom Herbert <tom@quantonium.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* packet: Don't write vnet header beyond end of bufferBenjamin Poirier2017-09-201-3/+9
| | | | | | | | | | | | | [ Upstream commit edbd58be15a957f6a760c4a514cd475217eb97fd ] ... which may happen with certain values of tp_reserve and maclen. Fixes: 58d19b19cd99 ("packet: vnet_hdr support for tpacket_rcv") Signed-off-by: Benjamin Poirier <bpoirier@suse.com> Cc: Willem de Bruijn <willemb@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* cxgb4: Fix stack out-of-bounds read due to wrong size to t4_record_mbox()Stefano Brivio2017-09-201-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 0f3086868e8889a823a6e0f3d299102aa895d947 ] Passing commands for logging to t4_record_mbox() with size MBOX_LEN, when the actual command size is actually smaller, causes out-of-bounds stack accesses in t4_record_mbox() while copying command words here: for (i = 0; i < size / 8; i++) entry->cmd[i] = be64_to_cpu(cmd[i]); Up to 48 bytes from the stack are then leaked to debugfs. This happens whenever we send (and log) commands described by structs fw_sched_cmd (32 bytes leaked), fw_vi_rxmode_cmd (48), fw_hello_cmd (48), fw_bye_cmd (48), fw_initialize_cmd (48), fw_reset_cmd (48), fw_pfvf_cmd (32), fw_eq_eth_cmd (16), fw_eq_ctrl_cmd (32), fw_eq_ofld_cmd (32), fw_acl_mac_cmd(16), fw_rss_glb_config_cmd(32), fw_rss_vi_config_cmd(32), fw_devlog_cmd(32), fw_vi_enable_cmd(48), fw_port_cmd(32), fw_sched_cmd(32), fw_devlog_cmd(32). The cxgb4vf driver got this right instead. When we call t4_record_mbox() to log a command reply, a MBOX_LEN size can be used though, as get_mbox_rpl() will fill cmd_rpl up completely. Fixes: 7f080c3f2ff0 ("cxgb4: Add support to enable logging of firmware mailbox commands") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* netvsc: fix deadlock betwen link status and removalstephen hemminger2017-09-201-1/+6
| | | | | | | | | | | | | | | | [ Upstream commit 9b4e946ce14e20d7addbfb7d9139e604f9fda107 ] There is a deadlock possible when canceling the link status delayed work queue. The removal process is run with RTNL held, and the link status callback is acquring RTNL. Resolve the issue by using trylock and rescheduling. If cancel is in process, that block it from happening. Fixes: 122a5f6410f4 ("staging: hv: use delayed_work for netvsc_send_garp()") Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* qlge: avoid memcpy buffer overflowArnd Bergmann2017-09-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit e58f95831e7468d25eb6e41f234842ecfe6f014f ] gcc-8.0.0 (snapshot) points out that we copy a variable-length string into a fixed length field using memcpy() with the destination length, and that ends up copying whatever follows the string: inlined from 'ql_core_dump' at drivers/net/ethernet/qlogic/qlge/qlge_dbg.c:1106:2: drivers/net/ethernet/qlogic/qlge/qlge_dbg.c:708:2: error: 'memcpy' reading 15 bytes from a region of size 14 [-Werror=stringop-overflow=] memcpy(seg_hdr->description, desc, (sizeof(seg_hdr->description)) - 1); Changing it to use strncpy() will instead zero-pad the destination, which seems to be the right thing to do here. The bug is probably harmless, but it seems like a good idea to address it in stable kernels as well, if only for the purpose of building with gcc-8 without warnings. Fixes: a61f80261306 ("qlge: Add ethtool register dump function.") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* sctp: Avoid out-of-bounds reads from address storageStefano Brivio2017-09-202-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit ee6c88bb754e3d363e568da78086adfedb692447 ] inet_diag_msg_sctp{,l}addr_fill() and sctp_get_sctp_info() copy sizeof(sockaddr_storage) bytes to fill in sockaddr structs used to export diagnostic information to userspace. However, the memory allocated to store sockaddr information is smaller than that and depends on the address family, so we leak up to 100 uninitialized bytes to userspace. Just use the size of the source structs instead, in all the three cases this is what userspace expects. Zero out the remaining memory. Unused bytes (i.e. when IPv4 addresses are used) in source structs sctp_sockaddr_entry and sctp_transport are already cleared by sctp_add_bind_addr() and sctp_transport_new(), respectively. Noticed while testing KASAN-enabled kernel with 'ss': [ 2326.885243] BUG: KASAN: slab-out-of-bounds in inet_sctp_diag_fill+0x42c/0x6c0 [sctp_diag] at addr ffff881be8779800 [ 2326.896800] Read of size 128 by task ss/9527 [ 2326.901564] CPU: 0 PID: 9527 Comm: ss Not tainted 4.11.0-22.el7a.x86_64 #1 [ 2326.909236] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/17/2017 [ 2326.917585] Call Trace: [ 2326.920312] dump_stack+0x63/0x8d [ 2326.924014] kasan_object_err+0x21/0x70 [ 2326.928295] kasan_report+0x288/0x540 [ 2326.932380] ? inet_sctp_diag_fill+0x42c/0x6c0 [sctp_diag] [ 2326.938500] ? skb_put+0x8b/0xd0 [ 2326.942098] ? memset+0x31/0x40 [ 2326.945599] check_memory_region+0x13c/0x1a0 [ 2326.950362] memcpy+0x23/0x50 [ 2326.953669] inet_sctp_diag_fill+0x42c/0x6c0 [sctp_diag] [ 2326.959596] ? inet_diag_msg_sctpasoc_fill+0x460/0x460 [sctp_diag] [ 2326.966495] ? __lock_sock+0x102/0x150 [ 2326.970671] ? sock_def_wakeup+0x60/0x60 [ 2326.975048] ? remove_wait_queue+0xc0/0xc0 [ 2326.979619] sctp_diag_dump+0x44a/0x760 [sctp_diag] [ 2326.985063] ? sctp_ep_dump+0x280/0x280 [sctp_diag] [ 2326.990504] ? memset+0x31/0x40 [ 2326.994007] ? mutex_lock+0x12/0x40 [ 2326.997900] __inet_diag_dump+0x57/0xb0 [inet_diag] [ 2327.003340] ? __sys_sendmsg+0x150/0x150 [ 2327.007715] inet_diag_dump+0x4d/0x80 [inet_diag] [ 2327.012979] netlink_dump+0x1e6/0x490 [ 2327.017064] __netlink_dump_start+0x28e/0x2c0 [ 2327.021924] inet_diag_handler_cmd+0x189/0x1a0 [inet_diag] [ 2327.028045] ? inet_diag_rcv_msg_compat+0x1b0/0x1b0 [inet_diag] [ 2327.034651] ? inet_diag_dump_compat+0x190/0x190 [inet_diag] [ 2327.040965] ? __netlink_lookup+0x1b9/0x260 [ 2327.045631] sock_diag_rcv_msg+0x18b/0x1e0 [ 2327.050199] netlink_rcv_skb+0x14b/0x180 [ 2327.054574] ? sock_diag_bind+0x60/0x60 [ 2327.058850] sock_diag_rcv+0x28/0x40 [ 2327.062837] netlink_unicast+0x2e7/0x3b0 [ 2327.067212] ? netlink_attachskb+0x330/0x330 [ 2327.071975] ? kasan_check_write+0x14/0x20 [ 2327.076544] netlink_sendmsg+0x5be/0x730 [ 2327.080918] ? netlink_unicast+0x3b0/0x3b0 [ 2327.085486] ? kasan_check_write+0x14/0x20 [ 2327.090057] ? selinux_socket_sendmsg+0x24/0x30 [ 2327.095109] ? netlink_unicast+0x3b0/0x3b0 [ 2327.099678] sock_sendmsg+0x74/0x80 [ 2327.103567] ___sys_sendmsg+0x520/0x530 [ 2327.107844] ? __get_locked_pte+0x178/0x200 [ 2327.112510] ? copy_msghdr_from_user+0x270/0x270 [ 2327.117660] ? vm_insert_page+0x360/0x360 [ 2327.122133] ? vm_insert_pfn_prot+0xb4/0x150 [ 2327.126895] ? vm_insert_pfn+0x32/0x40 [ 2327.131077] ? vvar_fault+0x71/0xd0 [ 2327.134968] ? special_mapping_fault+0x69/0x110 [ 2327.140022] ? __do_fault+0x42/0x120 [ 2327.144008] ? __handle_mm_fault+0x1062/0x17a0 [ 2327.148965] ? __fget_light+0xa7/0xc0 [ 2327.153049] __sys_sendmsg+0xcb/0x150 [ 2327.157133] ? __sys_sendmsg+0xcb/0x150 [ 2327.161409] ? SyS_shutdown+0x140/0x140 [ 2327.165688] ? exit_to_usermode_loop+0xd0/0xd0 [ 2327.170646] ? __do_page_fault+0x55d/0x620 [ 2327.175216] ? __sys_sendmsg+0x150/0x150 [ 2327.179591] SyS_sendmsg+0x12/0x20 [ 2327.183384] do_syscall_64+0xe3/0x230 [ 2327.187471] entry_SYSCALL64_slow_path+0x25/0x25 [ 2327.192622] RIP: 0033:0x7f41d18fa3b0 [ 2327.196608] RSP: 002b:00007ffc3b731218 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 2327.205055] RAX: ffffffffffffffda RBX: 00007ffc3b731380 RCX: 00007f41d18fa3b0 [ 2327.213017] RDX: 0000000000000000 RSI: 00007ffc3b731340 RDI: 0000000000000003 [ 2327.220978] RBP: 0000000000000002 R08: 0000000000000004 R09: 0000000000000040 [ 2327.228939] R10: 00007ffc3b730f30 R11: 0000000000000246 R12: 0000000000000003 [ 2327.236901] R13: 00007ffc3b731340 R14: 00007ffc3b7313d0 R15: 0000000000000084 [ 2327.244865] Object at ffff881be87797e0, in cache kmalloc-64 size: 64 [ 2327.251953] Allocated: [ 2327.254581] PID = 9484 [ 2327.257215] save_stack_trace+0x1b/0x20 [ 2327.261485] save_stack+0x46/0xd0 [ 2327.265179] kasan_kmalloc+0xad/0xe0 [ 2327.269165] kmem_cache_alloc_trace+0xe6/0x1d0 [ 2327.274138] sctp_add_bind_addr+0x58/0x180 [sctp] [ 2327.279400] sctp_do_bind+0x208/0x310 [sctp] [ 2327.284176] sctp_bind+0x61/0xa0 [sctp] [ 2327.288455] inet_bind+0x5f/0x3a0 [ 2327.292151] SYSC_bind+0x1a4/0x1e0 [ 2327.295944] SyS_bind+0xe/0x10 [ 2327.299349] do_syscall_64+0xe3/0x230 [ 2327.303433] return_from_SYSCALL_64+0x0/0x6a [ 2327.308194] Freed: [ 2327.310434] PID = 4131 [ 2327.313065] save_stack_trace+0x1b/0x20 [ 2327.317344] save_stack+0x46/0xd0 [ 2327.321040] kasan_slab_free+0x73/0xc0 [ 2327.325220] kfree+0x96/0x1a0 [ 2327.328530] dynamic_kobj_release+0x15/0x40 [ 2327.333195] kobject_release+0x99/0x1e0 [ 2327.337472] kobject_put+0x38/0x70 [ 2327.341266] free_notes_attrs+0x66/0x80 [ 2327.345545] mod_sysfs_teardown+0x1a5/0x270 [ 2327.350211] free_module+0x20/0x2a0 [ 2327.354099] SyS_delete_module+0x2cb/0x2f0 [ 2327.358667] do_syscall_64+0xe3/0x230 [ 2327.362750] return_from_SYSCALL_64+0x0/0x6a [ 2327.367510] Memory state around the buggy address: [ 2327.372855] ffff881be8779700: fc fc fc fc 00 00 00 00 00 00 00 00 fc fc fc fc [ 2327.380914] ffff881be8779780: fb fb fb fb fb fb fb fb fc fc fc fc 00 00 00 00 [ 2327.388972] >ffff881be8779800: 00 00 00 00 fc fc fc fc fb fb fb fb fb fb fb fb [ 2327.397031] ^ [ 2327.401792] ffff881be8779880: fc fc fc fc fb fb fb fb fb fb fb fb fc fc fc fc [ 2327.409850] ffff881be8779900: 00 00 00 00 00 04 fc fc fc fc fc fc 00 00 00 00 [ 2327.417907] ================================================================== This fixes CVE-2017-7558. References: https://bugzilla.redhat.com/show_bug.cgi?id=1480266 Fixes: 8f840e47f190 ("sctp: add the sctp_diag.c file") Cc: Xin Long <lucien.xin@gmail.com> Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* fsl/man: Inherit parent device and of_nodeFlorian Fainelli2017-09-201-0/+3
| | | | | | | | | | | | | | | | | | | | [ Upstream commit a1a50c8e4c241a505b7270e1a3c6e50d94e794b1 ] Junote Cai reported that he was not able to get a DSA setup involving the Freescale DPAA/FMAN driver to work and narrowed it down to of_find_net_device_by_node(). This function requires the network device's device reference to be correctly set which is the case here, though we have lost any device_node association there. The problem is that dpaa_eth_add_device() allocates a "dpaa-ethernet" platform device, and later on dpaa_eth_probe() is called but SET_NETDEV_DEV() won't be propagating &pdev->dev.of_node properly. Fix this by inherenting both the parent device and the of_node when dpaa_eth_add_device() creates the platform device. Fixes: 3933961682a3 ("fsl/fman: Add FMan MAC driver") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>