summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* net: phy: Fix not to call phy_resume() if PHY is not attachedYoshihiro Shimoda2018-12-031-5/+6
| | | | | | | | | This patch fixes an issue that mdio_bus_phy_resume() doesn't call phy_resume() if the PHY is not attached. Fixes: 803dd9c77ac3 ("net: phy: avoid suspending twice a PHY") Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tun: remove skb access after netif_receive_skbPrashant Bhole2018-12-031-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In tun.c skb->len was accessed while doing stats accounting after a call to netif_receive_skb. We can not access skb after this call because buffers may be dropped. The fix for this bug would be to store skb->len in local variable and then use it after netif_receive_skb(). IMO using xdp data size for accounting bytes will be better because input for tun_xdp_one() is xdp_buff. Hence this patch: - fixes a bug by removing skb access after netif_receive_skb() - uses xdp data size for accounting bytes [613.019057] BUG: KASAN: use-after-free in tun_sendmsg+0x77c/0xc50 [tun] [613.021062] Read of size 4 at addr ffff8881da9ab7c0 by task vhost-1115/1155 [613.023073] [613.024003] CPU: 0 PID: 1155 Comm: vhost-1115 Not tainted 4.20.0-rc3-vm+ #232 [613.026029] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [613.029116] Call Trace: [613.031145] dump_stack+0x5b/0x90 [613.032219] print_address_description+0x6c/0x23c [613.034156] ? tun_sendmsg+0x77c/0xc50 [tun] [613.036141] kasan_report.cold.5+0x241/0x308 [613.038125] tun_sendmsg+0x77c/0xc50 [tun] [613.040109] ? tun_get_user+0x1960/0x1960 [tun] [613.042094] ? __isolate_free_page+0x270/0x270 [613.045173] vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net] [613.047127] ? peek_head_len.part.13+0x90/0x90 [vhost_net] [613.049096] ? get_tx_bufs+0x5a/0x2c0 [vhost_net] [613.051106] ? vhost_enable_notify+0x2d8/0x420 [vhost] [613.053139] handle_tx_copy+0x2d0/0x8f0 [vhost_net] [613.053139] ? vhost_net_buf_peek+0x340/0x340 [vhost_net] [613.053139] ? __mutex_lock+0x8d9/0xb30 [613.053139] ? finish_task_switch+0x8f/0x3f0 [613.053139] ? handle_tx+0x32/0x120 [vhost_net] [613.053139] ? mutex_trylock+0x110/0x110 [613.053139] ? finish_task_switch+0xcf/0x3f0 [613.053139] ? finish_task_switch+0x240/0x3f0 [613.053139] ? __switch_to_asm+0x34/0x70 [613.053139] ? __switch_to_asm+0x40/0x70 [613.053139] ? __schedule+0x506/0xf10 [613.053139] handle_tx+0xc7/0x120 [vhost_net] [613.053139] vhost_worker+0x166/0x200 [vhost] [613.053139] ? vhost_dev_init+0x580/0x580 [vhost] [613.053139] ? __kthread_parkme+0x77/0x90 [613.053139] ? vhost_dev_init+0x580/0x580 [vhost] [613.053139] kthread+0x1b1/0x1d0 [613.053139] ? kthread_park+0xb0/0xb0 [613.053139] ret_from_fork+0x35/0x40 [613.088705] [613.088705] Allocated by task 1155: [613.088705] kasan_kmalloc+0xbf/0xe0 [613.088705] kmem_cache_alloc+0xdc/0x220 [613.088705] __build_skb+0x2a/0x160 [613.088705] build_skb+0x14/0xc0 [613.088705] tun_sendmsg+0x4f0/0xc50 [tun] [613.088705] vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net] [613.088705] handle_tx_copy+0x2d0/0x8f0 [vhost_net] [613.088705] handle_tx+0xc7/0x120 [vhost_net] [613.088705] vhost_worker+0x166/0x200 [vhost] [613.088705] kthread+0x1b1/0x1d0 [613.088705] ret_from_fork+0x35/0x40 [613.088705] [613.088705] Freed by task 1155: [613.088705] __kasan_slab_free+0x12e/0x180 [613.088705] kmem_cache_free+0xa0/0x230 [613.088705] ip6_mc_input+0x40f/0x5a0 [613.088705] ipv6_rcv+0xc9/0x1e0 [613.088705] __netif_receive_skb_one_core+0xc1/0x100 [613.088705] netif_receive_skb_internal+0xc4/0x270 [613.088705] br_pass_frame_up+0x2b9/0x2e0 [613.088705] br_handle_frame_finish+0x2fb/0x7a0 [613.088705] br_handle_frame+0x30f/0x6c0 [613.088705] __netif_receive_skb_core+0x61a/0x15b0 [613.088705] __netif_receive_skb_one_core+0x8e/0x100 [613.088705] netif_receive_skb_internal+0xc4/0x270 [613.088705] tun_sendmsg+0x738/0xc50 [tun] [613.088705] vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net] [613.088705] handle_tx_copy+0x2d0/0x8f0 [vhost_net] [613.088705] handle_tx+0xc7/0x120 [vhost_net] [613.088705] vhost_worker+0x166/0x200 [vhost] [613.088705] kthread+0x1b1/0x1d0 [613.088705] ret_from_fork+0x35/0x40 [613.088705] [613.088705] The buggy address belongs to the object at ffff8881da9ab740 [613.088705] which belongs to the cache skbuff_head_cache of size 232 Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()") Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: 8139cp: fix a BUG triggered by changing mtu with network trafficSu Yanjun2018-12-031-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When changing mtu many times with traffic, a bug is triggered: [ 1035.684037] kernel BUG at lib/dynamic_queue_limits.c:26! [ 1035.684042] invalid opcode: 0000 [#1] SMP [ 1035.684049] Modules linked in: loop binfmt_misc 8139cp(OE) macsec tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag tcp_lp fuse uinput xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter devlink ip6_tables iptable_filter sunrpc snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep ppdev snd_seq iosf_mbi crc32_pclmul parport_pc snd_seq_device ghash_clmulni_intel parport snd_pcm aesni_intel joydev lrw snd_timer virtio_balloon sg gf128mul glue_helper ablk_helper cryptd snd soundcore i2c_piix4 pcspkr ip_tables xfs libcrc32c sr_mod sd_mod cdrom crc_t10dif crct10dif_generic ata_generic [ 1035.684102] pata_acpi virtio_console qxl drm_kms_helper syscopyarea sysfillrect sysimgblt floppy fb_sys_fops crct10dif_pclmul crct10dif_common ttm crc32c_intel serio_raw ata_piix drm libata 8139too virtio_pci drm_panel_orientation_quirks virtio_ring virtio mii dm_mirror dm_region_hash dm_log dm_mod [last unloaded: 8139cp] [ 1035.684132] CPU: 9 PID: 25140 Comm: if-mtu-change Kdump: loaded Tainted: G OE ------------ T 3.10.0-957.el7.x86_64 #1 [ 1035.684134] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 1035.684136] task: ffff8f59b1f5a080 ti: ffff8f5a2e32c000 task.ti: ffff8f5a2e32c000 [ 1035.684149] RIP: 0010:[<ffffffffba3a40d0>] [<ffffffffba3a40d0>] dql_completed+0x180/0x190 [ 1035.684162] RSP: 0000:ffff8f5a75483e50 EFLAGS: 00010093 [ 1035.684162] RAX: 00000000000000c2 RBX: ffff8f5a6f91c000 RCX: 0000000000000000 [ 1035.684162] RDX: 0000000000000000 RSI: 0000000000000184 RDI: ffff8f599fea3ec0 [ 1035.684162] RBP: ffff8f5a75483ea8 R08: 00000000000000c2 R09: 0000000000000000 [ 1035.684162] R10: 00000000000616ef R11: ffff8f5a75483b56 R12: ffff8f599fea3e00 [ 1035.684162] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000184 [ 1035.684162] FS: 00007fa8434de740(0000) GS:ffff8f5a75480000(0000) knlGS:0000000000000000 [ 1035.684162] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1035.684162] CR2: 00000000004305d0 CR3: 000000024eb66000 CR4: 00000000001406e0 [ 1035.684162] Call Trace: [ 1035.684162] <IRQ> [ 1035.684162] [<ffffffffc08cbaf8>] ? cp_interrupt+0x478/0x580 [8139cp] [ 1035.684162] [<ffffffffba14a294>] __handle_irq_event_percpu+0x44/0x1c0 [ 1035.684162] [<ffffffffba14a442>] handle_irq_event_percpu+0x32/0x80 [ 1035.684162] [<ffffffffba14a4cc>] handle_irq_event+0x3c/0x60 [ 1035.684162] [<ffffffffba14db29>] handle_fasteoi_irq+0x59/0x110 [ 1035.684162] [<ffffffffba02e554>] handle_irq+0xe4/0x1a0 [ 1035.684162] [<ffffffffba7795dd>] do_IRQ+0x4d/0xf0 [ 1035.684162] [<ffffffffba76b362>] common_interrupt+0x162/0x162 [ 1035.684162] <EOI> [ 1035.684162] [<ffffffffba0c2ae4>] ? __wake_up_bit+0x24/0x70 [ 1035.684162] [<ffffffffba1e46f5>] ? do_set_pte+0xd5/0x120 [ 1035.684162] [<ffffffffba1b64fb>] unlock_page+0x2b/0x30 [ 1035.684162] [<ffffffffba1e4879>] do_read_fault.isra.61+0x139/0x1b0 [ 1035.684162] [<ffffffffba1e9134>] handle_pte_fault+0x2f4/0xd10 [ 1035.684162] [<ffffffffba1ebc6d>] handle_mm_fault+0x39d/0x9b0 [ 1035.684162] [<ffffffffba76f5e3>] __do_page_fault+0x203/0x500 [ 1035.684162] [<ffffffffba76f9c6>] trace_do_page_fault+0x56/0x150 [ 1035.684162] [<ffffffffba76ef42>] do_async_page_fault+0x22/0xf0 [ 1035.684162] [<ffffffffba76b788>] async_page_fault+0x28/0x30 [ 1035.684162] Code: 54 c7 47 54 ff ff ff ff 44 0f 49 ce 48 8b 35 48 2f 9c 00 48 89 77 58 e9 fe fe ff ff 0f 1f 80 00 00 00 00 41 89 d1 e9 ef fe ff ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 55 8d 42 ff 48 [ 1035.684162] RIP [<ffffffffba3a40d0>] dql_completed+0x180/0x190 [ 1035.684162] RSP <ffff8f5a75483e50> It's not the same as in 7fe0ee09 patch described. As 8139cp uses shared irq mode, other device irq will trigger cp_interrupt to execute. cp_change_mtu -> cp_close -> cp_open In cp_close routine just before free_irq(), some interrupt may occur. In my environment, cp_interrupt exectutes and IntrStatus is 0x4, exactly TxOk. That will cause cp_tx to wake device queue. As device queue is started, cp_start_xmit and cp_open will run at same time which will cause kernel BUG. For example: [#] for tx descriptor At start: [#][#][#] num_queued=3 After cp_init_hw->cp_start_hw->netdev_reset_queue: [#][#][#] num_queued=0 When 8139cp starts to work then cp_tx will check num_queued mismatchs the complete_bytes. The patch will check IntrMask before check IntrStatus in cp_interrupt. When 8139cp interrupt is disabled, just return. Signed-off-by: Su Yanjun <suyj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: phy: don't allow __set_phy_supported to add unsupported modesHeiner Kallweit2018-12-031-11/+8
| | | | | | | | | | | | | | Currently __set_phy_supported allows to add modes w/o checking whether the PHY supports them. This is wrong, it should never add modes but only remove modes we don't want to support. The commit marked as fixed didn't do anything wrong, it just copied existing functionality to the helper which is being fixed now. Fixes: f3a6bd393c2c ("phylib: Add phy_set_max_speed helper") Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* tun: forbid iface creation with rtnl opsNicolas Dichtel2018-11-301-3/+3
| | | | | | | | | | | | | | | | | | | | It's not supported right now (the goal of the initial patch was to support 'ip link del' only). Before the patch: $ ip link add foo type tun [ 239.632660] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [snip] [ 239.636410] RIP: 0010:register_netdevice+0x8e/0x3a0 This panic occurs because dev->netdev_ops is not set by tun_setup(). But to have something usable, it will require more than just setting netdev_ops. Fixes: f019a7a594d9 ("tun: Implement ip link del tunXXX") CC: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* virtio-net: keep vnet header zeroed after processing XDPJason Wang2018-11-301-5/+9
| | | | | | | | | | | | | | | We copy vnet header unconditionally in page_to_skb() this is wrong since XDP may modify the packet data. So let's keep a zeroed vnet header for not confusing the conversion between vnet header and skb metadata. In the future, we should able to detect whether or not the packet was modified and keep using the vnet header when packet was not touched. Fixes: f600b6905015 ("virtio_net: Add XDP support") Reported-by: Pavel Popa <pashinho1990@gmail.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'tcp-fixes-in-timeout-and-retransmission-accounting'David S. Miller2018-11-302-6/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | Yuchung Cheng says: ==================== tcp: fixes in timeout and retransmission accounting This patch set has assorted fixes of minor accounting issues in timeout, window probe, and retransmission stats. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: fix SNMP TCP timeout under-estimationYuchung Cheng2018-11-301-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously the SNMP TCPTIMEOUTS counter has inconsistent accounting: 1. It counts all SYN and SYN-ACK timeouts 2. It counts timeouts in other states except recurring timeouts and timeouts after fast recovery or disorder state. Such selective accounting makes analysis difficult and complicated. For example the monitoring system needs to collect many other SNMP counters to infer the total amount of timeout events. This patch makes TCPTIMEOUTS counter simply counts all the retransmit timeout (SYN or data or FIN). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: fix SNMP under-estimation on failed retransmissionYuchung Cheng2018-11-301-1/+1
| | | | | | | | | | | | | | | | | | | | Previously the SNMP counter LINUX_MIB_TCPRETRANSFAIL is not counting the TSO/GSO properly on failed retransmission. This patch fixes that. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: fix off-by-one bug on aborting window-probing socketYuchung Cheng2018-11-301-1/+1
|/ | | | | | | | | | | Previously there is an off-by-one bug on determining when to abort a stalled window-probing socket. This patch fixes that so it is consistent with tcp_write_timeout(). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* liquidio: read sc->iq_no before release scPan Bian2018-11-301-1/+3
| | | | | | | | | | | The function lio_vf_rep_packet_sent_callback releases the occupation of sc via octeon_free_soft_command. sc should not be used after that. Unfortunately, sc->iq_no is read. To fix this, the patch stores sc->iq_no into a local variable before releasing sc and then uses the local variable instead of sc->iq_no. Signed-off-by: Pan Bian <bianpan2016@163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* mlx5: fix get_ip_proto()Cong Wang2018-11-301-3/+3
| | | | | | | | | | | | | | | IP header is not necessarily located right after struct ethhdr, there could be multiple 802.1Q headers in between, this is why we call __vlan_get_protocol(). Fixes: fe1dc069990c ("net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets") Cc: Alaa Hleihel <alaa@mellanox.com> Cc: Or Gerlitz <ogerlitz@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: dsa: Fix tagging attribute locationFlorian Fainelli2018-11-303-30/+34
| | | | | | | | | | | | | | | While introducing the DSA tagging protocol attribute, it was added to the DSA slave network devices, but those actually see untagged traffic (that is their whole purpose). Correct this mistake by putting the tagging sysfs attribute under the DSA master network device where this is the information that we need. While at it, also correct the sysfs documentation mistake that missed the "dsa/" directory component of the attribute. Fixes: 98cdb4807123 ("net: dsa: Expose tagging protocol to user-space") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/sched: act_police: fix memory leak in case of invalid control actionDavide Caratti2018-11-301-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | when users set an invalid control action, kmemleak complains as follows: # echo clear >/sys/kernel/debug/kmemleak # ./tdc.py -e b48b Test b48b: Add police action with exceed goto chain control action All test results: 1..1 ok 1 - b48b # Add police action with exceed goto chain control action about to flush the tap output if tests need to be skipped done flushing skipped test tap output # echo scan >/sys/kernel/debug/kmemleak # cat /sys/kernel/debug/kmemleak unreferenced object 0xffffa0fafbc3dde0 (size 96): comm "tc", pid 2358, jiffies 4294922738 (age 17.022s) hex dump (first 32 bytes): 2a 00 00 20 00 00 00 00 00 00 7d 00 00 00 00 00 *.. ......}..... f8 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000648803d2>] tcf_action_init_1+0x384/0x4c0 [<00000000cb69382e>] tcf_action_init+0x12b/0x1a0 [<00000000847ef0d4>] tcf_action_add+0x73/0x170 [<0000000093656e14>] tc_ctl_action+0x122/0x160 [<0000000023c98e32>] rtnetlink_rcv_msg+0x263/0x2d0 [<000000003493ae9c>] netlink_rcv_skb+0x4d/0x130 [<00000000de63f8ba>] netlink_unicast+0x209/0x2d0 [<00000000c3da0ebe>] netlink_sendmsg+0x2c1/0x3c0 [<000000007a9e0753>] sock_sendmsg+0x33/0x40 [<00000000457c6d2e>] ___sys_sendmsg+0x2a0/0x2f0 [<00000000c5c6a086>] __sys_sendmsg+0x5e/0xa0 [<00000000446eafce>] do_syscall_64+0x5b/0x180 [<000000004aa871f2>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [<00000000450c38ef>] 0xffffffffffffffff change tcf_police_init() to avoid leaking 'new' in case TCA_POLICE_RESULT contains TC_ACT_GOTO_CHAIN extended action. Fixes: c08f5ed5d625 ("net/sched: act_police: disallow 'goto chain' on fallback control action") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* nfp: flower: prevent offload if rhashtable insert failsJohn Hurley2018-11-301-5/+9
| | | | | | | | | | | | | | | For flow offload adds, if the rhash insert code fails, the flow will still have been offloaded but the reference to it in the driver freed. Re-order the offload setup calls to ensure that a flow will only be written to FW if a kernel reference is held and stored in the rhashtable. Remove this hashtable entry if the offload fails. Fixes: c01d0efa5136 ("nfp: flower: use rhashtable for flow caching") Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* nfp: flower: release metadata on offload failureJohn Hurley2018-11-301-2/+4
| | | | | | | | | | | | | | | | | Calling nfp_compile_flow_metadata both assigns a stats context and increments a ref counter on (or allocates) a mask id table entry. These are released by the nfp_modify_flow_metadata call on flow deletion, however, if a flow add fails after metadata is set then the flow entry will be deleted but the metadata assignments leaked. Add an error path to the flow add offload function to ensure allocated metadata is released in the event of an offload fail. Fixes: 81f3ddf2547d ("nfp: add control message passing capabilities to flower offloads") Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* bonding: fix 802.3ad state sent to partner when unbinding slaveToni Peltonen2018-11-301-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously when unbinding a slave the 802.3ad implementation only told partner that the port is not suitable for aggregation by setting the port aggregation state from aggregatable to individual. This is not enough. If the physical layer still stays up and we only unbinded this port from the bond there is nothing in the aggregation status alone to prevent the partner from sending traffic towards us. To ensure that the partner doesn't consider this port at all anymore we should also disable collecting and distributing to signal that this actor is going away. Also clear AD_STATE_SYNCHRONIZATION to ensure partner exits collecting + distributing state. I have tested this behaviour againts Arista EOS switches with mlx5 cards (physical link stays up even when interface is down) and simulated the same situation virtually Linux <-> Linux with two network namespaces running two veth device pairs. In both cases setting aggregation to individual doesn't alone prevent traffic from being to sent towards this port given that the link stays up in partners end. Partner still keeps it's end in collecting + distributing state and continues until timeout is reached. In most cases this means we are losing the traffic partner sends towards our port while we wait for timeout. This is most visible with slow periodic time (LACP rate slow). Other open source implementations like Open VSwitch and libreswitch, and vendor implementations like Arista EOS, seem to disable collecting + distributing to when doing similar port disabling/detaching/removing change. With this patch kernel implementation would behave the same way and ensure partner doesn't consider our actor viable anymore. Signed-off-by: Toni Peltonen <peltzi@peltzi.fi> Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com> Acked-by: Jonathan Toppins <jtoppins@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: aquantia: fix rx checksum offload bitsDmitry Bogdanov2018-11-301-1/+1
| | | | | | | | | | | | | | | | The last set of csum offload fixes had a leak: Checksum enabled status bits from rx descriptor were incorrectly interpreted. Consequently all the other valid logic worked on zero bits. That caused rx checksum offloads never to trigger. Tested by dumping rx descriptors and validating resulting csum_level. Reported-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: Dmitry Bogdanov <dmitry.bogdanov@aquantia.com> Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Fixes: ad703c2b9127f ("net: aquantia: invalid checksumm offload implementation") Signed-off-by: David S. Miller <davem@davemloft.net>
* openvswitch: fix spelling mistake "execeeds" -> "exceeds"Colin Ian King2018-11-301-1/+1
| | | | | | | | There is a spelling mistake in a net_warn_ratelimited message, fix this. Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* liquidio: fix spelling mistake "deferal" -> "deferral"Colin Ian King2018-11-301-1/+1
| | | | | | | There is a spelling mistake in the oct_stats_strings array, fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: stmmac: Move debugfs init/exit to ->probe()/->remove()Thierry Reding2018-11-301-10/+13
| | | | | | | | | | | | | | | | | | | | | | | Setting up and tearing down debugfs is current unbalanced, as seen by this error during resume from suspend: [ 752.134067] dwc-eth-dwmac 2490000.ethernet eth0: ERROR failed to create debugfs directory [ 752.134347] dwc-eth-dwmac 2490000.ethernet eth0: stmmac_hw_setup: failed debugFS registration The imbalance happens because the driver creates the debugfs hierarchy when the device is opened and tears it down when the device is closed. There's little gain in that, and it could be argued that it is even surprising because it's not usually done for other devices. Fix the imbalance by moving the debugfs creation and teardown to the driver's ->probe() and ->remove() implementations instead. Note that the ring descriptors cannot be read while the interface is down, so make sure to return an empty file when the descriptors_status debugfs file is read. Signed-off-by: Thierry Reding <treding@nvidia.com> Acked-by: Jose Abreu <joabreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: update frag_point when stream_interleave is setXin Long2018-11-302-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sctp_assoc_update_frag_point() should be called whenever asoc->pathmtu changes, but we missed one place in sctp_association_init(). It would cause frag_point is zero when sending data. As says in Jakub's reproducer, if sp->pathmtu is set by socketopt, the new asoc->pathmtu inherits it in sctp_association_init(). Later when transports are added and their pmtu >= asoc->pathmtu, it will never call sctp_assoc_update_frag_point() to set frag_point. This patch is to fix it by updating frag_point after asoc->pathmtu is set as sp->pathmtu in sctp_association_init(). Note that it moved them after sctp_stream_init(), as stream->si needs to be set first. Frag_point's calculation is also related with datachunk's type, so it needs to update frag_point when stream->si may be changed in sctp_process_init(). v1->v2: - call sctp_assoc_update_frag_point() separately in sctp_process_init and sctp_association_init, per Marcelo's suggestion. Fixes: 2f5e3c9df693 ("sctp: introduce sctp_assoc_update_frag_point") Reported-by: Jakub Audykowicz <jakub.audykowicz@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Prevent invalid access to skb->prev in __qdisc_drop_allChristoph Paasch2018-11-291-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __qdisc_drop_all() accesses skb->prev to get to the tail of the segment-list. With commit 68d2f84a1368 ("net: gro: properly remove skb from list") the skb-list handling has been changed to set skb->next to NULL and set the list-poison on skb->prev. With that change, __qdisc_drop_all() will panic when it tries to dereference skb->prev. Since commit 992cba7e276d ("net: Add and use skb_list_del_init().") __list_del_entry is used, leaving skb->prev unchanged (thus, pointing to the list-head if it's the first skb of the list). This will make __qdisc_drop_all modify the next-pointer of the list-head and result in a panic later on: [ 34.501053] general protection fault: 0000 [#1] SMP KASAN PTI [ 34.501968] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.20.0-rc2.mptcp #108 [ 34.502887] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011 [ 34.504074] RIP: 0010:dev_gro_receive+0x343/0x1f90 [ 34.504751] Code: e0 48 c1 e8 03 42 80 3c 30 00 0f 85 4a 1c 00 00 4d 8b 24 24 4c 39 65 d0 0f 84 0a 04 00 00 49 8d 7c 24 38 48 89 f8 48 c1 e8 03 <42> 0f b6 04 30 84 c0 74 08 3c 04 [ 34.507060] RSP: 0018:ffff8883af507930 EFLAGS: 00010202 [ 34.507761] RAX: 0000000000000007 RBX: ffff8883970b2c80 RCX: 1ffff11072e165a6 [ 34.508640] RDX: 1ffff11075867008 RSI: ffff8883ac338040 RDI: 0000000000000038 [ 34.509493] RBP: ffff8883af5079d0 R08: ffff8883970b2d40 R09: 0000000000000062 [ 34.510346] R10: 0000000000000034 R11: 0000000000000000 R12: 0000000000000000 [ 34.511215] R13: 0000000000000000 R14: dffffc0000000000 R15: ffff8883ac338008 [ 34.512082] FS: 0000000000000000(0000) GS:ffff8883af500000(0000) knlGS:0000000000000000 [ 34.513036] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 34.513741] CR2: 000055ccc3e9d020 CR3: 00000003abf32000 CR4: 00000000000006e0 [ 34.514593] Call Trace: [ 34.514893] <IRQ> [ 34.515157] napi_gro_receive+0x93/0x150 [ 34.515632] receive_buf+0x893/0x3700 [ 34.516094] ? __netif_receive_skb+0x1f/0x1a0 [ 34.516629] ? virtnet_probe+0x1b40/0x1b40 [ 34.517153] ? __stable_node_chain+0x4d0/0x850 [ 34.517684] ? kfree+0x9a/0x180 [ 34.518067] ? __kasan_slab_free+0x171/0x190 [ 34.518582] ? detach_buf+0x1df/0x650 [ 34.519061] ? lapic_next_event+0x5a/0x90 [ 34.519539] ? virtqueue_get_buf_ctx+0x280/0x7f0 [ 34.520093] virtnet_poll+0x2df/0xd60 [ 34.520533] ? receive_buf+0x3700/0x3700 [ 34.521027] ? qdisc_watchdog_schedule_ns+0xd5/0x140 [ 34.521631] ? htb_dequeue+0x1817/0x25f0 [ 34.522107] ? sch_direct_xmit+0x142/0xf30 [ 34.522595] ? virtqueue_napi_schedule+0x26/0x30 [ 34.523155] net_rx_action+0x2f6/0xc50 [ 34.523601] ? napi_complete_done+0x2f0/0x2f0 [ 34.524126] ? kasan_check_read+0x11/0x20 [ 34.524608] ? _raw_spin_lock+0x7d/0xd0 [ 34.525070] ? _raw_spin_lock_bh+0xd0/0xd0 [ 34.525563] ? kvm_guest_apic_eoi_write+0x6b/0x80 [ 34.526130] ? apic_ack_irq+0x9e/0xe0 [ 34.526567] __do_softirq+0x188/0x4b5 [ 34.527015] irq_exit+0x151/0x180 [ 34.527417] do_IRQ+0xdb/0x150 [ 34.527783] common_interrupt+0xf/0xf [ 34.528223] </IRQ> This patch makes sure that skb->prev is set to NULL when entering netem_enqueue. Cc: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Fixes: 68d2f84a1368 ("net: gro: properly remove skb from list") Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/x25: handle call collisionsMartin Schiller2018-11-291-0/+9
| | | | | | | | | | If a session in X25_STATE_1 (Awaiting Call Accept) receives a call request, the session will be closed (x25_disconnect), cause=0x01 (Number Busy) and diag=0x48 (Call Collision) will be set and a clear request will be send. Signed-off-by: Martin Schiller <ms@dev.tdt.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/x25: fix null_x25_address handlingMartin Schiller2018-11-291-6/+10
| | | | | | | | | | | o x25_find_listener(): the compare for the null_x25_address was wrong. We have to check the x25_addr of the listener socket instead of the x25_addr of the incomming call. o x25_bind(): it was not possible to bind a socket to null_x25_address Signed-off-by: Martin Schiller <ms@dev.tdt.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/x25: fix called/calling length calculation in x25_parse_address_blockMartin Schiller2018-11-291-1/+1
| | | | | | | | The length of the called and calling address was not calculated correctly (BCD encoding). Signed-off-by: Martin Schiller <ms@dev.tdt.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: phy: sfp: correct location of SFP standardsBaruch Siach2018-11-291-1/+1
| | | | | | | | | | | SFP standards are now available from the SNIA (Storage Networking Industry Association) website. Cc: Andrew Lunn <andrew@lunn.ch> Cc: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Baruch Siach <baruch@tkos.co.il> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'xps-fixes'David S. Miller2018-11-291-24/+29
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sabrina Dubroca says: ==================== fixes for XPS configuration after Symmetric queue selection This fixes some bugs introduced by the "Symmetric queue selection using XPS for Rx queues". First, the refactoring of the cleanup function skipped resetting the queue's NUMA node under some conditions. Second, the accounting on static keys for XPS and RXQS-XPS is unbalanced, so the static key for XPS won't actually disable itself, once enabled. The RXQS-XPS static key can actually be disabled by reconfiguring a device that didn't have RXQS-XPS configured at all. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: fix XPS static_key accountingSabrina Dubroca2018-11-291-21/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 04157469b7b8 ("net: Use static_key for XPS maps") introduced a static key for XPS, but the increments/decrements don't match. First, the static key's counter is incremented once for each queue, but only decremented once for a whole batch of queues, leading to large unbalances. Second, the xps_rxqs_needed key is decremented whenever we reset a batch of queues, whether they had any rxqs mapping or not, so that if we setup cpu-XPS on em1 and RXQS-XPS on em2, resetting the queues on em1 would decrement the xps_rxqs_needed key. This reworks the accounting scheme so that the xps_needed key is incremented only once for each type of XPS for all the queues on a device, and the xps_rxqs_needed key is incremented only once for all queues. This is sufficient to let us retrieve queues via get_xps_queue(). This patch introduces a new reset_xps_maps(), which reinitializes and frees the appropriate map (xps_rxqs_map or xps_cpus_map), and drops a reference to the needed keys: - both xps_needed and xps_rxqs_needed, in case of rxqs maps, - only xps_needed, in case of CPU maps. Now, we also need to call reset_xps_maps() at the end of __netif_set_xps_queue() when there's no active map left, for example when writing '00000000,00000000' to all queues' xps_rxqs setting. Fixes: 04157469b7b8 ("net: Use static_key for XPS maps") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: restore call to netdev_queue_numa_node_write when resetting XPSSabrina Dubroca2018-11-291-7/+9
|/ | | | | | | | | | | | Before commit 80d19669ecd3 ("net: Refactor XPS for CPUs and Rx queues"), netif_reset_xps_queues() did netdev_queue_numa_node_write() for all the queues being reset. Now, this is only done when the "active" variable in clean_xps_maps() is false, ie when on all the CPUs, there's no active XPS mapping left. Fixes: 80d19669ecd3 ("net: Refactor XPS for CPUs and Rx queues") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: phy: sfp: correct store of detected link modesBaruch Siach2018-11-291-1/+1
| | | | | | | | | | | The link modes that sfp_parse_support() detects are stored in the 'modes' bitmap. There is no reason to make an exception for 1000Base-PX or 1000Base-BX10. Fixes: 03145864bd0f ("sfp: support 1G BiDi (eg, FiberStore SFP-GE-BX) modules") Signed-off-by: Baruch Siach <baruch@tkos.co.il> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'ave-fixes'David S. Miller2018-11-292-10/+21
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | Kunihiko Hayashi says: ==================== fixup AVE ethernet driver This series adds fixup for AVE ethernet driver that includes increse of descriptors, replacing macro for linux-next, and adding missing author information. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: ethernet: ave: Add MODULE_AUTHOR and MAINTAINERS entryKunihiko Hayashi2018-11-292-0/+8
| | | | | | | | | | | | | | Add missing MODULE_AUTHOR of ave driver and an entry to MAINTAINERS. Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: ethernet: ave: Replace NET_IP_ALIGN with AVE_FRAME_HEADROOMKunihiko Hayashi2018-11-291-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | In commit 26a4676faa1a ("arm64: mm: define NET_IP_ALIGN to 0"), AVE controller affects this modification because the controller forces to ignore lower 2bits of buffer start address, and make 2-byte headroom, that is, data reception starts from (buffer + 2). This patch defines AVE_FRAME_HEADROOM macro as hardware-specific value, and replaces NET_IP_ALIGN with it. Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: ethernet: ave: Increase descriptors to improve performanceKunihiko Hayashi2018-11-291-4/+5
|/ | | | | | | | To improve performance, this increases Rx descriptor to 256, Tx descriptor to 64, and adjusts NAPI weight to NAPI_POLL_WEIGHT. Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2018-11-2868-261/+1312
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) ARM64 JIT fixes for subprog handling from Daniel Borkmann. 2) Various sparc64 JIT bug fixes (fused branch convergance, frame pointer usage detection logic, PSEODU call argument handling). 3) Fix to use BH locking in nf_conncount, from Taehee Yoo. 4) Fix race of TX skb freeing in ipheth driver, from Bernd Eckstein. 5) Handle return value of TX NAPI completion properly in lan743x driver, from Bryan Whitehead. 6) MAC filter deletion in i40e driver clears wrong state bit, from Lihong Yang. 7) Fix use after free in rionet driver, from Pan Bian. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (53 commits) s390/qeth: fix length check in SNMP processing net: hisilicon: remove unexpected free_netdev rapidio/rionet: do not free skb before reading its length i40e: fix kerneldoc for xsk methods ixgbe: recognize 1000BaseLX SFP modules as 1Gbps i40e: Fix deletion of MAC filters igb: fix uninitialized variables netfilter: nf_tables: deactivate expressions in rule replecement routine lan743x: Enable driver to work with LAN7431 tipc: fix lockdep warning during node delete lan743x: fix return value for lan743x_tx_napi_poll net: via: via-velocity: fix spelling mistake "alignement" -> "alignment" qed: fix spelling mistake "attnetion" -> "attention" net: thunderx: fix NULL pointer dereference in nic_remove sctp: increase sk_wmem_alloc when head->truesize is increased firestream: fix spelling mistake: "Inititing" -> "Initializing" net: phy: add workaround for issue where PHY driver doesn't bind to the device usbnet: ipheth: fix potential recvmsg bug and recvmsg bug 2 sparc: Adjust bpf JIT prologue for PSEUDO calls. bpf, doc: add entries of who looks over which jits ...
| * Merge branch '1GbE' of ↵David S. Miller2018-11-284-9/+12
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue Jeff Kirsher says: ==================== Intel Wired LAN Driver Fixes 2018-11-28 This series contains fixes to igb, ixgbe and i40e. Yunjian Wang from Huawei resolves a variable that could potentially be NULL before it is used. Lihong fixes an i40e issue which goes back to 4.17 kernels, where deleting any of the MAC filters was causing the incorrect syncing for the PF. Josh Elsasser caught that there were missing enum values in the link capabilities for x550 devices, which was preventing link for 1000BaseLX SFP modules. Jan fixes the function header comments for XSK methods. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * i40e: fix kerneldoc for xsk methodsJan Sokolowski2018-11-281-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One method, xsk_umem_setup, had an incorrect kernel doc description, which has been corrected. Also fixes small typos found in the comments. Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| | * ixgbe: recognize 1000BaseLX SFP modules as 1GbpsJosh Elsasser2018-11-281-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the two 1000BaseLX enum values to the X550's check for 1Gbps modules, allowing the core driver code to establish a link over this SFP type. This is done by the out-of-tree driver but the fix wasn't in mainline. Fixes: e23f33367882 ("ixgbe: Fix 1G and 10G link stability for X550EM_x SFP+”) Fixes: 6a14ee0cfb19 ("ixgbe: Add X550 support function pointers") Signed-off-by: Josh Elsasser <jelsasser@appneta.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| | * i40e: Fix deletion of MAC filtersLihong Yang2018-11-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In __i40e_del_filter function, the flag __I40E_MACVLAN_SYNC_PENDING for the PF state is wrongly set for the VSI. Deleting any of the MAC filters has caused the incorrect syncing for the PF. Fix it by setting this state flag to the intended PF. CC: stable <stable@vger.kernel.org> Signed-off-by: Lihong Yang <lihong.yang@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| | * igb: fix uninitialized variablesYunjian Wang2018-11-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | This patch fixes the variable 'phy_word' may be used uninitialized. Signed-off-by: Yunjian Wang <wangyunjian@huawei.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
| * | s390/qeth: fix length check in SNMP processingJulian Wiedmann2018-11-281-15/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The response for a SNMP request can consist of multiple parts, which the cmd callback stages into a kernel buffer until all parts have been received. If the callback detects that the staging buffer provides insufficient space, it bails out with error. This processing is buggy for the first part of the response - while it initially checks for a length of 'data_len', it later copies an additional amount of 'offsetof(struct qeth_snmp_cmd, data)' bytes. Fix the calculation of 'data_len' for the first part of the response. This also nicely cleans up the memcpy code. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2018-11-2823-107/+259
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Disable BH while holding list spinlock in nf_conncount, from Taehee Yoo. 2) List corruption in nf_conncount, also from Taehee. 3) Fix race that results in leaving around an empty list node in nf_conncount, from Taehee Yoo. 4) Proper chain handling for inactive chains from the commit path, from Florian Westphal. This includes a selftest for this. 5) Do duplicate rule handles when replacing rules, also from Florian. 6) Remove net_exit path in xt_RATEEST that results in splat, from Taehee. 7) Possible use-after-free in nft_compat when releasing extensions. From Florian. 8) Memory leak in xt_hashlimit, from Taehee. 9) Call ip_vs_dst_notifier after ipv6_dev_notf, from Xin Long. 10) Fix cttimeout with udplite and gre, from Florian. 11) Preserve oif for IPv6 link-local generated traffic from mangle table, from Alin Nastac. 12) Missing error handling in masquerade notifiers, from Taehee Yoo. 13) Use mutex to protect registration/unregistration of masquerade extensions in order to prevent a race, from Taehee. 14) Incorrect condition check in tree_nodes_free(), also from Taehee. 15) Fix chain counter leak in rule replacement path, from Taehee. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | netfilter: nf_tables: deactivate expressions in rule replecement routineTaehee Yoo2018-11-281-11/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no expression deactivation call from the rule replacement path, hence, chain counter is not decremented. A few steps to reproduce the problem: %nft add table ip filter %nft add chain ip filter c1 %nft add chain ip filter c1 %nft add rule ip filter c1 jump c2 %nft replace rule ip filter c1 handle 3 accept %nft flush ruleset <jump c2> expression means immediate NFT_JUMP to chain c2. Reference count of chain c2 is increased when the rule is added. When rule is deleted or replaced, the reference counter of c2 should be decreased via nft_rule_expr_deactivate() which calls nft_immediate_deactivate(). Splat looks like: [ 214.396453] WARNING: CPU: 1 PID: 21 at net/netfilter/nf_tables_api.c:1432 nf_tables_chain_destroy.isra.38+0x2f9/0x3a0 [nf_tables] [ 214.398983] Modules linked in: nf_tables nfnetlink [ 214.398983] CPU: 1 PID: 21 Comm: kworker/1:1 Not tainted 4.20.0-rc2+ #44 [ 214.398983] Workqueue: events nf_tables_trans_destroy_work [nf_tables] [ 214.398983] RIP: 0010:nf_tables_chain_destroy.isra.38+0x2f9/0x3a0 [nf_tables] [ 214.398983] Code: 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 8e 00 00 00 48 8b 7b 58 e8 e1 2c 4e c6 48 89 df e8 d9 2c 4e c6 eb 9a <0f> 0b eb 96 0f 0b e9 7e fe ff ff e8 a7 7e 4e c6 e9 a4 fe ff ff e8 [ 214.398983] RSP: 0018:ffff8881152874e8 EFLAGS: 00010202 [ 214.398983] RAX: 0000000000000001 RBX: ffff88810ef9fc28 RCX: ffff8881152876f0 [ 214.398983] RDX: dffffc0000000000 RSI: 1ffff11022a50ede RDI: ffff88810ef9fc78 [ 214.398983] RBP: 1ffff11022a50e9d R08: 0000000080000000 R09: 0000000000000000 [ 214.398983] R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff11022a50eba [ 214.398983] R13: ffff888114446e08 R14: ffff8881152876f0 R15: ffffed1022a50ed6 [ 214.398983] FS: 0000000000000000(0000) GS:ffff888116400000(0000) knlGS:0000000000000000 [ 214.398983] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 214.398983] CR2: 00007fab9bb5f868 CR3: 000000012aa16000 CR4: 00000000001006e0 [ 214.398983] Call Trace: [ 214.398983] ? nf_tables_table_destroy.isra.37+0x100/0x100 [nf_tables] [ 214.398983] ? __kasan_slab_free+0x145/0x180 [ 214.398983] ? nf_tables_trans_destroy_work+0x439/0x830 [nf_tables] [ 214.398983] ? kfree+0xdb/0x280 [ 214.398983] nf_tables_trans_destroy_work+0x5f5/0x830 [nf_tables] [ ... ] Fixes: bb7b40aecbf7 ("netfilter: nf_tables: bogus EBUSY in chain deletions") Reported by: Christoph Anton Mitterer <calestyo@scientia.net> Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=914505 Link: https://bugzilla.kernel.org/show_bug.cgi?id=201791 Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | netfilter: nf_conncount: remove wrong condition check routineTaehee Yoo2018-11-271-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All lists that reach the tree_nodes_free() function have both zero counter and true dead flag. The reason for this is that lists to be release are selected by nf_conncount_gc_list() which already decrements the list counter and sets on the dead flag. Therefore, this if statement in tree_nodes_free() is unnecessary and wrong. Fixes: 31568ec09ea0 ("netfilter: nf_conncount: fix list_del corruption in conn_free") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | netfilter: nat: fix double register in masquerade modulesTaehee Yoo2018-11-272-14/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a reference counter to ensure that masquerade modules register notifiers only once. However, the existing reference counter approach is not safe, test commands are: while : do modprobe ip6t_MASQUERADE & modprobe nft_masq_ipv6 & modprobe -rv ip6t_MASQUERADE & modprobe -rv nft_masq_ipv6 & done numbers below represent the reference counter. -------------------------------------------------------- CPU0 CPU1 CPU2 CPU3 CPU4 [insmod] [insmod] [rmmod] [rmmod] [insmod] -------------------------------------------------------- 0->1 register 1->2 returns 2->1 returns 1->0 0->1 register <-- unregister -------------------------------------------------------- The unregistation of CPU3 should be processed before the registration of CPU4. In order to fix this, use a mutex instead of reference counter. splat looks like: [ 323.869557] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:1381] [ 323.869574] Modules linked in: nf_tables(+) nf_nat_ipv6(-) nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 n] [ 323.869574] irq event stamp: 194074 [ 323.898930] hardirqs last enabled at (194073): [<ffffffff90004a0d>] trace_hardirqs_on_thunk+0x1a/0x1c [ 323.898930] hardirqs last disabled at (194074): [<ffffffff90004a29>] trace_hardirqs_off_thunk+0x1a/0x1c [ 323.898930] softirqs last enabled at (182132): [<ffffffff922006ec>] __do_softirq+0x6ec/0xa3b [ 323.898930] softirqs last disabled at (182109): [<ffffffff90193426>] irq_exit+0x1a6/0x1e0 [ 323.898930] CPU: 0 PID: 1381 Comm: modprobe Not tainted 4.20.0-rc2+ #27 [ 323.898930] RIP: 0010:raw_notifier_chain_register+0xea/0x240 [ 323.898930] Code: 3c 03 0f 8e f2 00 00 00 44 3b 6b 10 7f 4d 49 bc 00 00 00 00 00 fc ff df eb 22 48 8d 7b 10 488 [ 323.898930] RSP: 0018:ffff888101597218 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13 [ 323.898930] RAX: 0000000000000000 RBX: ffffffffc04361c0 RCX: 0000000000000000 [ 323.898930] RDX: 1ffffffff26132ae RSI: ffffffffc04aa3c0 RDI: ffffffffc04361d0 [ 323.898930] RBP: ffffffffc04361c8 R08: 0000000000000000 R09: 0000000000000001 [ 323.898930] R10: ffff8881015972b0 R11: fffffbfff26132c4 R12: dffffc0000000000 [ 323.898930] R13: 0000000000000000 R14: 1ffff110202b2e44 R15: ffffffffc04aa3c0 [ 323.898930] FS: 00007f813ed41540(0000) GS:ffff88811ae00000(0000) knlGS:0000000000000000 [ 323.898930] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 323.898930] CR2: 0000559bf2c9f120 CR3: 000000010bc80000 CR4: 00000000001006f0 [ 323.898930] Call Trace: [ 323.898930] ? atomic_notifier_chain_register+0x2d0/0x2d0 [ 323.898930] ? down_read+0x150/0x150 [ 323.898930] ? sched_clock_cpu+0x126/0x170 [ 323.898930] ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables] [ 323.898930] ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables] [ 323.898930] register_netdevice_notifier+0xbb/0x790 [ 323.898930] ? __dev_close_many+0x2d0/0x2d0 [ 323.898930] ? __mutex_unlock_slowpath+0x17f/0x740 [ 323.898930] ? wait_for_completion+0x710/0x710 [ 323.898930] ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables] [ 323.898930] ? up_write+0x6c/0x210 [ 323.898930] ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables] [ 324.127073] ? nf_tables_core_module_init+0xe4/0xe4 [nf_tables] [ 324.127073] nft_chain_filter_init+0x1e/0xe8a [nf_tables] [ 324.127073] nf_tables_module_init+0x37/0x92 [nf_tables] [ ... ] Fixes: 8dd33cc93ec9 ("netfilter: nf_nat: generalize IPv4 masquerading support for nf_tables") Fixes: be6b635cd674 ("netfilter: nf_nat: generalize IPv6 masquerading support for nf_tables") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | netfilter: add missing error handling code for register functionsTaehee Yoo2018-11-279-22/+63
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | register_{netdevice/inetaddr/inet6addr}_notifier may return an error value, this patch adds the code to handle these error paths. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | netfilter: ipv6: Preserve link scope traffic original oifAlin Nastac2018-11-271-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When ip6_route_me_harder is invoked, it resets outgoing interface of: - link-local scoped packets sent by neighbor discovery - multicast packets sent by MLD host - multicast packets send by MLD proxy daemon that sets outgoing interface through IPV6_PKTINFO ipi6_ifindex Link-local and multicast packets must keep their original oif after ip6_route_me_harder is called. Signed-off-by: Alin Nastac <alin.nastac@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | netfilter: nfnetlink_cttimeout: fetch timeouts for udplite and gre, tooFlorian Westphal2018-11-263-14/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot was able to trigger the WARN in cttimeout_default_get() by passing UDPLITE as l4protocol. Alias UDPLITE to UDP, both use same timeout values. Furthermore, also fetch GRE timeouts. GRE is a bit more complicated, as it still can be a module and its netns_proto_gre struct layout isn't visible outside of the gre module. Can't move timeouts around, it appears conntrack sysctl unregister assumes net_generic() returns nf_proto_net, so we get crash. Expose layout of netns_proto_gre instead. A followup nf-next patch could make gre tracker be built-in as well if needed, its not that large. Last, make the WARN() mention the missing protocol value in case anything else is missing. Reported-by: syzbot+2fae8fa157dd92618cae@syzkaller.appspotmail.com Fixes: 8866df9264a3 ("netfilter: nfnetlink_cttimeout: pass default timeout policy to obj_to_nlattr") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | ipvs: call ip_vs_dst_notifier earlier than ipv6_dev_notfXin Long2018-11-261-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ip_vs_dst_event is supposed to clean up all dst used in ipvs' destinations when a net dev is going down. But it works only when the dst's dev is the same as the dev from the event. Now with the same priority but late registration, ip_vs_dst_notifier is always called later than ipv6_dev_notf where the dst's dev is set to lo for NETDEV_DOWN event. As the dst's dev lo is not the same as the dev from the event in ip_vs_dst_event, ip_vs_dst_notifier doesn't actually work. Also as these dst have to wait for dest_trash_timer to clean them up. It would cause some non-permanent kernel warnings: unregister_netdevice: waiting for br0 to become free. Usage count = 3 To fix it, call ip_vs_dst_notifier earlier than ipv6_dev_notf by increasing its priority to ADDRCONF_NOTIFY_PRIORITY + 5. Note that for ipv4 route fib_netdev_notifier doesn't set dst's dev to lo in NETDEV_DOWN event, so this fix is only needed when IP_VS_IPV6 is defined. Fixes: 7a4f0761fce3 ("IPVS: init and cleanup restructuring") Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>