summaryrefslogtreecommitdiffstats
path: root/net/core
Commit message (Collapse)AuthorAgeFilesLines
* net: add skb frag size accessorsEric Dumazet2011-10-195-53/+57
| | | | | | | | | | | To ease skb->truesize sanitization, its better to be able to localize all references to skb frags size. Define accessors : skb_frag_size() to fetch frag size, and skb_frag_size_{set|add|sub}() to manipulate it. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: allow vlan traffic to be received under bondJohn Fastabend2011-10-181-11/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following configuration used to work as I expected. At least we could use the fcoe interfaces to do MPIO and the bond0 iface to do load balancing or failover. ---eth2.228-fcoe | eth2 -----| | |---- bond0 | eth3 -----| | ---eth3.228-fcoe This worked because of a change we added to allow inactive slaves to rx 'exact' matches. This functionality was kept intact with the rx_handler mechanism. However now the vlan interface attached to the active slave never receives traffic because the bonding rx_handler updates the skb->dev and goto's another_round. Previously, the vlan_do_receive() logic was called before the bonding rx_handler. Now by the time vlan_do_receive calls vlan_find_dev() the skb->dev is set to bond0 and it is clear no vlan is attached to this iface. The vlan lookup fails. This patch moves the VLAN check above the rx_handler. A VLAN tagged frame is now routed to the eth2.228-fcoe iface in the above schematic. Untagged frames continue to the bond0 as normal. This case also remains intact, eth2 --> bond0 --> vlan.228 Here the skb is VLAN tagged but the vlan lookup fails on eth2 causing the bonding rx_handler to be called. On the second pass the vlan lookup is on the bond0 iface and completes as expected. Putting a VLAN.228 on both the bond0 and eth2 device will result in eth2.228 receiving the skb. I don't think this is completely unexpected and was the result prior to the rx_handler result. Note, the same setup is also used for other storage traffic that MPIO is used with eg. iSCSI and similar setups can be contrived without storage protocols. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Jesse Gross <jesse@nicira.com> Reviewed-by: Jiri Pirko <jpirko@redhat.com> Tested-by: Hans Schillstrom <hams.schillstrom@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/flow: Fix potential memory leakhuajun li2011-10-171-1/+13
| | | | | | | | While preparing net flow caches, once a fail may cause potential memory leak , fix it. Signed-off-by: Huajun Li <huajun.li.lee@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* if_link: Add additional parameter to IFLA_VF_INFO for spoof checkingGreg Rose2011-10-161-3/+30
| | | | | | | | | | | | | | | | | Add configuration setting for drivers to turn spoof checking on or off for discrete VFs. v2 - Fix indentation problem, wrap the ifla_vf_info structure in #ifdef __KERNEL__ to prevent user space from accessing and change function paramater for the spoof check setting netdev op from u8 to bool. v3 - Preset spoof check setting to -1 so that user space tools such as ip can detect that the driver didn't report a spoofcheck setting. Prevents incorrect display of spoof check settings for drivers that don't report it. Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
* net: more accurate skb truesizeEric Dumazet2011-10-132-5/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | skb truesize currently accounts for sk_buff struct and part of skb head. kmalloc() roundings are also ignored. Considering that skb_shared_info is larger than sk_buff, its time to take it into account for better memory accounting. This patch introduces SKB_TRUESIZE(X) macro to centralize various assumptions into a single place. At skb alloc phase, we put skb_shared_info struct at the exact end of skb head, to allow a better use of memory (lowering number of reallocations), since kmalloc() gives us power-of-two memory blocks. Unless SLUB/SLUB debug is active, both skb->head and skb_shared_info are aligned to cache lines, as before. Note: This patch might trigger performance regressions because of misconfigured protocol stacks, hitting per socket or global memory limits that were previously not reached. But its a necessary step for a more accurate memory accounting. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Andi Kleen <ak@linux.intel.com> CC: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: use sock_valbool_flag to set/clear SOCK_RXQ_OVFLJohannes Berg2011-10-071-4/+1
| | | | | | | | There's no point in open-coding sock_valbool_flag(). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* RPS: Ensure that an expired hardware filter can be re-added laterBen Hutchings2011-10-031-6/+3
| | | | | | | | | | | | | | | | | Amir Vadai wrote: > When a stream is paused, and its rule is expired while it is paused, > no new rule will be configured to the HW when traffic resume. [...] > - When stream was resumed, traffic was steered again by RSS, and > because current-cpu was equal to desired-cpu, ndo_rx_flow_steer > wasn't called and no rule was configured to the HW. Fix this by setting the flow's current CPU only in the table for the newly selected RX queue. Reported-and-tested-by: Amir Vadai <amirv@dev.mellanox.co.il> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: rps: fix the support for PPPOEChangli Gao2011-09-281-1/+11
| | | | | | | | The upper protocol numbers of PPPOE are different, and should be treated specially. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* af_unix: dont send SCM_CREDENTIALS by defaultEric Dumazet2011-09-281-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 7361c36c5224 (af_unix: Allow credentials to work across user and pid namespaces) af_unix performance dropped a lot. This is because we now take a reference on pid and cred in each write(), and release them in read(), usually done from another process, eventually from another cpu. This triggers false sharing. # Events: 154K cycles # # Overhead Command Shared Object Symbol # ........ ....... .................. ......................... # 10.40% hackbench [kernel.kallsyms] [k] put_pid 8.60% hackbench [kernel.kallsyms] [k] unix_stream_recvmsg 7.87% hackbench [kernel.kallsyms] [k] unix_stream_sendmsg 6.11% hackbench [kernel.kallsyms] [k] do_raw_spin_lock 4.95% hackbench [kernel.kallsyms] [k] unix_scm_to_skb 4.87% hackbench [kernel.kallsyms] [k] pid_nr_ns 4.34% hackbench [kernel.kallsyms] [k] cred_to_ucred 2.39% hackbench [kernel.kallsyms] [k] unix_destruct_scm 2.24% hackbench [kernel.kallsyms] [k] sub_preempt_count 1.75% hackbench [kernel.kallsyms] [k] fget_light 1.51% hackbench [kernel.kallsyms] [k] __mutex_lock_interruptible_slowpath 1.42% hackbench [kernel.kallsyms] [k] sock_alloc_send_pskb This patch includes SCM_CREDENTIALS information in a af_unix message/skb only if requested by the sender, [man 7 unix for details how to include ancillary data using sendmsg() system call] Note: This might break buggy applications that expected SCM_CREDENTIAL from an unaware write() system call, and receiver not using SO_PASSCRED socket option. If SOCK_PASSCRED is set on source or destination socket, we still include credentials for mere write() syscalls. Performance boost in hackbench : more than 50% gain on a 16 thread machine (2 quad-core cpus, 2 threads per core) hackbench 20 thread 2000 4.228 sec instead of 9.102 sec Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of github.com:davem330/netDavid S. Miller2011-09-226-25/+57
|\ | | | | | | | | | | | | | | | | | | | | | | Conflicts: MAINTAINERS drivers/net/Kconfig drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c drivers/net/ethernet/broadcom/tg3.c drivers/net/wireless/iwlwifi/iwl-pci.c drivers/net/wireless/iwlwifi/iwl-trans-tx-pcie.c drivers/net/wireless/rt2x00/rt2800usb.c drivers/net/wireless/wl12xx/main.c
| * fib:fix BUG_ON in fib_nl_newrule when add new fib ruleGao feng2011-09-211-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | add new fib rule can cause BUG_ON happen the reproduce shell is ip rule add pref 38 ip rule add pref 38 ip rule add to 192.168.3.0/24 goto 38 ip rule del pref 38 ip rule add to 192.168.3.0/24 goto 38 ip rule add pref 38 then the BUG_ON will happen del BUG_ON and use (ctarget == NULL) identify whether this rule is unresolved Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Handle different key sizes between address families in flow cachedpward2011-09-161-14/+17
| | | | | | | | | | | | | | | | With the conversion of struct flowi to a union of AF-specific structs, some operations on the flow cache need to account for the exact size of the key. Signed-off-by: David Ward <david.ward@ll.mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: copy userspace buffers on device forwardingMichael S. Tsirkin2011-09-152-5/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dev_forward_skb loops an skb back into host networking stack which might hang on the memory indefinitely. In particular, this can happen in macvtap in bridged mode. Copy the userspace fragments to avoid blocking the sender in that case. As this patch makes skb_copy_ubufs extern now, I also added some documentation and made it clear the SKBTX_DEV_ZEROCOPY flag automatically instead of doing it in all callers. This can be made into a separate patch if people feel it's worth it. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Make flow cache namespace-awaredpward2011-09-151-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | flow_cache_lookup will return a cached object (or null pointer) that the resolver (i.e. xfrm_policy_lookup) previously found for another namespace using the same key/family/dir. Instead, make the namespace part of what identifies entries in the cache. As before, flow_entry_valid will return 0 for entries where the namespace has been deleted, and they will be removed from the cache the next time flow_cache_gc_task is run. Reported-by: Andrew Dickinson <whydna@whydna.net> Signed-off-by: David Ward <david.ward@ll.mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
| * netpoll: fix incorrect access to skb data in __netpoll_rxEric Dumazet2011-08-261-1/+3
| | | | | | | | | | | | | | | | | | | | __netpoll_rx() doesnt properly handle skbs with small header pskb_may_pull() or pskb_trim_rcsum() can change skb->data, we must reload it. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * arp: fix rcu lockdep splat in arp_process()Eric Dumazet2011-08-241-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dave Jones reported a lockdep splat triggered by an arp_process() call from parp_redo(). Commit faa9dcf793be (arp: RCU changes) is the origin of the bug, since it assumed arp_process() was called under rcu_read_lock(), which is not true in this particular path. Instead of adding rcu_read_lock() in parp_redo(), I chose to add it in neigh_proxy_process() to take care of IPv6 side too. =================================================== [ INFO: suspicious rcu_dereference_check() usage. ] --------------------------------------------------- include/linux/inetdevice.h:209 invoked rcu_dereference_check() without protection! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 4 locks held by setfiles/2123: #0: (&sb->s_type->i_mutex_key#13){+.+.+.}, at: [<ffffffff8114cbc4>] walk_component+0x1ef/0x3e8 #1: (&isec->lock){+.+.+.}, at: [<ffffffff81204bca>] inode_doinit_with_dentry+0x3f/0x41f #2: (&tbl->proxy_timer){+.-...}, at: [<ffffffff8106a803>] run_timer_softirq+0x157/0x372 #3: (class){+.-...}, at: [<ffffffff8141f256>] neigh_proxy_process +0x36/0x103 stack backtrace: Pid: 2123, comm: setfiles Tainted: G W 3.1.0-0.rc2.git7.2.fc16.x86_64 #1 Call Trace: <IRQ> [<ffffffff8108ca23>] lockdep_rcu_dereference+0xa7/0xaf [<ffffffff8146a0b7>] __in_dev_get_rcu+0x55/0x5d [<ffffffff8146a751>] arp_process+0x25/0x4d7 [<ffffffff8146ac11>] parp_redo+0xe/0x10 [<ffffffff8141f2ba>] neigh_proxy_process+0x9a/0x103 [<ffffffff8106a8c4>] run_timer_softirq+0x218/0x372 [<ffffffff8106a803>] ? run_timer_softirq+0x157/0x372 [<ffffffff8141f220>] ? neigh_stat_seq_open+0x41/0x41 [<ffffffff8108f2f0>] ? mark_held_locks+0x6d/0x95 [<ffffffff81062bb6>] __do_softirq+0x112/0x25a [<ffffffff8150d27c>] call_softirq+0x1c/0x30 [<ffffffff81010bf5>] do_softirq+0x4b/0xa2 [<ffffffff81062f65>] irq_exit+0x5d/0xcf [<ffffffff8150dc11>] smp_apic_timer_interrupt+0x7c/0x8a [<ffffffff8150baf3>] apic_timer_interrupt+0x73/0x80 <EOI> [<ffffffff8108f439>] ? trace_hardirqs_on_caller+0x121/0x158 [<ffffffff814fc285>] ? __slab_free+0x30/0x24c [<ffffffff814fc283>] ? __slab_free+0x2e/0x24c [<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f [<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f [<ffffffff81204e74>] ? inode_doinit_with_dentry+0x2e9/0x41f [<ffffffff81130cb0>] kfree+0x108/0x131 [<ffffffff81204e74>] inode_doinit_with_dentry+0x2e9/0x41f [<ffffffff81204fc6>] selinux_d_instantiate+0x1c/0x1e [<ffffffff81200f4f>] security_d_instantiate+0x21/0x23 [<ffffffff81154625>] d_instantiate+0x5c/0x61 [<ffffffff811563ca>] d_splice_alias+0xbc/0xd2 [<ffffffff811b17ff>] ext4_lookup+0xba/0xeb [<ffffffff8114bf1e>] d_alloc_and_lookup+0x45/0x6b [<ffffffff8114cbea>] walk_component+0x215/0x3e8 [<ffffffff8114cdf8>] lookup_last+0x3b/0x3d [<ffffffff8114daf3>] path_lookupat+0x82/0x2af [<ffffffff8110fc53>] ? might_fault+0xa5/0xac [<ffffffff8110fc0a>] ? might_fault+0x5c/0xac [<ffffffff8114c564>] ? getname_flags+0x31/0x1ca [<ffffffff8114dd48>] do_path_lookup+0x28/0x97 [<ffffffff8114df2c>] user_path_at+0x59/0x96 [<ffffffff811467ad>] ? cp_new_stat+0xf7/0x10d [<ffffffff811469a6>] vfs_fstatat+0x44/0x6e [<ffffffff811469ee>] vfs_lstat+0x1e/0x20 [<ffffffff81146b3d>] sys_newlstat+0x1a/0x33 [<ffffffff8108f439>] ? trace_hardirqs_on_caller+0x121/0x158 [<ffffffff812535fe>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff8150af82>] system_call_fastpath+0x16/0x1b Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: consolidate and fix ethtool_ops->get_settings callingJiri Pirko2011-09-153-31/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch does several things: - introduces __ethtool_get_settings which is called from ethtool code and from drivers as well. Put ASSERT_RTNL there. - dev_ethtool_get_settings() is replaced by __ethtool_get_settings() - changes calling in drivers so rtnl locking is respected. In iboe_get_rate was previously ->get_settings() called unlocked. This fixes it. Also prb_calc_retire_blk_tmo() in af_packet.c had the same problem. Also fixed by calling __dev_get_by_index() instead of dev_get_by_index() and holding rtnl_lock for both calls. - introduces rtnl_lock in bnx2fc_vport_create() and fcoe_vport_create() so bnx2fc_if_create() and fcoe_if_create() are called locked as they are from other places. - use __ethtool_get_settings() in bonding code Signed-off-by: Jiri Pirko <jpirko@redhat.com> v2->v3: -removed dev_ethtool_get_settings() -added ASSERT_RTNL into __ethtool_get_settings() -prb_calc_retire_blk_tmo - use __dev_get_by_index() and lock around it and __ethtool_get_settings() call v1->v2: add missing export_symbol Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> [except FCoE bits] Acked-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: linkwatch: allow vlans to get carrier changes fasterEric Dumazet2011-09-151-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a time-lag of IFF_RUNNING flag consistency between vlan and real devices when the real devices are in problem such as link or cable broken. This leads to a degradation of Availability such as a delay of failover in HA systems using vlan since the detection of the problem at real device is delayed. We can avoid the linkwatch delay (~1 sec) for devices linked to another ones, since delay is already done for the realdev. Based on a previous patch from Mitsuo Hayasaka Reported-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Patrick McHardy <kaber@trash.net> Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl> Cc: Tom Herbert <therbert@google.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Jesse Gross <jesse@nicira.com> Tested-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: convert core to skb paged frag APIsIan Campbell2011-08-247-36/+36
| | | | | | | | | | | | | | | | | | Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
* | rps: support IPIP encapsulationEric Dumazet2011-08-241-0/+2
| | | | | | | | | | | | | | | | | | | | | | Skip IPIP header to get proper layer-4 information. Like GRE tunnels, this only works if rxhash is not already provided by the device itself (ethtool -K ethX rxhash off), to allow kernel compute a software rxhash. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: vlan: goto another_round instead of calling __netif_receive_skbJiri Pirko2011-08-221-4/+3
| | | | | | | | | | | | | | | | | | Now, when vlan tag on untagged in non-accelerated path is stripped from skb, headers are reset right away. Benefit from that and avoid calling __netif_receive_skb recursivelly and just use another_round. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Preserve ooo_okay when copying skb headerChangli Gao2011-08-201-0/+1
| | | | | | | | | | Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller2011-08-201-1/+1
|\|
| * scm: Capture the full credentials of the scm senderTim Chen2011-08-111-1/+1
| | | | | | | | | | | | | | | | | | | | This patch corrects an erroneous update of credential's gid with uid introduced in commit 257b5358b32f17 since 2.6.36. Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Reviewed-by: James Morris <jmorris@namei.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: rps: support PPPOE session messagesChangli Gao2011-08-181-0/+8
| | | | | | | | | | | | | | | | Inspect the payload of PPPOE session messages for the 4 tuples to generate skb->rxhash. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: rps: support 802.1QChangli Gao2011-08-181-0/+8
| | | | | | | | | | | | | | | | | | | | | | For the 802.1Q packets, if the NIC doesn't support hw-accel-vlan-rx, RPS won't inspect the internal 4 tuples to generate skb->rxhash, so this kind of traffic can't get any benefit from RPS. This patch adds the support for 802.1Q to RPS. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: remove ndo_set_multicast_list callbackJiri Pirko2011-08-172-6/+4
| | | | | | | | | | | | | | Remove no longer used operation. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: introduce IFF_UNICAST_FLT private flagJiri Pirko2011-08-171-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Use IFF_UNICAST_FTL to find out if driver handles unicast address filtering. In case it does not, promisc mode is entered. Patch also fixes following drivers: stmmac, niu: support uc filtering and yet it propagated ndo_set_multicast_list bna, benet, pxa168_eth, ks8851, ks8851_mll, ksz884x : has set ndo_set_rx_mode but do not support uc filtering Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rps: Inspect GRE encapsulated packets to get flow hashTom Herbert2011-08-171-0/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Crack open GRE packets in __skb_get_rxhash to compute 4-tuple hash on in encapsulated packet. Note that this is used only when the __skb_get_rxhash is taken, in particular only when the device does not compute provide the rxhash (ie. feature is disabled). This was tested by creating a single GRE tunnel between two 16 core AMD machines. 200 netperf TCP_RR streams were ran with 1 byte request and response size. Without patch: 157497 tps, 50/90/99% latencies 1250/1292/1364 usecs With patch: 325896 tps, 50/90/99% latencies 603/848/1169 Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rps: Infrastructure in __skb_get_rxhash for deep inspectionTom Herbert2011-08-171-0/+7
| | | | | | | | | | | | | | Basics for looking for ports in encapsulated packets in tunnels. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rps: Add flag to skb to indicate rxhash is based on L4 tupleTom Herbert2011-08-172-4/+7
| | | | | | | | | | | | | | | | | | | | | | The l4_rxhash flag was added to the skb structure to indicate that the rxhash value was computed over the 4 tuple for the packet which includes the port information in the encapsulated transport packet. This is used by the stack to preserve the rxhash value in __skb_rx_tunnel. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rps: Some minor cleanup in get_rps_cpusTom Herbert2011-08-171-5/+7
| | | | | | | | | | | | | | Use some variables for clarity and extensibility. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: cleanup some rcu_dereference_rawEric Dumazet2011-08-123-8/+8
| | | | | | | | | | | | | | | | | | RCU api had been completed and rcu_access_pointer() or rcu_dereference_protected() are better than generic rcu_dereference_raw() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | neigh: reduce arp latencyEric Dumazet2011-08-121-14/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the artificial HZ latency on arp resolution. Instead of firing a timer in one jiffy (up to 10 ms if HZ=100), lets send the ARP message immediately. Before patch : # arp -d 192.168.20.108 ; ping -c 3 192.168.20.108 PING 192.168.20.108 (192.168.20.108) 56(84) bytes of data. 64 bytes from 192.168.20.108: icmp_seq=1 ttl=64 time=9.91 ms 64 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.065 ms 64 bytes from 192.168.20.108: icmp_seq=3 ttl=64 time=0.061 ms After patch : $ arp -d 192.168.20.108 ; ping -c 3 192.168.20.108 PING 192.168.20.108 (192.168.20.108) 56(84) bytes of data. 64 bytes from 192.168.20.108: icmp_seq=1 ttl=64 time=0.152 ms 64 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.064 ms 64 bytes from 192.168.20.108: icmp_seq=3 ttl=64 time=0.074 ms Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rtnetlink: remove initialization of dev->real_num_tx_queuesJiri Pirko2011-08-111-1/+0
| | | | | | | | | | | | | | dev->real_num_tx_queues is correctly set already in alloc_netdev_mqs. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: fix potential neighbour race in dst_ifdown()Eric Dumazet2011-08-091-5/+10
| | | | | | | | | | | | | | | | | | | | | | | | Followup of commit f2c31e32b378a (fix NULL dereferences in check_peer_redir()). We need to make sure dst neighbour doesnt change in dst_ifdown(). Fix some sparse errors. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller2011-08-073-3/+200
|\|
| * net: Compute protocol sequence numbers and fragment IDs using MD5.David S. Miller2011-08-062-1/+185
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Computers have become a lot faster since we compromised on the partial MD4 hash which we use currently for performance reasons. MD5 is a much safer choice, and is inline with both RFC1948 and other ISS generators (OpenBSD, Solaris, etc.) Furthermore, only having 24-bits of the sequence number be truly unpredictable is a very serious limitation. So the periodic regeneration and 8-bit counter have been removed. We compute and use a full 32-bit sequence number. For ipv6, DCCP was found to use a 32-bit truncated initial sequence number (it needs 43-bits) and that is fixed here as well. Reported-by: Dan Kaminsky <dan@doxpara.com> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: add kerneldoc to skb_copy_bits()Eric Dumazet2011-08-011-2/+15
| | | | | | | | | | | | | | | | Since skb_copy_bits() is called from assembly, add a fat comment to make clear we should think twice before changing its prototype. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rcu: convert uses of rcu_assign_pointer(x, NULL) to RCU_INIT_POINTERStephen Hemminger2011-08-026-10/+10
|/ | | | | | | | | | | | | | | | | | | | When assigning a NULL value to an RCU protected pointer, no barrier is needed. The rcu_assign_pointer, used to handle that but will soon change to not handle the special case. Convert all rcu_assign_pointer of NULL value. //smpl @@ expression P; @@ - rcu_assign_pointer(P, NULL) + RCU_INIT_POINTER(P, NULL) // </smpl> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2011-07-282-5/+7
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (32 commits) tg3: Remove 5719 jumbo frames and TSO blocks tg3: Break larger frags into 4k chunks for 5719 tg3: Add tx BD budgeting code tg3: Consolidate code that calls tg3_tx_set_bd() tg3: Add partial fragment unmapping code tg3: Generalize tg3_skb_error_unmap() tg3: Remove short DMA check for 1st fragment tg3: Simplify tx bd assignments tg3: Reintroduce tg3_tx_ring_info ASIX: Use only 11 bits of header for data size ASIX: Simplify condition in rx_fixup() Fix cdc-phonet build bonding: reduce noise during init bonding: fix string comparison errors net: Audit drivers to identify those needing IFF_TX_SKB_SHARING cleared net: add IFF_SKB_TX_SHARED flag to priv_flags net: sock_sendmsg_nosec() is static forcedeth: fix vlans gianfar: fix bug caused by 87c288c6e9aa31720b72e2bc2d665e24e1653c3e gro: Only reset frag0 when skb can be pulled ...
| * net: add IFF_SKB_TX_SHARED flag to priv_flagsNeil Horman2011-07-271-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pktgen attempts to transmit shared skbs to net devices, which can't be used by some drivers as they keep state information in skbs. This patch adds a flag marking drivers as being able to handle shared skbs in their tx path. Drivers are defaulted to being unable to do so, but calling ether_setup enables this flag, as 90% of the drivers calling ether_setup touch real hardware and can handle shared skbs. A subsequent patch will audit drivers to ensure that the flag is set properly Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Jiri Pirko <jpirko@redhat.com> CC: Robert Olsson <robert.olsson@its.uu.se> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: Alexey Dobriyan <adobriyan@gmail.com> CC: David S. Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Convert struct net_device uc_promisc to boolJoe Perches2011-07-251-2/+2
| | | | | | | | | | | | | | | | No need to use int, its uses are boolean. May save a few bytes one day. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | atomic: use <linux/atomic.h>Arun Sharma2011-07-261-1/+1
|/ | | | | | | | | | | | | | This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* net: allow netif_carrier to be called safely from IRQstephen hemminger2011-07-221-1/+1
| | | | | | | | | | | | | | | | | | | | As reported by Ben Greer and Froncois Romieu. The code path in the netif_carrier code leads it to try and disable a late workqueue to reenable it immediately netif_carrier_on -> linkwatch_fire_event -> linkwatch_schedule_work -> cancel_delayed_work -> del_timer_sync If __cancel_delayed_work is used instead then there is no problem of waiting for running linkwatch_event. There is a race between linkwatch_event running re-scheduling but it is harmless to schedule an extra scan of the linkwatch queue. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ethtool: Allow zero-length register dumps againBen Hutchings2011-07-211-2/+2
| | | | | | | | | | | | | | | Some drivers (ab)use the ethtool_ops::get_regs operation to expose only a hardware revision ID. Commit a77f5db361ed9953b5b749353ea2c7fed2bf8d93 ('ethtool: Allocate register dump buffer with vmalloc()') had the side-effect of breaking these, as vmalloc() returns a null pointer for size=0 whereas kmalloc() did not. For backward-compatibility, allow zero-length dumps again. Reported-by: Kalle Valo <kvalo@qca.qualcomm.com> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Cc: stable@kernel.org [2.6.37+] Signed-off-by: David S. Miller <davem@davemloft.net>
* skbuff: fix error handling in pskb_copy()Dan Carpenter2011-07-211-1/+2
| | | | | | | | | | | There are two problems: 1) "n" was allocated with alloc_skb() so we should free it with kfree_skb() instead of regular kfree(). 2) We return the freed pointer instead of NULL. Signed-off-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Abstract dst->neighbour accesses behind helpers.David S. Miller2011-07-172-8/+9
| | | | | | dst_{get,set}_neighbour() Signed-off-by: David S. Miller <davem@davemloft.net>
* neigh: Pass neighbour entry to output ops.David S. Miller2011-07-171-11/+14
| | | | | | | | | | This will get us closer to being able to do "neigh stuff" completely independent of the underlying dst_entry for protocols (ipv4/ipv6) that wish to do so. We will also be able to make dst entries neigh-less. Signed-off-by: David S. Miller <davem@davemloft.net>
* neigh: Kill ndisc_ops->queue_xmitDavid S. Miller2011-07-161-2/+2
| | | | | | It is always dev_queue_xmit(). Signed-off-by: David S. Miller <davem@davemloft.net>