summaryrefslogtreecommitdiffstats
path: root/net/packet
Commit message (Collapse)AuthorAgeFilesLines
* packet: fix possible dev refcnt leak when bind failWei Yongjun2011-12-271-1/+5
| | | | | | | | If bind is fail when bind is called after set PACKET_FANOUT sock option, the dev refcnt will leak. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: relax rcvbuf limitsEric Dumazet2011-12-231-4/+2
| | | | | | | | | | | | | | skb->truesize might be big even for a small packet. Its even bigger after commit 87fb4b7b533 (net: more accurate skb truesize) and big MTU. We should allow queueing at least one packet per receiver, even with a low RCVBUF setting. Reported-by: Michal Simek <monstr@monstr.eu> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* af_packet: de-inline some helper functionsOlof Johansson2011-11-031-26/+26
| | | | | | | | | | | | | | | | | | | This popped some compiler errors due to mismatched prototypes. Just remove most manual inlines, the compiler should be able to figure out what makes sense to inline and not. net/packet/af_packet.c:252: warning: 'prb_curr_blk_in_use' declared inline after being called net/packet/af_packet.c:252: warning: previous declaration of 'prb_curr_blk_in_use' was here net/packet/af_packet.c:258: warning: 'prb_queue_frozen' declared inline after being called net/packet/af_packet.c:258: warning: previous declaration of 'prb_queue_frozen' was here net/packet/af_packet.c:248: warning: 'packet_previous_frame' declared inline after being called net/packet/af_packet.c:248: warning: previous declaration of 'packet_previous_frame' was here net/packet/af_packet.c:251: warning: 'packet_increment_head' declared inline after being called net/packet/af_packet.c:251: warning: previous declaration of 'packet_increment_head' was here Signed-off-by: Olof Johansson <olof@lixom.net> Cc: Chetan Loke <loke.chetan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* macvlan: handle fragmented multicast framesEric Dumazet2011-10-181-38/+1
| | | | | | | | | | | Fragmented multicast frames are delivered to a single macvlan port, because ip defrag logic considers other samples are redundant. Implement a defrag step before trying to send the multicast frame. Reported-by: Ben Greear <greearb@candelatech.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* af_packet: remove unnecessary BUG_ON() in tpacket_destruct_skbdanborkmann@iogearbox.net2011-10-101-2/+0
| | | | | | | | | | If skb is NULL, then stack trace is thrown anyway on dereference. Therefore, the stack trace triggered by BUG_ON is duplicate. Signed-off-by: Daniel Borkmann <danborkmann@googlemail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of github.com:davem330/netDavid S. Miller2011-10-071-1/+4
|\ | | | | | | | | Conflicts: net/batman-adv/soft-interface.c
| * make PACKET_STATISTICS getsockopt report consistently between ring and non-ringWillem de Bruijn2011-10-031-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a minor change. Up until kernel 2.6.32, getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, ...) would return total and dropped packets since its last invocation. The introduction of socket queue overflow reporting [1] changed drop rate calculation in the normal packet socket path, but not when using a packet ring. As a result, the getsockopt now returns different statistics depending on the reception method used. With a ring, it still returns the count since the last call, as counts are incremented in tpacket_rcv and reset in getsockopt. Without a ring, it returns 0 if no drops occurred since the last getsockopt and the total drops over the lifespan of the socket otherwise. The culprit is this line in packet_rcv, executed on a drop: drop_n_acct: po->stats.tp_drops = atomic_inc_return(&sk->sk_drops); As it shows, the new drop number it taken from the socket drop counter, which is not reset at getsockopt. I put together a small example that demonstrates the issue [2]. It runs for 10 seconds and overflows the queue/ring on every odd second. The reported drop rates are: ring: 16, 0, 16, 0, 16, ... non-ring: 0, 15, 0, 30, 0, 46, 0, 60, 0 , 74. Note how the even ring counts monotonically increase. Because the getsockopt adds tp_drops to tp_packets, total counts are similarly reported cumulatively. Long story short, reinstating the original code, as the below patch does, fixes the issue at the cost of additional per-packet cycles. Another solution that does not introduce per-packet overhead is be to keep the current data path, record the value of sk_drops at getsockopt() at call N in a new field in struct packetsock and subtract that when reporting at call N+1. I'll be happy to code that, instead, it's just more messy. [1] http://patchwork.ozlabs.org/patch/35665/ [2] http://kernel.googlecode.com/files/test-packetsock-getstatistics.c Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: consolidate and fix ethtool_ops->get_settings callingJiri Pirko2011-09-151-25/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch does several things: - introduces __ethtool_get_settings which is called from ethtool code and from drivers as well. Put ASSERT_RTNL there. - dev_ethtool_get_settings() is replaced by __ethtool_get_settings() - changes calling in drivers so rtnl locking is respected. In iboe_get_rate was previously ->get_settings() called unlocked. This fixes it. Also prb_calc_retire_blk_tmo() in af_packet.c had the same problem. Also fixed by calling __dev_get_by_index() instead of dev_get_by_index() and holding rtnl_lock for both calls. - introduces rtnl_lock in bnx2fc_vport_create() and fcoe_vport_create() so bnx2fc_if_create() and fcoe_if_create() are called locked as they are from other places. - use __ethtool_get_settings() in bonding code Signed-off-by: Jiri Pirko <jpirko@redhat.com> v2->v3: -removed dev_ethtool_get_settings() -added ASSERT_RTNL into __ethtool_get_settings() -prb_calc_retire_blk_tmo - use __dev_get_by_index() and lock around it and __ethtool_get_settings() call v1->v2: add missing export_symbol Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> [except FCoE bits] Acked-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | af_packet: Prefixed tpacket_v3 structs to avoid name space collisionchetan loke2011-08-261-55/+62
| | | | | | | | | | | | | | | | | | | | structs introduced in tpacket_v3 implementation are prefixed with 'tpacket' to avoid namespace collision. Compile tested. Signed-off-by: Chetan Loke <loke.chetan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | af-packet: TPACKET_V3 flexible buffer implementation.chetan loke2011-08-241-46/+891
|/ | | | | | | | | | | | | | | | | | | | | | | | | 1) Blocks can be configured with non-static frame-size. 2) Read/poll is at a block-level(as opposed to packet-level). 3) Added poll timeout to avoid indefinite user-space wait on idle links. 4) Added user-configurable knobs: 4.1) block::timeout. 4.2) tpkt_hdr::sk_rxhash. Changes: C1) tpacket_rcv() C1.1) packet_current_frame() is replaced by packet_current_rx_frame() The bulk of the processing is then moved in the following chain: packet_current_rx_frame() __packet_lookup_frame_in_block fill_curr_block() or retire_current_block dispatch_next_block or return NULL(queue is plugged/paused) Signed-off-by: Chetan Loke <loke.chetan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* af-packet: fix - avoid reading stale dataChetan Loke2011-07-141-1/+2
| | | | | | | | | | Currently we flush tp_status and then flush the remainder of the header+payload. tp_status should be flushed in the end to avoid stale data being read by user-space. Incorrectly re-ordered barriers in v1. Signed-off-by: Chetan Loke <loke.chetan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* packet: Fix build with INET disabled.David S. Miller2011-07-071-0/+2
| | | | | | | | | af_packet.c:(.text+0x3d130): undefined reference to `ip_defrag' or ERROR: "ip_defrag" [net/packet/af_packet.ko] undefined! Reported-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* af_packet: lock imbalanceEric Dumazet2011-07-071-31/+31
| | | | | | | | | fanout_add() might return with fanout_mutex held. Reduce indentation level while we are at it Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* packet: Fix leak in pre-defrag support.David S. Miller2011-07-061-1/+1
| | | | | | | | When we clone the SKB, we forget about the original one. Avoid this problem by using skb_share_check(). Reported-by: Penttilä Mika <mika.penttila@ixonos.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* packet: Add 'cpu' fanout policy.David S. Miller2011-07-061-37/+28
| | | | | | | | Unfortunately we have to use a real modulus here as the multiply trick won't work as effectively with cpu numbers as it does with rxhash values. Signed-off-by: David S. Miller <davem@davemloft.net>
* packet: Add pre-defragmentation support for ipv4 fanouts.David S. Miller2011-07-051-2/+48
| | | | | | | | | The skb->rxhash cannot be properly computed if the packet is a fragment. To alleviate this, allow the AF_PACKET client to ask for defragmentation to be done at demux time. Signed-off-by: David S. Miller <davem@davemloft.net>
* packet: Add fanout support.David S. Miller2011-07-051-5/+251
| | | | | | | | | | | | | | | | | | | | | Fanouts allow packet capturing to be demuxed to a set of AF_PACKET sockets. Two fanout policies are implemented: 1) Hashing based upon skb->rxhash 2) Pure round-robin An AF_PACKET socket must be fully bound before it tries to add itself to a fanout. All AF_PACKET sockets trying to join the same fanout must all have the same bind settings. Fanouts are identified (within a network namespace) by a 16-bit ID. The first socket to try to add itself to a fanout with a particular ID, creates that fanout. When the last socket leaves the fanout (which happens only when the socket is closed), that fanout is destroyed. Signed-off-by: David S. Miller <davem@davemloft.net>
* packet: Add helpers to register/unregister ->prot_hookDavid S. Miller2011-07-051-44/+59
| | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2011-06-201-0/+2
|\ | | | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/iwlwifi/iwl-agn-rxon.c drivers/net/wireless/rtlwifi/pci.c net/netfilter/ipvs/ip_vs_core.c
| * af_packet: prevent information leakEric Dumazet2011-06-061-0/+2
| | | | | | | | | | | | | | | | | | | | | | In 2.6.27, commit 393e52e33c6c2 (packet: deliver VLAN TCI to userspace) added a small information leak. Add padding field and make sure its zeroed before copy to user. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | virtio_net: introduce VIRTIO_NET_HDR_F_DATA_VALIDJason Wang2011-06-111-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's no need for the guest to validate the checksum if it have been validated by host nics. So this patch introduces a new flag - VIRTIO_NET_HDR_F_DATA_VALID which is used to bypass the checksum examing in guest. The backend (tap/macvtap) may set this flag when met skbs with CHECKSUM_UNNECESSARY to save cpu utilization. No feature negotiation is needed as old driver just ignore this flag. Iperf shows 12%-30% performance improvement for UDP traffic. For TCP, when gro is on no difference as it produces skb with partial checksum. But when gro is disabled, 20% or even higher improvement could be measured by netperf. Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | af-packet: Use existing netdev reference for bound sockets.Ben Greear2011-06-051-12/+15
| | | | | | | | | | | | | | | | This saves a network device lookup on each packet transmitted, for sockets that are bound to a network device. Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | af-packet: Hold reference to bound network devices.Ben Greear2011-06-051-5/+9
|/ | | | | | | | | Old code was probably safe, but with this change we can actually use the netdev object, not just compare the pointer values. Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* af-packet: Add flag to distinguish VID 0 from no-vlan.Ben Greear2011-06-011-3/+12
| | | | | | | | | | | | | | Currently, user-space cannot determine if a 0 tcp_vlan_tci means there is no VLAN tag or the VLAN ID was zero. Add flag to make this explicit. User-space can check for TP_STATUS_VLAN_VALID || tp_vlan_tci > 0, which will be backwards compatible. Older could would have just checked for tp_vlan_tci, so it will work no worse than before. Signed-off-by: Ben Greear <greearb@candelatech.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: convert %p usage to %pKDan Rosenberg2011-05-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The %pK format specifier is designed to hide exposed kernel pointers, specifically via /proc interfaces. Exposing these pointers provides an easy target for kernel write vulnerabilities, since they reveal the locations of writable structures containing easily triggerable function pointers. The behavior of %pK depends on the kptr_restrict sysctl. If kptr_restrict is set to 0, no deviation from the standard %p behavior occurs. If kptr_restrict is set to 1, the default, if the current user (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG (currently in the LSM tree), kernel pointers using %pK are printed as 0's. If kptr_restrict is set to 2, kernel pointers using %pK are printed as 0's regardless of privileges. Replacing with 0's was chosen over the default "(null)", which cannot be parsed by userland %p, which expects "(nil)". The supporting code for kptr_restrict and %pK are currently in the -mm tree. This patch converts users of %p in net/ to %pK. Cases of printing pointers to the syslog are not covered, since this would eliminate useful information for postmortem debugging and the reading of the syslog is already optionally protected by the dmesg_restrict sysctl. Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com> Cc: James Morris <jmorris@namei.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Thomas Graf <tgraf@infradead.org> Cc: Eugene Teo <eugeneteo@kernel.org> Cc: Kees Cook <kees.cook@canonical.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David S. Miller <davem@davemloft.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Eric Paris <eparis@parisplace.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: filter: Just In Time compiler for x86-64Eric Dumazet2011-04-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to speedup packet filtering, here is an implementation of a JIT compiler for x86_64 It is disabled by default, and must be enabled by the admin. echo 1 >/proc/sys/net/core/bpf_jit_enable It uses module_alloc() and module_free() to get memory in the 2GB text kernel range since we call helpers functions from the generated code. EAX : BPF A accumulator EBX : BPF X accumulator RDI : pointer to skb (first argument given to JIT function) RBP : frame pointer (even if CONFIG_FRAME_POINTER=n) r9d : skb->len - skb->data_len (headlen) r8 : skb->data To get a trace of generated code, use : echo 2 >/proc/sys/net/core/bpf_jit_enable Example of generated code : # tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24 flen=18 proglen=147 pass=3 image=ffffffffa00b5000 JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60 JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00 JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00 JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0 JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00 JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00 JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24 JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31 JIT code: ffffffffa00b5090: c0 c9 c3 BPF program is 144 bytes long, so native program is almost same size ;) (000) ldh [12] (001) jeq #0x800 jt 2 jf 8 (002) ld [26] (003) and #0xffffff00 (004) jeq #0xc0a81400 jt 16 jf 5 (005) ld [30] (006) and #0xffffff00 (007) jeq #0xc0a81400 jt 16 jf 17 (008) jeq #0x806 jt 10 jf 9 (009) jeq #0x8035 jt 10 jf 17 (010) ld [28] (011) and #0xffffff00 (012) jeq #0xc0a81400 jt 16 jf 13 (013) ld [38] (014) and #0xffffff00 (015) jeq #0xc0a81400 jt 16 jf 17 (016) ret #65535 (017) ret #0 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* af_packet: struct socket declared/assigned but unusedHagen Paul Pfeifer2011-03-071-3/+0
| | | | | Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* network: Allow af_packet to transmit +4 bytes for VLAN packets.Ben Greear2011-02-111-2/+29
| | | | | | | | | | | This allows user-space to send a '1500' MTU VLAN packet on a 1500 MTU ethernet frame. The extra 4 bytes of a VLAN header is not usually charged against the MTU when other parts of the network stack is transmitting vlans... Signed-off-by: Ben Greear <greearb@candelatech.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: cleanup unused macros in net directoryShan Wei2011-01-191-1/+0
| | | | | | | | | | | Clean up some unused macros in net/*. 1. be left for code change. e.g. PGV_FROM_VMALLOC, PGV_FROM_VMALLOC, KMEM_SAFETYZONE. 2. never be used since introduced to kernel. e.g. P9_RDMA_MAX_SGE, UTIL_CTRL_PKT_SIZE. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Acked-by: Sjur Braendeland <sjur.brandeland@stericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: filter: dont block softirqs in sk_run_filter()Eric Dumazet2011-01-181-3/+3
| | | | | | | | Packet filter (BPF) doesnt need to disable softirqs, being fully re-entrant and lock-less. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Use skb_checksum_start_offset()Michał Mirosław2010-12-161-2/+1
| | | | | | | | | Replace skb->csum_start - skb_headroom(skb) with skb_checksum_start_offset(). Note for usb/smsc95xx: skb->data - skb->head == skb_headroom(skb). Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
* af_packet: use swap() instead of the open coded macro XC()Changli Gao2010-12-101-5/+3
| | | | | Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* af_packet: fix freeing pg_vec twice on error pathChangli Gao2010-12-081-1/+0
| | | | | | | | | | | | It is introduced in: commit 0e3125c755445664f00ad036e4fc2cd32fd52877 Author: Neil Horman <nhorman@tuxdriver.com> Date: Tue Nov 16 10:26:47 2010 -0800 packet: Enhance AF_PACKET implementation to not require high order contiguous memory allocation (v4) Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* af_packet: eliminate pgv_to_page on some archesChangli Gao2010-12-081-1/+3
| | | | | | | | Some arches don't need flush_dcache_page(), and don't implement it, so we can eliminate pgv_to_page() calls on those arches. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* filter: constify sk_run_filter()Eric Dumazet2010-12-081-15/+16
| | | | | | | | | | sk_run_filter() doesnt write on skb, change its prototype to reflect this. Fix two af_packet comments. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* af_packet: remove pgv.flagsChangli Gao2010-12-061-15/+5
| | | | | | | | As we can check if an address is vmalloc address with is_vmalloc_addr(), we remove pgv.flags. Then we may get more pg_vecs. Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* af_packet: use vmalloc_to_page() instead for the addresss returned by vmalloc()Changli Gao2010-12-061-17/+19
| | | | | | | | | | | | | | | | | | | | | | | The following commit causes the pgv->buffer may point to the memory returned by vmalloc(). And we can't use virt_to_page() for the vmalloc address. This patch introduces a new inline function pgv_to_page(), which calls vmalloc_to_page() for the vmalloc address, and virt_to_page() for the __get_free_pages address. We used to increase page pointer to get the next page at the next page address, after Neil's patch, it is wrong, as the physical address may be not continuous. This patch also fixes this issue. commit 0e3125c755445664f00ad036e4fc2cd32fd52877 Author: Neil Horman <nhorman@tuxdriver.com> Date: Tue Nov 16 10:26:47 2010 -0800 packet: Enhance AF_PACKET implementation to not require high order contiguous memory allocation (v4) Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* packet: use vzalloc()Eric Dumazet2010-11-211-1/+1
| | | | | | | | | | alloc_one_pg_vec_page() is supposed to return zeroed memory, so use vzalloc() instead of vmalloc() Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* filter: optimize sk_run_filterEric Dumazet2010-11-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove pc variable to avoid arithmetic to compute fentry at each filter instruction. Jumps directly manipulate fentry pointer. As the last instruction of filter[] is guaranteed to be a RETURN, and all jumps are before the last instruction, we dont need to check filter bounds (number of instructions in filter array) at each iteration, so we remove it from sk_run_filter() params. On x86_32 remove f_k var introduced in commit 57fe93b374a6b871 (filter: make sure filters dont read uninitialized memory) Note : We could use a CONFIG_ARCH_HAS_{FEW|MANY}_REGISTERS in order to avoid too many ifdefs in this code. This helps compiler to use cpu registers to hold fentry and A accumulator. On x86_32, this saves 401 bytes, and more important, sk_run_filter() runs much faster because less register pressure (One less conditional branch per BPF instruction) # size net/core/filter.o net/core/filter_pre.o text data bss dec hex filename 2948 0 0 2948 b84 net/core/filter.o 3349 0 0 3349 d15 net/core/filter_pre.o on x86_64 : # size net/core/filter.o net/core/filter_pre.o text data bss dec hex filename 5173 0 0 5173 1435 net/core/filter.o 5224 0 0 5224 1468 net/core/filter_pre.o Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* packet: Enhance AF_PACKET implementation to not require high order ↵Neil Horman2010-11-161-16/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | contiguous memory allocation (v4) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Version 4 of this patch. Change notes: 1) Removed extra memset. Didn't think kcalloc added a GFP_ZERO the way kzalloc did :) Summary: It was shown to me recently that systems under high load were driven very deep into swap when tcpdump was run. The reason this happened was because the AF_PACKET protocol has a SET_RINGBUFFER socket option that allows the user space application to specify how many entries an AF_PACKET socket will have and how large each entry will be. It seems the default setting for tcpdump is to set the ring buffer to 32 entries of 64 Kb each, which implies 32 order 5 allocation. Thats difficult under good circumstances, and horrid under memory pressure. I thought it would be good to make that a bit more usable. I was going to do a simple conversion of the ring buffer from contigous pages to iovecs, but unfortunately, the metadata which AF_PACKET places in these buffers can easily span a page boundary, and given that these buffers get mapped into user space, and the data layout doesn't easily allow for a change to padding between frames to avoid that, a simple iovec change is just going to break user space ABI consistency. So I've done this, I've added a three tiered mechanism to the af_packet set_ring socket option. It attempts to allocate memory in the following order: 1) Using __get_free_pages with GFP_NORETRY set, so as to fail quickly without digging into swap 2) Using vmalloc 3) Using __get_free_pages with GFP_NORETRY clear, causing us to try as hard as needed to get the memory The effect is that we don't disturb the system as much when we're under load, while still being able to conduct tcpdumps effectively. Tested successfully by me. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Maciej Żenczykowski <zenczykowski@gmail.com> Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Fix header size check for GSO case in recvmsg (af_packet)Mariusz Kozlowski2010-11-121-1/+3
| | | | | | | Parameter 'len' is size_t type so it will never get negative. Signed-off-by: Mariusz Kozlowski <mk@lab.zgora.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: packet: fix information leak to userlandVasiliy Kulikov2010-11-101-1/+2
| | | | | | | | | | | | | packet_getname_spkt() doesn't initialize all members of sa_data field of sockaddr struct if strlen(dev->name) < 13. This structure is then copied to userland. It leads to leaking of contents of kernel stack memory. We have to fully fill sa_data with strncpy() instead of strlcpy(). The same with packet_getname(): it doesn't initialize sll_pkttype field of sockaddr_ll. Set it to zero. Signed-off-by: Vasiliy Kulikov <segooon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: simplify flags for tx timestampingOliver Hartkopp2010-08-191-2/+2
| | | | | | | | | | | | | This patch removes the abstraction introduced by the union skb_shared_tx in the shared skb data. The access of the different union elements at several places led to some confusion about accessing the shared tx_flags e.g. in skb_orphan_try(). http://marc.info/?l=linux-netdev&m=128084897415886&w=2 Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* packet_mmap: expose hw packet timestamps to network packet capture utilitiesScott McMillan2010-06-021-2/+35
| | | | | | | | | | | | | | | | | | | This patch adds a setting, PACKET_TIMESTAMP, to specify the packet timestamp source that is exported to capture utilities like tcpdump by packet_mmap. PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING. However, only the SOF_TIMESTAMPING_SYS_HARDWARE and SOF_TIMESTAMPING_RAW_HARDWARE values are currently recognized by PACKET_TIMESTAMP. SOF_TIMESTAMPING_SYS_HARDWARE takes precedence over SOF_TIMESTAMPING_RAW_HARDWARE if both bits are set. If PACKET_TIMESTAMP is not set, a software timestamp generated inside the networking stack is used (the behavior before this setting was added). Signed-off-by: Scott McMillan <scott.a.mcmillan@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2010-04-211-2/+0
|\ | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/iwlwifi/iwl-6000.c net/core/dev.c
| * packet : remove init_net restrictionDaniel Lezcano2010-04-161-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The af_packet protocol is used by Perl to do ioctls as reported by Stephane Riviere: "Net::RawIP relies on SIOCGIFADDR et SIOCGIFHWADDR to get the IP and MAC addresses of the network interface." But in a new network namespace these ioctl fail because it is disabled for a namespace different from the init_net_ns. These two lines should not be there as af_inet and af_packet are namespace aware since a long time now. I suppose we forget to remove these lines because we sent the af_packet first, before af_inet was supported. Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Reported-by: Stephane Riviere <stephane.riviere@regis-dgac.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | packet: support for TX time stamps on RAW socketsRichard Cochran2010-04-131-1/+60
| | | | | | | | | | | | | | | | | | | | | | Enable the SO_TIMESTAMPING socket infrastructure for raw packet sockets. We introduce PACKET_TX_TIMESTAMP for the control message cmsg_type. Similar support for UDP and CAN sockets was added in commit 51f31cabe3ce5345b51e4a4f82138b38c4d5dc91 Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'master' of ↵David S. Miller2010-04-111-0/+1
|\| | | | | | | | | | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/stmmac/stmmac_main.c drivers/net/wireless/wl12xx/wl1271_cmd.c drivers/net/wireless/wl12xx/wl1271_main.c drivers/net/wireless/wl12xx/wl1271_spi.c net/core/ethtool.c net/mac80211/scan.c
| * include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo2010-03-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
* | net: convert multicast list to list_headJiri Pirko2010-04-031-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | Converts the list and the core manipulating with it to be the same as uc_list. +uses two functions for adding/removing mc address (normal and "global" variant) instead of a function parameter. +removes dev_mcast.c completely. +exposes netdev_hw_addr_list_* macros along with __hw_addr_* functions for manipulation with lists on a sandbox (used in bonding and 80211 drivers) Signed-off-by: Jiri Pirko <jpirko@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>