From da70314917862d4da4a8d7601cd47339df8b3c23 Mon Sep 17 00:00:00 2001 From: Andrey Ignatov Date: Wed, 17 Apr 2019 22:28:57 -0700 Subject: bpf: Document BPF_PROG_TYPE_CGROUP_SYSCTL Add documentation for BPF_PROG_TYPE_CGROUP_SYSCTL, including general info, attach type, context, return code, helpers, example and usage considerations. A separate file prog_cgroup_sysctl.rst is added to Documentation/bpf/. In the future more program types can be documented in their own prog_.rst files. Another way to place program type specific documentation would be to group program types somehow (e.g. cgroup.rst for all cgroup-bpf programs), but it may not scale well since some program types may belong to different groups, e.g. BPF_PROG_TYPE_CGROUP_SKB can be documented together with either cgroup-bpf programs or programs that access skb. The new file is added to the index and verified by `make htmldocs` / sanity-check by lynx. Signed-off-by: Andrey Ignatov Acked-by: Yonghong Song Signed-off-by: Alexei Starovoitov --- Documentation/bpf/index.rst | 9 +++ Documentation/bpf/prog_cgroup_sysctl.rst | 125 +++++++++++++++++++++++++++++++ 2 files changed, 134 insertions(+) create mode 100644 Documentation/bpf/prog_cgroup_sysctl.rst (limited to 'Documentation') diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst index 4e77932959cc..dadcaa9a9f5f 100644 --- a/Documentation/bpf/index.rst +++ b/Documentation/bpf/index.rst @@ -36,6 +36,15 @@ Two sets of Questions and Answers (Q&A) are maintained. bpf_devel_QA +Program types +============= + +.. toctree:: + :maxdepth: 1 + + prog_cgroup_sysctl + + .. Links: .. _Documentation/networking/filter.txt: ../networking/filter.txt .. _man-pages: https://www.kernel.org/doc/man-pages/ diff --git a/Documentation/bpf/prog_cgroup_sysctl.rst b/Documentation/bpf/prog_cgroup_sysctl.rst new file mode 100644 index 000000000000..677d6c637cf3 --- /dev/null +++ b/Documentation/bpf/prog_cgroup_sysctl.rst @@ -0,0 +1,125 @@ +.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) + +=========================== +BPF_PROG_TYPE_CGROUP_SYSCTL +=========================== + +This document describes ``BPF_PROG_TYPE_CGROUP_SYSCTL`` program type that +provides cgroup-bpf hook for sysctl. + +The hook has to be attached to a cgroup and will be called every time a +process inside that cgroup tries to read from or write to sysctl knob in proc. + +1. Attach type +************** + +``BPF_CGROUP_SYSCTL`` attach type has to be used to attach +``BPF_PROG_TYPE_CGROUP_SYSCTL`` program to a cgroup. + +2. Context +********** + +``BPF_PROG_TYPE_CGROUP_SYSCTL`` provides access to the following context from +BPF program:: + + struct bpf_sysctl { + __u32 write; + __u32 file_pos; + }; + +* ``write`` indicates whether sysctl value is being read (``0``) or written + (``1``). This field is read-only. + +* ``file_pos`` indicates file position sysctl is being accessed at, read + or written. This field is read-write. Writing to the field sets the starting + position in sysctl proc file ``read(2)`` will be reading from or ``write(2)`` + will be writing to. Writing zero to the field can be used e.g. to override + whole sysctl value by ``bpf_sysctl_set_new_value()`` on ``write(2)`` even + when it's called by user space on ``file_pos > 0``. Writing non-zero + value to the field can be used to access part of sysctl value starting from + specified ``file_pos``. Not all sysctl support access with ``file_pos != + 0``, e.g. writes to numeric sysctl entries must always be at file position + ``0``. See also ``kernel.sysctl_writes_strict`` sysctl. + +See `linux/bpf.h`_ for more details on how context field can be accessed. + +3. Return code +************** + +``BPF_PROG_TYPE_CGROUP_SYSCTL`` program must return one of the following +return codes: + +* ``0`` means "reject access to sysctl"; +* ``1`` means "proceed with access". + +If program returns ``0`` user space will get ``-1`` from ``read(2)`` or +``write(2)`` and ``errno`` will be set to ``EPERM``. + +4. Helpers +********** + +Since sysctl knob is represented by a name and a value, sysctl specific BPF +helpers focus on providing access to these properties: + +* ``bpf_sysctl_get_name()`` to get sysctl name as it is visible in + ``/proc/sys`` into provided by BPF program buffer; + +* ``bpf_sysctl_get_current_value()`` to get string value currently held by + sysctl into provided by BPF program buffer. This helper is available on both + ``read(2)`` from and ``write(2)`` to sysctl; + +* ``bpf_sysctl_get_new_value()`` to get new string value currently being + written to sysctl before actual write happens. This helper can be used only + on ``ctx->write == 1``; + +* ``bpf_sysctl_set_new_value()`` to override new string value currently being + written to sysctl before actual write happens. Sysctl value will be + overridden starting from the current ``ctx->file_pos``. If the whole value + has to be overridden BPF program can set ``file_pos`` to zero before calling + to the helper. This helper can be used only on ``ctx->write == 1``. New + string value set by the helper is treated and verified by kernel same way as + an equivalent string passed by user space. + +BPF program sees sysctl value same way as user space does in proc filesystem, +i.e. as a string. Since many sysctl values represent an integer or a vector +of integers, the following helpers can be used to get numeric value from the +string: + +* ``bpf_strtol()`` to convert initial part of the string to long integer + similar to user space `strtol(3)`_; +* ``bpf_strtoul()`` to convert initial part of the string to unsigned long + integer similar to user space `strtoul(3)`_; + +See `linux/bpf.h`_ for more details on helpers described here. + +5. Examples +*********** + +See `test_sysctl_prog.c`_ for an example of BPF program in C that access +sysctl name and value, parses string value to get vector of integers and uses +the result to make decision whether to allow or deny access to sysctl. + +6. Notes +******** + +``BPF_PROG_TYPE_CGROUP_SYSCTL`` is intended to be used in **trusted** root +environment, for example to monitor sysctl usage or catch unreasonable values +an application, running as root in a separate cgroup, is trying to set. + +Since `task_dfl_cgroup(current)` is called at `sys_read` / `sys_write` time it +may return results different from that at `sys_open` time, i.e. process that +opened sysctl file in proc filesystem may differ from process that is trying +to read from / write to it and two such processes may run in different +cgroups, what means ``BPF_PROG_TYPE_CGROUP_SYSCTL`` should not be used as a +security mechanism to limit sysctl usage. + +As with any cgroup-bpf program additional care should be taken if an +application running as root in a cgroup should not be allowed to +detach/replace BPF program attached by administrator. + +.. Links +.. _linux/bpf.h: ../../include/uapi/linux/bpf.h +.. _strtol(3): http://man7.org/linux/man-pages/man3/strtol.3p.html +.. _strtoul(3): http://man7.org/linux/man-pages/man3/strtoul.3p.html +.. _test_sysctl_prog.c: + ../../tools/testing/selftests/bpf/progs/test_sysctl_prog.c -- cgit v1.2.3 From 80695946737dff4cfc1ecdefd4ebf300f132d8ee Mon Sep 17 00:00:00 2001 From: Stanislav Fomichev Date: Thu, 18 Apr 2019 16:47:52 -0700 Subject: bpf: move BPF_PROG_TYPE_FLOW_DISSECTOR documentation to a new common place In commit da7031491786 ("bpf: Document BPF_PROG_TYPE_CGROUP_SYSCTL") Andrey proposes to put per-prog type docs under Documentation/bpf/ Let's move flow dissector documentation there as well. Signed-off-by: Stanislav Fomichev Signed-off-by: Alexei Starovoitov --- Documentation/bpf/index.rst | 1 + Documentation/bpf/prog_flow_dissector.rst | 126 ++++++++++++++++++++++++ Documentation/networking/bpf_flow_dissector.rst | 126 ------------------------ Documentation/networking/index.rst | 1 - 4 files changed, 127 insertions(+), 127 deletions(-) create mode 100644 Documentation/bpf/prog_flow_dissector.rst delete mode 100644 Documentation/networking/bpf_flow_dissector.rst (limited to 'Documentation') diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst index dadcaa9a9f5f..d3fe4cac0c90 100644 --- a/Documentation/bpf/index.rst +++ b/Documentation/bpf/index.rst @@ -43,6 +43,7 @@ Program types :maxdepth: 1 prog_cgroup_sysctl + prog_flow_dissector .. Links: diff --git a/Documentation/bpf/prog_flow_dissector.rst b/Documentation/bpf/prog_flow_dissector.rst new file mode 100644 index 000000000000..ed343abe541e --- /dev/null +++ b/Documentation/bpf/prog_flow_dissector.rst @@ -0,0 +1,126 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================ +BPF_PROG_TYPE_FLOW_DISSECTOR +============================ + +Overview +======== + +Flow dissector is a routine that parses metadata out of the packets. It's +used in the various places in the networking subsystem (RFS, flow hash, etc). + +BPF flow dissector is an attempt to reimplement C-based flow dissector logic +in BPF to gain all the benefits of BPF verifier (namely, limits on the +number of instructions and tail calls). + +API +=== + +BPF flow dissector programs operate on an ``__sk_buff``. However, only the +limited set of fields is allowed: ``data``, ``data_end`` and ``flow_keys``. +``flow_keys`` is ``struct bpf_flow_keys`` and contains flow dissector input +and output arguments. + +The inputs are: + * ``nhoff`` - initial offset of the networking header + * ``thoff`` - initial offset of the transport header, initialized to nhoff + * ``n_proto`` - L3 protocol type, parsed out of L2 header + +Flow dissector BPF program should fill out the rest of the ``struct +bpf_flow_keys`` fields. Input arguments ``nhoff/thoff/n_proto`` should be +also adjusted accordingly. + +The return code of the BPF program is either BPF_OK to indicate successful +dissection, or BPF_DROP to indicate parsing error. + +__sk_buff->data +=============== + +In the VLAN-less case, this is what the initial state of the BPF flow +dissector looks like:: + + +------+------+------------+-----------+ + | DMAC | SMAC | ETHER_TYPE | L3_HEADER | + +------+------+------------+-----------+ + ^ + | + +-- flow dissector starts here + + +.. code:: c + + skb->data + flow_keys->nhoff point to the first byte of L3_HEADER + flow_keys->thoff = nhoff + flow_keys->n_proto = ETHER_TYPE + +In case of VLAN, flow dissector can be called with the two different states. + +Pre-VLAN parsing:: + + +------+------+------+-----+-----------+-----------+ + | DMAC | SMAC | TPID | TCI |ETHER_TYPE | L3_HEADER | + +------+------+------+-----+-----------+-----------+ + ^ + | + +-- flow dissector starts here + +.. code:: c + + skb->data + flow_keys->nhoff point the to first byte of TCI + flow_keys->thoff = nhoff + flow_keys->n_proto = TPID + +Please note that TPID can be 802.1AD and, hence, BPF program would +have to parse VLAN information twice for double tagged packets. + + +Post-VLAN parsing:: + + +------+------+------+-----+-----------+-----------+ + | DMAC | SMAC | TPID | TCI |ETHER_TYPE | L3_HEADER | + +------+------+------+-----+-----------+-----------+ + ^ + | + +-- flow dissector starts here + +.. code:: c + + skb->data + flow_keys->nhoff point the to first byte of L3_HEADER + flow_keys->thoff = nhoff + flow_keys->n_proto = ETHER_TYPE + +In this case VLAN information has been processed before the flow dissector +and BPF flow dissector is not required to handle it. + + +The takeaway here is as follows: BPF flow dissector program can be called with +the optional VLAN header and should gracefully handle both cases: when single +or double VLAN is present and when it is not present. The same program +can be called for both cases and would have to be written carefully to +handle both cases. + + +Reference Implementation +======================== + +See ``tools/testing/selftests/bpf/progs/bpf_flow.c`` for the reference +implementation and ``tools/testing/selftests/bpf/flow_dissector_load.[hc]`` +for the loader. bpftool can be used to load BPF flow dissector program as well. + +The reference implementation is organized as follows: + * ``jmp_table`` map that contains sub-programs for each supported L3 protocol + * ``_dissect`` routine - entry point; it does input ``n_proto`` parsing and + does ``bpf_tail_call`` to the appropriate L3 handler + +Since BPF at this point doesn't support looping (or any jumping back), +jmp_table is used instead to handle multiple levels of encapsulation (and +IPv6 options). + + +Current Limitations +=================== +BPF flow dissector doesn't support exporting all the metadata that in-kernel +C-based implementation can export. Notable example is single VLAN (802.1Q) +and double VLAN (802.1AD) tags. Please refer to the ``struct bpf_flow_keys`` +for a set of information that's currently can be exported from the BPF context. diff --git a/Documentation/networking/bpf_flow_dissector.rst b/Documentation/networking/bpf_flow_dissector.rst deleted file mode 100644 index b375ae2ec2c4..000000000000 --- a/Documentation/networking/bpf_flow_dissector.rst +++ /dev/null @@ -1,126 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -================== -BPF Flow Dissector -================== - -Overview -======== - -Flow dissector is a routine that parses metadata out of the packets. It's -used in the various places in the networking subsystem (RFS, flow hash, etc). - -BPF flow dissector is an attempt to reimplement C-based flow dissector logic -in BPF to gain all the benefits of BPF verifier (namely, limits on the -number of instructions and tail calls). - -API -=== - -BPF flow dissector programs operate on an ``__sk_buff``. However, only the -limited set of fields is allowed: ``data``, ``data_end`` and ``flow_keys``. -``flow_keys`` is ``struct bpf_flow_keys`` and contains flow dissector input -and output arguments. - -The inputs are: - * ``nhoff`` - initial offset of the networking header - * ``thoff`` - initial offset of the transport header, initialized to nhoff - * ``n_proto`` - L3 protocol type, parsed out of L2 header - -Flow dissector BPF program should fill out the rest of the ``struct -bpf_flow_keys`` fields. Input arguments ``nhoff/thoff/n_proto`` should be -also adjusted accordingly. - -The return code of the BPF program is either BPF_OK to indicate successful -dissection, or BPF_DROP to indicate parsing error. - -__sk_buff->data -=============== - -In the VLAN-less case, this is what the initial state of the BPF flow -dissector looks like:: - - +------+------+------------+-----------+ - | DMAC | SMAC | ETHER_TYPE | L3_HEADER | - +------+------+------------+-----------+ - ^ - | - +-- flow dissector starts here - - -.. code:: c - - skb->data + flow_keys->nhoff point to the first byte of L3_HEADER - flow_keys->thoff = nhoff - flow_keys->n_proto = ETHER_TYPE - -In case of VLAN, flow dissector can be called with the two different states. - -Pre-VLAN parsing:: - - +------+------+------+-----+-----------+-----------+ - | DMAC | SMAC | TPID | TCI |ETHER_TYPE | L3_HEADER | - +------+------+------+-----+-----------+-----------+ - ^ - | - +-- flow dissector starts here - -.. code:: c - - skb->data + flow_keys->nhoff point the to first byte of TCI - flow_keys->thoff = nhoff - flow_keys->n_proto = TPID - -Please note that TPID can be 802.1AD and, hence, BPF program would -have to parse VLAN information twice for double tagged packets. - - -Post-VLAN parsing:: - - +------+------+------+-----+-----------+-----------+ - | DMAC | SMAC | TPID | TCI |ETHER_TYPE | L3_HEADER | - +------+------+------+-----+-----------+-----------+ - ^ - | - +-- flow dissector starts here - -.. code:: c - - skb->data + flow_keys->nhoff point the to first byte of L3_HEADER - flow_keys->thoff = nhoff - flow_keys->n_proto = ETHER_TYPE - -In this case VLAN information has been processed before the flow dissector -and BPF flow dissector is not required to handle it. - - -The takeaway here is as follows: BPF flow dissector program can be called with -the optional VLAN header and should gracefully handle both cases: when single -or double VLAN is present and when it is not present. The same program -can be called for both cases and would have to be written carefully to -handle both cases. - - -Reference Implementation -======================== - -See ``tools/testing/selftests/bpf/progs/bpf_flow.c`` for the reference -implementation and ``tools/testing/selftests/bpf/flow_dissector_load.[hc]`` -for the loader. bpftool can be used to load BPF flow dissector program as well. - -The reference implementation is organized as follows: - * ``jmp_table`` map that contains sub-programs for each supported L3 protocol - * ``_dissect`` routine - entry point; it does input ``n_proto`` parsing and - does ``bpf_tail_call`` to the appropriate L3 handler - -Since BPF at this point doesn't support looping (or any jumping back), -jmp_table is used instead to handle multiple levels of encapsulation (and -IPv6 options). - - -Current Limitations -=================== -BPF flow dissector doesn't support exporting all the metadata that in-kernel -C-based implementation can export. Notable example is single VLAN (802.1Q) -and double VLAN (802.1AD) tags. Please refer to the ``struct bpf_flow_keys`` -for a set of information that's currently can be exported from the BPF context. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 984e68f9e026..5449149be496 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -9,7 +9,6 @@ Contents: netdev-FAQ af_xdp batman-adv - bpf_flow_dissector can can_ucan_protocol device_drivers/freescale/dpaa2/index -- cgit v1.2.3 From 3b8802446d27522cd6d32178ba975cc492611f31 Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Wed, 17 Apr 2019 18:27:01 -0700 Subject: bpf: document the verifier limits Document the verifier limits. Signed-off-by: Alexei Starovoitov Acked-by: Yonghong Song Signed-off-by: Daniel Borkmann --- Documentation/bpf/bpf_design_QA.rst | 29 +++++++++++++++++++++++++++-- 1 file changed, 27 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/bpf/bpf_design_QA.rst b/Documentation/bpf/bpf_design_QA.rst index 10453c627135..cb402c59eca5 100644 --- a/Documentation/bpf/bpf_design_QA.rst +++ b/Documentation/bpf/bpf_design_QA.rst @@ -85,8 +85,33 @@ Q: Can loops be supported in a safe way? A: It's not clear yet. BPF developers are trying to find a way to -support bounded loops where the verifier can guarantee that -the program terminates in less than 4096 instructions. +support bounded loops. + +Q: What are the verifier limits? +-------------------------------- +A: The only limit known to the user space is BPF_MAXINSNS (4096). +It's the maximum number of instructions that the unprivileged bpf +program can have. The verifier has various internal limits. +Like the maximum number of instructions that can be explored during +program analysis. Currently, that limit is set to 1 million. +Which essentially means that the largest program can consist +of 1 million NOP instructions. There is a limit to the maximum number +of subsequent branches, a limit to the number of nested bpf-to-bpf +calls, a limit to the number of the verifier states per instruction, +a limit to the number of maps used by the program. +All these limits can be hit with a sufficiently complex program. +There are also non-numerical limits that can cause the program +to be rejected. The verifier used to recognize only pointer + constant +expressions. Now it can recognize pointer + bounded_register. +bpf_lookup_map_elem(key) had a requirement that 'key' must be +a pointer to the stack. Now, 'key' can be a pointer to map value. +The verifier is steadily getting 'smarter'. The limits are +being removed. The only way to know that the program is going to +be accepted by the verifier is to try to load it. +The bpf development process guarantees that the future kernel +versions will accept all bpf programs that were accepted by +the earlier versions. + Instruction level questions --------------------------- -- cgit v1.2.3