diff options
author | Mauro Carvalho Chehab <mchehab@s-opensource.com> | 2017-07-17 11:17:36 -0300 |
---|---|---|
committer | Mauro Carvalho Chehab <mchehab@s-opensource.com> | 2017-07-17 11:17:36 -0300 |
commit | a3db9d60a118571e696b684a6e8c692a2b064941 (patch) | |
tree | ff7bae0f79b7a2ee0bce03de4f883550200c52a9 /Documentation | |
parent | 2748e76ddb2967c4030171342ebdd3faa6a5e8e8 (diff) | |
parent | 5771a8c08880cdca3bfb4a3fc6d309d6bba20877 (diff) | |
download | linux-a3db9d60a118571e696b684a6e8c692a2b064941.tar.gz linux-a3db9d60a118571e696b684a6e8c692a2b064941.tar.bz2 linux-a3db9d60a118571e696b684a6e8c692a2b064941.zip |
Merge tag 'v4.13-rc1' into patchwork
Linux v4.13-rc1
* tag 'v4.13-rc1': (11136 commits)
Linux v4.13-rc1
random: reorder READ_ONCE() in get_random_uXX
random: suppress spammy warnings about unseeded randomness
replace incorrect strscpy use in FORTIFY_SOURCE
kmod: throttle kmod thread limit
kmod: add test driver to stress test the module loader
MAINTAINERS: give kmod some maintainer love
xtensa: use generic fb.h
fault-inject: add /proc/<pid>/fail-nth
fault-inject: simplify access check for fail-nth
fault-inject: make fail-nth read/write interface symmetric
fault-inject: parse as natural 1-based value for fail-nth write interface
fault-inject: automatically detect the number base for fail-nth write interface
kernel/watchdog.c: use better pr_fmt prefix
MAINTAINERS: move the befs tree to kernel.org
lib/atomic64_test.c: add a test that atomic64_inc_not_zero() returns an int
mm: fix overflow check in expand_upwards()
ubifs: Set double hash cookie also for RENAME_EXCHANGE
ubifs: Massage assert in ubifs_xattr_set() wrt. init_xattrs
ubifs: Don't leak kernel memory to the MTD
...
Diffstat (limited to 'Documentation')
537 files changed, 25488 insertions, 18284 deletions
diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index ed3e5e949fce..3bec49c33bbb 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX @@ -24,8 +24,6 @@ DMA-ISA-LPC.txt - How to do DMA with ISA (and LPC) devices. DMA-attributes.txt - listing of the various possible attributes a DMA region can have -DocBook/ - - directory with DocBook templates etc. for kernel documentation. EDID/ - directory with info on customizing EDID for broken gfx/displays. IPMI.txt @@ -40,8 +38,6 @@ Intel-IOMMU.txt - basic info on the Intel IOMMU virtualization support. Makefile - It's not of interest for those who aren't touching the build system. -Makefile.sphinx - - It's not of interest for those who aren't touching the build system. PCI/ - info related to PCI drivers. RCU/ @@ -246,8 +242,6 @@ kprobes.txt - documents the kernel probes debugging feature. kref.txt - docs on adding reference counters (krefs) to kernel objects. -kselftest.txt - - small unittests for (some) individual codepaths in the kernel. laptops/ - directory with laptop related info and laptop driver documentation. ldm.txt @@ -264,6 +258,8 @@ logo.gif - full colour GIF image of Linux logo (penguin - Tux). logo.txt - info on creator of above logo & site to get additional images from. +lsm.txt + - Linux Security Modules: General Security Hooks for Linux lzo.txt - kernel LZO decompressor input formats m68k/ diff --git a/Documentation/ABI/stable/sysfs-class-udc b/Documentation/ABI/stable/sysfs-class-udc index 85d3dac2e204..d1e2f3ec1fc9 100644 --- a/Documentation/ABI/stable/sysfs-class-udc +++ b/Documentation/ABI/stable/sysfs-class-udc @@ -55,14 +55,6 @@ Description: Indicates the maximum USB speed supported by this port. Users: -What: /sys/class/udc/<udc>/maximum_speed -Date: June 2011 -KernelVersion: 3.1 -Contact: Felipe Balbi <balbi@kernel.org> -Description: - Indicates the maximum USB speed supported by this port. -Users: - What: /sys/class/udc/<udc>/soft_connect Date: June 2011 KernelVersion: 3.1 @@ -91,3 +83,11 @@ Description: 'configured', and 'suspended'; however not all USB Device Controllers support reporting all states. Users: + +What: /sys/class/udc/<udc>/function +Date: June 2017 +KernelVersion: 4.13 +Contact: Felipe Balbi <balbi@kernel.org> +Description: + Prints out name of currently running USB Gadget Driver. +Users: diff --git a/Documentation/ABI/stable/sysfs-driver-aspeed-vuart b/Documentation/ABI/stable/sysfs-driver-aspeed-vuart new file mode 100644 index 000000000000..8062953ce77b --- /dev/null +++ b/Documentation/ABI/stable/sysfs-driver-aspeed-vuart @@ -0,0 +1,15 @@ +What: /sys/bus/platform/drivers/aspeed-vuart/*/lpc_address +Date: April 2017 +Contact: Jeremy Kerr <jk@ozlabs.org> +Description: Configures which IO port the host side of the UART + will appear on the host <-> BMC LPC bus. +Users: OpenBMC. Proposed changes should be mailed to + openbmc@lists.ozlabs.org + +What: /sys/bus/platform/drivers/aspeed-vuart*/sirq +Date: April 2017 +Contact: Jeremy Kerr <jk@ozlabs.org> +Description: Configures which interrupt number the host side of + the UART will appear on the host <-> BMC LPC bus. +Users: OpenBMC. Proposed changes should be mailed to + openbmc@lists.ozlabs.org diff --git a/Documentation/ABI/stable/sysfs-hypervisor-xen b/Documentation/ABI/stable/sysfs-hypervisor-xen new file mode 100644 index 000000000000..3cf5cdfcd9a8 --- /dev/null +++ b/Documentation/ABI/stable/sysfs-hypervisor-xen @@ -0,0 +1,119 @@ +What: /sys/hypervisor/compilation/compile_date +Date: March 2009 +KernelVersion: 2.6.30 +Contact: xen-devel@lists.xenproject.org +Description: If running under Xen: + Contains the build time stamp of the Xen hypervisor + Might return "<denied>" in case of special security settings + in the hypervisor. + +What: /sys/hypervisor/compilation/compiled_by +Date: March 2009 +KernelVersion: 2.6.30 +Contact: xen-devel@lists.xenproject.org +Description: If running under Xen: + Contains information who built the Xen hypervisor + Might return "<denied>" in case of special security settings + in the hypervisor. + +What: /sys/hypervisor/compilation/compiler +Date: March 2009 +KernelVersion: 2.6.30 +Contact: xen-devel@lists.xenproject.org +Description: If running under Xen: + Compiler which was used to build the Xen hypervisor + Might return "<denied>" in case of special security settings + in the hypervisor. + +What: /sys/hypervisor/properties/capabilities +Date: March 2009 +KernelVersion: 2.6.30 +Contact: xen-devel@lists.xenproject.org +Description: If running under Xen: + Space separated list of supported guest system types. Each type + is in the format: <class>-<major>.<minor>-<arch> + With: + <class>: "xen" -- x86: paravirtualized, arm: standard + "hvm" -- x86 only: fully virtualized + <major>: major guest interface version + <minor>: minor guest interface version + <arch>: architecture, e.g.: + "x86_32": 32 bit x86 guest without PAE + "x86_32p": 32 bit x86 guest with PAE + "x86_64": 64 bit x86 guest + "armv7l": 32 bit arm guest + "aarch64": 64 bit arm guest + +What: /sys/hypervisor/properties/changeset +Date: March 2009 +KernelVersion: 2.6.30 +Contact: xen-devel@lists.xenproject.org +Description: If running under Xen: + Changeset of the hypervisor (git commit) + Might return "<denied>" in case of special security settings + in the hypervisor. + +What: /sys/hypervisor/properties/features +Date: March 2009 +KernelVersion: 2.6.30 +Contact: xen-devel@lists.xenproject.org +Description: If running under Xen: + Features the Xen hypervisor supports for the guest as defined + in include/xen/interface/features.h printed as a hex value. + +What: /sys/hypervisor/properties/pagesize +Date: March 2009 +KernelVersion: 2.6.30 +Contact: xen-devel@lists.xenproject.org +Description: If running under Xen: + Default page size of the hypervisor printed as a hex value. + Might return "0" in case of special security settings + in the hypervisor. + +What: /sys/hypervisor/properties/virtual_start +Date: March 2009 +KernelVersion: 2.6.30 +Contact: xen-devel@lists.xenproject.org +Description: If running under Xen: + Virtual address of the hypervisor as a hex value. + +What: /sys/hypervisor/type +Date: March 2009 +KernelVersion: 2.6.30 +Contact: xen-devel@lists.xenproject.org +Description: If running under Xen: + Type of hypervisor: + "xen": Xen hypervisor + +What: /sys/hypervisor/uuid +Date: March 2009 +KernelVersion: 2.6.30 +Contact: xen-devel@lists.xenproject.org +Description: If running under Xen: + UUID of the guest as known to the Xen hypervisor. + +What: /sys/hypervisor/version/extra +Date: March 2009 +KernelVersion: 2.6.30 +Contact: xen-devel@lists.xenproject.org +Description: If running under Xen: + The Xen version is in the format <major>.<minor><extra> + This is the <extra> part of it. + Might return "<denied>" in case of special security settings + in the hypervisor. + +What: /sys/hypervisor/version/major +Date: March 2009 +KernelVersion: 2.6.30 +Contact: xen-devel@lists.xenproject.org +Description: If running under Xen: + The Xen version is in the format <major>.<minor><extra> + This is the <major> part of it. + +What: /sys/hypervisor/version/minor +Date: March 2009 +KernelVersion: 2.6.30 +Contact: xen-devel@lists.xenproject.org +Description: If running under Xen: + The Xen version is in the format <major>.<minor><extra> + This is the <minor> part of it. diff --git a/Documentation/ABI/testing/configfs-usb-gadget-uac1 b/Documentation/ABI/testing/configfs-usb-gadget-uac1 index 8ba9a123316e..abfe447c848f 100644 --- a/Documentation/ABI/testing/configfs-usb-gadget-uac1 +++ b/Documentation/ABI/testing/configfs-usb-gadget-uac1 @@ -1,12 +1,14 @@ What: /config/usb-gadget/gadget/functions/uac1.name -Date: Sep 2014 -KernelVersion: 3.18 +Date: June 2017 +KernelVersion: 4.14 Description: The attributes: - audio_buf_size - audio buffer size - fn_cap - capture pcm device file name - fn_cntl - control device file name - fn_play - playback pcm device file name - req_buf_size - ISO OUT endpoint request buffer size - req_count - ISO OUT endpoint request count + c_chmask - capture channel mask + c_srate - capture sampling rate + c_ssize - capture sample size (bytes) + p_chmask - playback channel mask + p_srate - playback sampling rate + p_ssize - playback sample size (bytes) + req_number - the number of pre-allocated request + for both capture and playback diff --git a/Documentation/ABI/testing/configfs-usb-gadget-uac1_legacy b/Documentation/ABI/testing/configfs-usb-gadget-uac1_legacy new file mode 100644 index 000000000000..b2eaefd9bc49 --- /dev/null +++ b/Documentation/ABI/testing/configfs-usb-gadget-uac1_legacy @@ -0,0 +1,12 @@ +What: /config/usb-gadget/gadget/functions/uac1_legacy.name +Date: Sep 2014 +KernelVersion: 3.18 +Description: + The attributes: + + audio_buf_size - audio buffer size + fn_cap - capture pcm device file name + fn_cntl - control device file name + fn_play - playback pcm device file name + req_buf_size - ISO OUT endpoint request buffer size + req_count - ISO OUT endpoint request count diff --git a/Documentation/ABI/testing/ima_policy b/Documentation/ABI/testing/ima_policy index bb0f9a135e21..e76432b9954d 100644 --- a/Documentation/ABI/testing/ima_policy +++ b/Documentation/ABI/testing/ima_policy @@ -34,9 +34,10 @@ Description: fsuuid:= file system UUID (e.g 8bcbe394-4f13-4144-be8e-5aa9ea2ce2f6) uid:= decimal value euid:= decimal value - fowner:=decimal value + fowner:= decimal value lsm: are LSM specific option: appraise_type:= [imasig] + pcr:= decimal value default policy: # PROC_SUPER_MAGIC @@ -96,3 +97,8 @@ Description: Smack: measure subj_user=_ func=FILE_CHECK mask=MAY_READ + + Example of measure rules using alternate PCRs: + + measure func=KEXEC_KERNEL_CHECK pcr=4 + measure func=KEXEC_INITRAMFS_CHECK pcr=5 diff --git a/Documentation/ABI/testing/sysfs-bus-fsi b/Documentation/ABI/testing/sysfs-bus-fsi new file mode 100644 index 000000000000..57c806350d6c --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-fsi @@ -0,0 +1,38 @@ +What: /sys/bus/platform/devices/fsi-master/rescan +Date: May 2017 +KernelVersion: 4.12 +Contact: cbostic@linux.vnet.ibm.com +Description: + Initiates a FSI master scan for all connected slave devices + on its links. + +What: /sys/bus/platform/devices/fsi-master/break +Date: May 2017 +KernelVersion: 4.12 +Contact: cbostic@linux.vnet.ibm.com +Description: + Sends an FSI BREAK command on a master's communication + link to any connnected slaves. A BREAK resets connected + device's logic and preps it to receive further commands + from the master. + +What: /sys/bus/platform/devices/fsi-master/slave@00:00/term +Date: May 2017 +KernelVersion: 4.12 +Contact: cbostic@linux.vnet.ibm.com +Description: + Sends an FSI terminate command from the master to its + connected slave. A terminate resets the slave's state machines + that control access to the internally connected engines. In + addition the slave freezes its internal error register for + debugging purposes. This command is also needed to abort any + ongoing operation in case of an expired 'Master Time Out' + timer. + +What: /sys/bus/platform/devices/fsi-master/slave@00:00/raw +Date: May 2017 +KernelVersion: 4.12 +Contact: cbostic@linux.vnet.ibm.com +Description: + Provides a means of reading/writing a 32 bit value from/to a + specified FSI bus address. diff --git a/Documentation/ABI/testing/sysfs-bus-iio b/Documentation/ABI/testing/sysfs-bus-iio index 8c24d0892f61..2db2cdf42d54 100644 --- a/Documentation/ABI/testing/sysfs-bus-iio +++ b/Documentation/ABI/testing/sysfs-bus-iio @@ -1425,6 +1425,17 @@ Description: guarantees that the hardware fifo is flushed to the device buffer. +What: /sys/bus/iio/devices/iio:device*/buffer/hwfifo_timeout +KernelVersion: 4.12 +Contact: linux-iio@vger.kernel.org +Description: + A read/write property to provide capability to delay reporting of + samples till a timeout is reached. This allows host processors to + sleep, while the sensor is storing samples in its internal fifo. + The maximum timeout in seconds can be specified by setting + hwfifo_timeout.The current delay can be read by reading + hwfifo_timeout. A value of 0 means that there is no timeout. + What: /sys/bus/iio/devices/iio:deviceX/buffer/hwfifo_watermark KernelVersion: 4.2 Contact: linux-iio@vger.kernel.org diff --git a/Documentation/ABI/testing/sysfs-bus-iio-meas-spec b/Documentation/ABI/testing/sysfs-bus-iio-meas-spec index 1a6265e92e2f..6d47e548eee5 100644 --- a/Documentation/ABI/testing/sysfs-bus-iio-meas-spec +++ b/Documentation/ABI/testing/sysfs-bus-iio-meas-spec @@ -5,4 +5,3 @@ Description: Reading returns either '1' or '0'. '1' means that the battery level supplied to sensor is below 2.25V. This ABI is available for tsys02d, htu21, ms8607 - This ABI is available for htu21, ms8607 diff --git a/Documentation/ABI/testing/sysfs-bus-iio-timer-stm32 b/Documentation/ABI/testing/sysfs-bus-iio-timer-stm32 index 230020e06677..161c147d3c40 100644 --- a/Documentation/ABI/testing/sysfs-bus-iio-timer-stm32 +++ b/Documentation/ABI/testing/sysfs-bus-iio-timer-stm32 @@ -16,6 +16,54 @@ Description: - "OC2REF" : OC2REF signal is used as trigger output. - "OC3REF" : OC3REF signal is used as trigger output. - "OC4REF" : OC4REF signal is used as trigger output. + Additional modes (on TRGO2 only): + - "OC5REF" : OC5REF signal is used as trigger output. + - "OC6REF" : OC6REF signal is used as trigger output. + - "compare_pulse_OC4REF": + OC4REF rising or falling edges generate pulses. + - "compare_pulse_OC6REF": + OC6REF rising or falling edges generate pulses. + - "compare_pulse_OC4REF_r_or_OC6REF_r": + OC4REF or OC6REF rising edges generate pulses. + - "compare_pulse_OC4REF_r_or_OC6REF_f": + OC4REF rising or OC6REF falling edges generate pulses. + - "compare_pulse_OC5REF_r_or_OC6REF_r": + OC5REF or OC6REF rising edges generate pulses. + - "compare_pulse_OC5REF_r_or_OC6REF_f": + OC5REF rising or OC6REF falling edges generate pulses. + + +-----------+ +-------------+ +---------+ + | Prescaler +-> | Counter | +-> | Master | TRGO(2) + +-----------+ +--+--------+-+ |-> | Control +--> + | | || +---------+ + +--v--------+-+ OCxREF || +---------+ + | Chx compare +----------> | Output | ChX + +-----------+-+ | | Control +--> + . | | +---------+ + . | | . + +-----------v-+ OC6REF | . + | Ch6 compare +---------+> + +-------------+ + + Example with: "compare_pulse_OC4REF_r_or_OC6REF_r": + + X + X X + X . . X + X . . X + X . . X + count X . . . . X + . . . . + . . . . + +---------------+ + OC4REF | . . | + +-+ . . +-+ + . +---+ . + OC6REF . | | . + +-------+ +-------+ + +-+ +-+ + TRGO2 | | | | + +-+ +---+ +---------+ What: /sys/bus/iio/devices/triggerX/master_mode KernelVersion: 4.11 @@ -90,3 +138,18 @@ Description: Counting is enabled on rising edge of the connected trigger, and remains enabled for the duration of this selected mode. + +What: /sys/bus/iio/devices/iio:deviceX/in_count_trigger_mode_available +KernelVersion: 4.13 +Contact: benjamin.gaignard@st.com +Description: + Reading returns the list possible trigger modes. + +What: /sys/bus/iio/devices/iio:deviceX/in_count0_trigger_mode +KernelVersion: 4.13 +Contact: benjamin.gaignard@st.com +Description: + Configure the device counter trigger mode + counting direction is set by in_count0_count_direction + attribute and the counter is clocked by the connected trigger + rising edges. diff --git a/Documentation/ABI/testing/sysfs-bus-thunderbolt b/Documentation/ABI/testing/sysfs-bus-thunderbolt new file mode 100644 index 000000000000..2a98149943ea --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-thunderbolt @@ -0,0 +1,110 @@ +What: /sys/bus/thunderbolt/devices/.../domainX/security +Date: Sep 2017 +KernelVersion: 4.13 +Contact: thunderbolt-software@lists.01.org +Description: This attribute holds current Thunderbolt security level + set by the system BIOS. Possible values are: + + none: All devices are automatically authorized + user: Devices are only authorized based on writing + appropriate value to the authorized attribute + secure: Require devices that support secure connect at + minimum. User needs to authorize each device. + dponly: Automatically tunnel Display port (and USB). No + PCIe tunnels are created. + +What: /sys/bus/thunderbolt/devices/.../authorized +Date: Sep 2017 +KernelVersion: 4.13 +Contact: thunderbolt-software@lists.01.org +Description: This attribute is used to authorize Thunderbolt devices + after they have been connected. If the device is not + authorized, no devices such as PCIe and Display port are + available to the system. + + Contents of this attribute will be 0 when the device is not + yet authorized. + + Possible values are supported: + 1: The device will be authorized and connected + + When key attribute contains 32 byte hex string the possible + values are: + 1: The 32 byte hex string is added to the device NVM and + the device is authorized. + 2: Send a challenge based on the 32 byte hex string. If the + challenge response from device is valid, the device is + authorized. In case of failure errno will be ENOKEY if + the device did not contain a key at all, and + EKEYREJECTED if the challenge response did not match. + +What: /sys/bus/thunderbolt/devices/.../key +Date: Sep 2017 +KernelVersion: 4.13 +Contact: thunderbolt-software@lists.01.org +Description: When a devices supports Thunderbolt secure connect it will + have this attribute. Writing 32 byte hex string changes + authorization to use the secure connection method instead. + +What: /sys/bus/thunderbolt/devices/.../device +Date: Sep 2017 +KernelVersion: 4.13 +Contact: thunderbolt-software@lists.01.org +Description: This attribute contains id of this device extracted from + the device DROM. + +What: /sys/bus/thunderbolt/devices/.../device_name +Date: Sep 2017 +KernelVersion: 4.13 +Contact: thunderbolt-software@lists.01.org +Description: This attribute contains name of this device extracted from + the device DROM. + +What: /sys/bus/thunderbolt/devices/.../vendor +Date: Sep 2017 +KernelVersion: 4.13 +Contact: thunderbolt-software@lists.01.org +Description: This attribute contains vendor id of this device extracted + from the device DROM. + +What: /sys/bus/thunderbolt/devices/.../vendor_name +Date: Sep 2017 +KernelVersion: 4.13 +Contact: thunderbolt-software@lists.01.org +Description: This attribute contains vendor name of this device extracted + from the device DROM. + +What: /sys/bus/thunderbolt/devices/.../unique_id +Date: Sep 2017 +KernelVersion: 4.13 +Contact: thunderbolt-software@lists.01.org +Description: This attribute contains unique_id string of this device. + This is either read from hardware registers (UUID on + newer hardware) or based on UID from the device DROM. + Can be used to uniquely identify particular device. + +What: /sys/bus/thunderbolt/devices/.../nvm_version +Date: Sep 2017 +KernelVersion: 4.13 +Contact: thunderbolt-software@lists.01.org +Description: If the device has upgradeable firmware the version + number is available here. Format: %x.%x, major.minor. + If the device is in safe mode reading the file returns + -ENODATA instead as the NVM version is not available. + +What: /sys/bus/thunderbolt/devices/.../nvm_authenticate +Date: Sep 2017 +KernelVersion: 4.13 +Contact: thunderbolt-software@lists.01.org +Description: When new NVM image is written to the non-active NVM + area (through non_activeX NVMem device), the + authentication procedure is started by writing 1 to + this file. If everything goes well, the device is + restarted with the new NVM firmware. If the image + verification fails an error code is returned instead. + + When read holds status of the last authentication + operation if an error occurred during the process. This + is directly the status value from the DMA configuration + based mailbox before the device is power cycled. Writing + 0 here clears the status. diff --git a/Documentation/ABI/testing/sysfs-class-mtd b/Documentation/ABI/testing/sysfs-class-mtd index 3b5c3bca9186..f34e592301d1 100644 --- a/Documentation/ABI/testing/sysfs-class-mtd +++ b/Documentation/ABI/testing/sysfs-class-mtd @@ -229,6 +229,6 @@ KernelVersion: 4.1 Contact: linux-mtd@lists.infradead.org Description: For a partition, the offset of that partition from the start - of the master device in bytes. This attribute is absent on - main devices, so it can be used to distinguish between - partitions and devices that aren't partitions. + of the parent (another partition or a flash device) in bytes. + This attribute is absent on flash devices, so it can be used + to distinguish them from partitions. diff --git a/Documentation/ABI/testing/sysfs-class-mux b/Documentation/ABI/testing/sysfs-class-mux new file mode 100644 index 000000000000..8715f9c7bd4f --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-mux @@ -0,0 +1,16 @@ +What: /sys/class/mux/ +Date: April 2017 +KernelVersion: 4.13 +Contact: Peter Rosin <peda@axentia.se> +Description: + The mux/ class sub-directory belongs to the Generic MUX + Framework and provides a sysfs interface for using MUX + controllers. + +What: /sys/class/mux/muxchipN/ +Date: April 2017 +KernelVersion: 4.13 +Contact: Peter Rosin <peda@axentia.se> +Description: + A /sys/class/mux/muxchipN directory is created for each + probed MUX chip where N is a simple enumeration. diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net index 668604fc8e06..6856da99b6f7 100644 --- a/Documentation/ABI/testing/sysfs-class-net +++ b/Documentation/ABI/testing/sysfs-class-net @@ -251,3 +251,11 @@ Contact: netdev@vger.kernel.org Description: Indicates the unique physical switch identifier of a switch this port belongs to, as a string. + +What: /sys/class/net/<iface>/phydev +Date: May 2017 +KernelVersion: 4.13 +Contact: netdev@vger.kernel.org +Description: + Symbolic link to the PHY device this network device is attached + to. diff --git a/Documentation/ABI/testing/sysfs-class-net-phydev b/Documentation/ABI/testing/sysfs-class-net-phydev new file mode 100644 index 000000000000..6ebabfb27912 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-net-phydev @@ -0,0 +1,36 @@ +What: /sys/class/mdio_bus/<bus>/<device>/attached_dev +Date: May 2017 +KernelVersion: 4.13 +Contact: netdev@vger.kernel.org +Description: + Symbolic link to the network device this PHY device is + attached to. + +What: /sys/class/mdio_bus/<bus>/<device>/phy_has_fixups +Date: February 2014 +KernelVersion: 3.15 +Contact: netdev@vger.kernel.org +Description: + Boolean value indicating whether the PHY device has + any fixups registered against it (phy_register_fixup) + +What: /sys/class/mdio_bus/<bus>/<device>/phy_id +Date: November 2012 +KernelVersion: 3.8 +Contact: netdev@vger.kernel.org +Description: + 32-bit hexadecimal value corresponding to the PHY device's OUI, + model and revision number. + +What: /sys/class/mdio_bus/<bus>/<device>/phy_interface +Date: February 2014 +KernelVersion: 3.15 +Contact: netdev@vger.kernel.org +Description: + String value indicating the PHY interface, possible + values are:. + <empty> (not available), mii, gmii, sgmii, tbi, rev-mii, + rmii, rgmii, rgmii-id, rgmii-rxid, rgmii-txid, rtbi, smii + xgmii, moca, qsgmii, trgmii, 1000base-x, 2500base-x, rxaui, + xaui, 10gbase-kr, unknown + diff --git a/Documentation/ABI/testing/sysfs-class-power-twl4030 b/Documentation/ABI/testing/sysfs-class-power-twl4030 index be26af0f1895..b4fd32d210c5 100644 --- a/Documentation/ABI/testing/sysfs-class-power-twl4030 +++ b/Documentation/ABI/testing/sysfs-class-power-twl4030 @@ -1,20 +1,3 @@ -What: /sys/class/power_supply/twl4030_ac/max_current - /sys/class/power_supply/twl4030_usb/max_current -Description: - Read/Write limit on current which may - be drawn from the ac (Accessory Charger) or - USB port. - - Value is in micro-Amps. - - Value is set automatically to an appropriate - value when a cable is plugged or unplugged. - - Value can the set by writing to the attribute. - The change will only persist until the next - plug event. These event are reported via udev. - - What: /sys/class/power_supply/twl4030_usb/mode Description: Changing mode for USB port. diff --git a/Documentation/ABI/testing/sysfs-class-typec b/Documentation/ABI/testing/sysfs-class-typec index d4a3d23eb09c..5be552e255e9 100644 --- a/Documentation/ABI/testing/sysfs-class-typec +++ b/Documentation/ABI/testing/sysfs-class-typec @@ -30,6 +30,21 @@ Description: Valid values: source, sink +What: /sys/class/typec/<port>/port_type +Date: May 2017 +Contact: Badhri Jagan Sridharan <Badhri@google.com> +Description: + Indicates the type of the port. This attribute can be used for + requesting a change in the port type. Port type change is + supported as a synchronous operation, so write(2) to the + attribute will not return until the operation has finished. + + Valid values: + - source (The port will behave as source only DFP port) + - sink (The port will behave as sink only UFP port) + - dual (The port will behave as dual-role-data and + dual-role-power port) + What: /sys/class/typec/<port>/vconn_source Date: April 2017 Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com> diff --git a/Documentation/ABI/testing/sysfs-firmware-ofw b/Documentation/ABI/testing/sysfs-firmware-ofw index f562b188e71d..edcab3ccfcc0 100644 --- a/Documentation/ABI/testing/sysfs-firmware-ofw +++ b/Documentation/ABI/testing/sysfs-firmware-ofw @@ -1,6 +1,6 @@ What: /sys/firmware/devicetree/* Date: November 2013 -Contact: Grant Likely <grant.likely@linaro.org> +Contact: Grant Likely <grant.likely@arm.com>, devicetree@vger.kernel.org Description: When using OpenFirmware or a Flattened Device Tree to enumerate hardware, the device tree structure will be exposed in this @@ -26,3 +26,27 @@ Description: name plus address). Properties are represented as files in the directory. The contents of each file is the exact binary data from the device tree. + +What: /sys/firmware/fdt +Date: February 2015 +KernelVersion: 3.19 +Contact: Frank Rowand <frowand.list@gmail.com>, devicetree@vger.kernel.org +Description: + Exports the FDT blob that was passed to the kernel by + the bootloader. This allows userland applications such + as kexec to access the raw binary. This blob is also + useful when debugging since it contains any changes + made to the blob by the bootloader. + + The fact that this node does not reside under + /sys/firmware/device-tree is deliberate: FDT is also used + on arm64 UEFI/ACPI systems to communicate just the UEFI + and ACPI entry points, but the FDT is never unflattened + and used to configure the system. + + A CRC32 checksum is calculated over the entire FDT + blob, and verified at late_initcall time. The sysfs + entry is instantiated only if the checksum is valid, + i.e., if the FDT blob has not been modified in the mean + time. Otherwise, a warning is printed. +Users: kexec, debugging diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs b/Documentation/ABI/testing/sysfs-fs-f2fs index a809f6005f14..84c606fb3ca4 100644 --- a/Documentation/ABI/testing/sysfs-fs-f2fs +++ b/Documentation/ABI/testing/sysfs-fs-f2fs @@ -75,7 +75,7 @@ Contact: "Jaegeuk Kim" <jaegeuk.kim@samsung.com> Description: Controls the memory footprint used by f2fs. -What: /sys/fs/f2fs/<disk>/trim_sections +What: /sys/fs/f2fs/<disk>/batched_trim_sections Date: February 2015 Contact: "Jaegeuk Kim" <jaegeuk@kernel.org> Description: @@ -112,3 +112,21 @@ Date: January 2016 Contact: "Shuoran Liu" <liushuoran@huawei.com> Description: Shows total written kbytes issued to disk. + +What: /sys/fs/f2fs/<disk>/inject_rate +Date: May 2016 +Contact: "Sheng Yong" <shengyong1@huawei.com> +Description: + Controls the injection rate. + +What: /sys/fs/f2fs/<disk>/inject_type +Date: May 2016 +Contact: "Sheng Yong" <shengyong1@huawei.com> +Description: + Controls the injection type. + +What: /sys/fs/f2fs/<disk>/reserved_blocks +Date: June 2017 +Contact: "Chao Yu" <yuchao0@huawei.com> +Description: + Controls current reserved blocks in system. diff --git a/Documentation/ABI/testing/sysfs-hypervisor-pmu b/Documentation/ABI/testing/sysfs-hypervisor-xen index 224faa105e18..53b7b2ea7515 100644 --- a/Documentation/ABI/testing/sysfs-hypervisor-pmu +++ b/Documentation/ABI/testing/sysfs-hypervisor-xen @@ -1,8 +1,19 @@ +What: /sys/hypervisor/guest_type +Date: June 2017 +KernelVersion: 4.13 +Contact: xen-devel@lists.xenproject.org +Description: If running under Xen: + Type of guest: + "Xen": standard guest type on arm + "HVM": fully virtualized guest (x86) + "PV": paravirtualized guest (x86) + "PVH": fully virtualized guest without legacy emulation (x86) + What: /sys/hypervisor/pmu/pmu_mode Date: August 2015 KernelVersion: 4.3 Contact: Boris Ostrovsky <boris.ostrovsky@oracle.com> -Description: +Description: If running under Xen: Describes mode that Xen's performance-monitoring unit (PMU) uses. Accepted values are "off" -- PMU is disabled @@ -17,7 +28,16 @@ What: /sys/hypervisor/pmu/pmu_features Date: August 2015 KernelVersion: 4.3 Contact: Boris Ostrovsky <boris.ostrovsky@oracle.com> -Description: +Description: If running under Xen: Describes Xen PMU features (as an integer). A set bit indicates that the corresponding feature is enabled. See include/xen/interface/xenpmu.h for available features + +What: /sys/hypervisor/properties/buildid +Date: June 2017 +KernelVersion: 4.13 +Contact: xen-devel@lists.xenproject.org +Description: If running under Xen: + Build id of the hypervisor, needed for hypervisor live patching. + Might return "<denied>" in case of special security settings + in the hypervisor. diff --git a/Documentation/ABI/testing/sysfs-platform-ideapad-laptop b/Documentation/ABI/testing/sysfs-platform-ideapad-laptop index b31e782bd985..597a2f3d1efc 100644 --- a/Documentation/ABI/testing/sysfs-platform-ideapad-laptop +++ b/Documentation/ABI/testing/sysfs-platform-ideapad-laptop @@ -17,3 +17,11 @@ Description: * 2 -> Dust Cleaning * 4 -> Efficient Thermal Dissipation Mode +What: /sys/devices/platform/ideapad/touchpad +Date: May 2017 +KernelVersion: 4.13 +Contact: "Ritesh Raj Sarraf <rrs@debian.org>" +Description: + Control touchpad mode. + * 1 -> Switched On + * 0 -> Switched Off diff --git a/Documentation/ABI/testing/sysfs-uevent b/Documentation/ABI/testing/sysfs-uevent new file mode 100644 index 000000000000..aa39f8d7bcdf --- /dev/null +++ b/Documentation/ABI/testing/sysfs-uevent @@ -0,0 +1,47 @@ +What: /sys/.../uevent +Date: May 2017 +KernelVersion: 4.13 +Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org> +Description: + Enable passing additional variables for synthetic uevents that + are generated by writing /sys/.../uevent file. + + Recognized extended format is ACTION [UUID [KEY=VALUE ...]. + + The ACTION is compulsory - it is the name of the uevent action + ("add", "change", "remove"). There is no change compared to + previous functionality here. The rest of the extended format + is optional. + + You need to pass UUID first before any KEY=VALUE pairs. + The UUID must be in "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" + format where 'x' is a hex digit. The UUID is considered to be + a transaction identifier so it's possible to use the same UUID + value for one or more synthetic uevents in which case we + logically group these uevents together for any userspace + listeners. The UUID value appears in uevent as + "SYNTH_UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" environment + variable. + + If UUID is not passed in, the generated synthetic uevent gains + "SYNTH_UUID=0" environment variable automatically. + + The KEY=VALUE pairs can contain alphanumeric characters only. + It's possible to define zero or more pairs - each pair is then + delimited by a space character ' '. Each pair appears in + synthetic uevent as "SYNTH_ARG_KEY=VALUE". That means the KEY + name gains "SYNTH_ARG_" prefix to avoid possible collisions + with existing variables. + + Example of valid sequence written to the uevent file: + + add fe4d7c9d-b8c6-4a70-9ef1-3d8a58d18eed A=1 B=abc + + This generates synthetic uevent including these variables: + + ACTION=add + SYNTH_ARG_A=1 + SYNTH_ARG_B=abc + SYNTH_UUID=fe4d7c9d-b8c6-4a70-9ef1-3d8a58d18eed +Users: + udev, userspace tools generating synthetic uevents diff --git a/Documentation/DMA-API-HOWTO.txt b/Documentation/DMA-API-HOWTO.txt index 979228bc9035..f0cc3f772265 100644 --- a/Documentation/DMA-API-HOWTO.txt +++ b/Documentation/DMA-API-HOWTO.txt @@ -1,22 +1,24 @@ - Dynamic DMA mapping Guide - ========================= +========================= +Dynamic DMA mapping Guide +========================= - David S. Miller <davem@redhat.com> - Richard Henderson <rth@cygnus.com> - Jakub Jelinek <jakub@redhat.com> +:Author: David S. Miller <davem@redhat.com> +:Author: Richard Henderson <rth@cygnus.com> +:Author: Jakub Jelinek <jakub@redhat.com> This is a guide to device driver writers on how to use the DMA API with example pseudo-code. For a concise description of the API, see DMA-API.txt. - CPU and DMA addresses +CPU and DMA addresses +===================== There are several kinds of addresses involved in the DMA API, and it's important to understand the differences. The kernel normally uses virtual addresses. Any address returned by kmalloc(), vmalloc(), and similar interfaces is a virtual address and can -be stored in a "void *". +be stored in a ``void *``. The virtual memory system (TLB, page tables, etc.) translates virtual addresses to CPU physical addresses, which are stored as "phys_addr_t" or @@ -37,7 +39,7 @@ be restricted to a subset of that space. For example, even if a system supports 64-bit addresses for main memory and PCI BARs, it may use an IOMMU so devices only need to use 32-bit DMA addresses. -Here's a picture and some examples: +Here's a picture and some examples:: CPU CPU Bus Virtual Physical Address @@ -98,15 +100,16 @@ microprocessor architecture. You should use the DMA API rather than the bus-specific DMA API, i.e., use the dma_map_*() interfaces rather than the pci_map_*() interfaces. -First of all, you should make sure +First of all, you should make sure:: -#include <linux/dma-mapping.h> + #include <linux/dma-mapping.h> is in your driver, which provides the definition of dma_addr_t. This type can hold any valid DMA address for the platform and should be used everywhere you hold a DMA address returned from the DMA mapping functions. - What memory is DMA'able? +What memory is DMA'able? +======================== The first piece of information you must know is what kernel memory can be used with the DMA mapping facilities. There has been an unwritten @@ -143,7 +146,8 @@ What about block I/O and networking buffers? The block I/O and networking subsystems make sure that the buffers they use are valid for you to DMA from/to. - DMA addressing limitations +DMA addressing limitations +========================== Does your device have any DMA addressing limitations? For example, is your device only capable of driving the low order 24-bits of address? @@ -166,7 +170,7 @@ style to do this even if your device holds the default setting, because this shows that you did think about these issues wrt. your device. -The query is performed via a call to dma_set_mask_and_coherent(): +The query is performed via a call to dma_set_mask_and_coherent():: int dma_set_mask_and_coherent(struct device *dev, u64 mask); @@ -175,12 +179,12 @@ If you have some special requirements, then the following two separate queries can be used instead: The query for streaming mappings is performed via a call to - dma_set_mask(): + dma_set_mask():: int dma_set_mask(struct device *dev, u64 mask); The query for consistent allocations is performed via a call - to dma_set_coherent_mask(): + to dma_set_coherent_mask():: int dma_set_coherent_mask(struct device *dev, u64 mask); @@ -209,7 +213,7 @@ of your driver reports that performance is bad or that the device is not even detected, you can ask them for the kernel messages to find out exactly why. -The standard 32-bit addressing device would do something like this: +The standard 32-bit addressing device would do something like this:: if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32))) { dev_warn(dev, "mydev: No suitable DMA available\n"); @@ -225,7 +229,7 @@ than 64-bit addressing. For example, Sparc64 PCI SAC addressing is more efficient than DAC addressing. Here is how you would handle a 64-bit capable device which can drive -all 64-bits when accessing streaming DMA: +all 64-bits when accessing streaming DMA:: int using_dac; @@ -239,7 +243,7 @@ all 64-bits when accessing streaming DMA: } If a card is capable of using 64-bit consistent allocations as well, -the case would look like this: +the case would look like this:: int using_dac, consistent_using_dac; @@ -260,7 +264,7 @@ uses consistent allocations, one would have to check the return value from dma_set_coherent_mask(). Finally, if your device can only drive the low 24-bits of -address you might do something like: +address you might do something like:: if (dma_set_mask(dev, DMA_BIT_MASK(24))) { dev_warn(dev, "mydev: 24-bit DMA addressing not available\n"); @@ -280,7 +284,7 @@ only provide the functionality which the machine can handle. It is important that the last call to dma_set_mask() be for the most specific mask. -Here is pseudo-code showing how this might be done: +Here is pseudo-code showing how this might be done:: #define PLAYBACK_ADDRESS_BITS DMA_BIT_MASK(32) #define RECORD_ADDRESS_BITS DMA_BIT_MASK(24) @@ -308,7 +312,8 @@ A sound card was used as an example here because this genre of PCI devices seems to be littered with ISA chips given a PCI front end, and thus retaining the 16MB DMA addressing limitations of ISA. - Types of DMA mappings +Types of DMA mappings +===================== There are two types of DMA mappings: @@ -336,12 +341,14 @@ There are two types of DMA mappings: to memory is immediately visible to the device, and vice versa. Consistent mappings guarantee this. - IMPORTANT: Consistent DMA memory does not preclude the usage of - proper memory barriers. The CPU may reorder stores to + .. important:: + + Consistent DMA memory does not preclude the usage of + proper memory barriers. The CPU may reorder stores to consistent memory just as it may normal memory. Example: if it is important for the device to see the first word of a descriptor updated before the second, you must do - something like: + something like:: desc->word0 = address; wmb(); @@ -377,16 +384,17 @@ Also, systems with caches that aren't DMA-coherent will work better when the underlying buffers don't share cache lines with other data. - Using Consistent DMA mappings. +Using Consistent DMA mappings +============================= To allocate and map large (PAGE_SIZE or so) consistent DMA regions, -you should do: +you should do:: dma_addr_t dma_handle; cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, gfp); -where device is a struct device *. This may be called in interrupt +where device is a ``struct device *``. This may be called in interrupt context with the GFP_ATOMIC flag. Size is the length of the region you want to allocate, in bytes. @@ -415,7 +423,7 @@ exists (for example) to guarantee that if you allocate a chunk which is smaller than or equal to 64 kilobytes, the extent of the buffer you receive will not cross a 64K boundary. -To unmap and free such a DMA region, you call: +To unmap and free such a DMA region, you call:: dma_free_coherent(dev, size, cpu_addr, dma_handle); @@ -430,7 +438,7 @@ a kmem_cache, but it uses dma_alloc_coherent(), not __get_free_pages(). Also, it understands common hardware constraints for alignment, like queue heads needing to be aligned on N byte boundaries. -Create a dma_pool like this: +Create a dma_pool like this:: struct dma_pool *pool; @@ -444,7 +452,7 @@ pass 0 for boundary; passing 4096 says memory allocated from this pool must not cross 4KByte boundaries (but at that time it may be better to use dma_alloc_coherent() directly instead). -Allocate memory from a DMA pool like this: +Allocate memory from a DMA pool like this:: cpu_addr = dma_pool_alloc(pool, flags, &dma_handle); @@ -452,7 +460,7 @@ flags are GFP_KERNEL if blocking is permitted (not in_interrupt nor holding SMP locks), GFP_ATOMIC otherwise. Like dma_alloc_coherent(), this returns two values, cpu_addr and dma_handle. -Free memory that was allocated from a dma_pool like this: +Free memory that was allocated from a dma_pool like this:: dma_pool_free(pool, cpu_addr, dma_handle); @@ -460,7 +468,7 @@ where pool is what you passed to dma_pool_alloc(), and cpu_addr and dma_handle are the values dma_pool_alloc() returned. This function may be called in interrupt context. -Destroy a dma_pool by calling: +Destroy a dma_pool by calling:: dma_pool_destroy(pool); @@ -468,11 +476,12 @@ Make sure you've called dma_pool_free() for all memory allocated from a pool before you destroy the pool. This function may not be called in interrupt context. - DMA Direction +DMA Direction +============= The interfaces described in subsequent portions of this document take a DMA direction argument, which is an integer and takes on -one of the following values: +one of the following values:: DMA_BIDIRECTIONAL DMA_TO_DEVICE @@ -521,14 +530,15 @@ packets, map/unmap them with the DMA_TO_DEVICE direction specifier. For receive packets, just the opposite, map/unmap them with the DMA_FROM_DEVICE direction specifier. - Using Streaming DMA mappings +Using Streaming DMA mappings +============================ The streaming DMA mapping routines can be called from interrupt context. There are two versions of each map/unmap, one which will map/unmap a single memory region, and one which will map/unmap a scatterlist. -To map a single region, you do: +To map a single region, you do:: struct device *dev = &my_dev->dev; dma_addr_t dma_handle; @@ -545,37 +555,16 @@ To map a single region, you do: goto map_error_handling; } -and to unmap it: +and to unmap it:: dma_unmap_single(dev, dma_handle, size, direction); You should call dma_mapping_error() as dma_map_single() could fail and return -error. Not all DMA implementations support the dma_mapping_error() interface. -However, it is a good practice to call dma_mapping_error() interface, which -will invoke the generic mapping error check interface. Doing so will ensure -that the mapping code will work correctly on all DMA implementations without -any dependency on the specifics of the underlying implementation. Using the -returned address without checking for errors could result in failures ranging -from panics to silent data corruption. A couple of examples of incorrect ways -to check for errors that make assumptions about the underlying DMA -implementation are as follows and these are applicable to dma_map_page() as -well. - -Incorrect example 1: - dma_addr_t dma_handle; - - dma_handle = dma_map_single(dev, addr, size, direction); - if ((dma_handle & 0xffff != 0) || (dma_handle >= 0x1000000)) { - goto map_error; - } - -Incorrect example 2: - dma_addr_t dma_handle; - - dma_handle = dma_map_single(dev, addr, size, direction); - if (dma_handle == DMA_ERROR_CODE) { - goto map_error; - } +error. Doing so will ensure that the mapping code will work correctly on all +DMA implementations without any dependency on the specifics of the underlying +implementation. Using the returned address without checking for errors could +result in failures ranging from panics to silent data corruption. The same +applies to dma_map_page() as well. You should call dma_unmap_single() when the DMA activity is finished, e.g., from the interrupt which told you that the DMA transfer is done. @@ -584,7 +573,7 @@ Using CPU pointers like this for single mappings has a disadvantage: you cannot reference HIGHMEM memory in this way. Thus, there is a map/unmap interface pair akin to dma_{map,unmap}_single(). These interfaces deal with page/offset pairs instead of CPU pointers. -Specifically: +Specifically:: struct device *dev = &my_dev->dev; dma_addr_t dma_handle; @@ -614,7 +603,7 @@ error as outlined under the dma_map_single() discussion. You should call dma_unmap_page() when the DMA activity is finished, e.g., from the interrupt which told you that the DMA transfer is done. -With scatterlists, you map a region gathered from several regions by: +With scatterlists, you map a region gathered from several regions by:: int i, count = dma_map_sg(dev, sglist, nents, direction); struct scatterlist *sg; @@ -638,16 +627,18 @@ Then you should loop count times (note: this can be less than nents times) and use sg_dma_address() and sg_dma_len() macros where you previously accessed sg->address and sg->length as shown above. -To unmap a scatterlist, just call: +To unmap a scatterlist, just call:: dma_unmap_sg(dev, sglist, nents, direction); Again, make sure DMA activity has already finished. -PLEASE NOTE: The 'nents' argument to the dma_unmap_sg call must be - the _same_ one you passed into the dma_map_sg call, - it should _NOT_ be the 'count' value _returned_ from the - dma_map_sg call. +.. note:: + + The 'nents' argument to the dma_unmap_sg call must be + the _same_ one you passed into the dma_map_sg call, + it should _NOT_ be the 'count' value _returned_ from the + dma_map_sg call. Every dma_map_{single,sg}() call should have its dma_unmap_{single,sg}() counterpart, because the DMA address space is a shared resource and @@ -659,11 +650,11 @@ properly in order for the CPU and device to see the most up-to-date and correct copy of the DMA buffer. So, firstly, just map it with dma_map_{single,sg}(), and after each DMA -transfer call either: +transfer call either:: dma_sync_single_for_cpu(dev, dma_handle, size, direction); -or: +or:: dma_sync_sg_for_cpu(dev, sglist, nents, direction); @@ -671,17 +662,19 @@ as appropriate. Then, if you wish to let the device get at the DMA area again, finish accessing the data with the CPU, and then before actually -giving the buffer to the hardware call either: +giving the buffer to the hardware call either:: dma_sync_single_for_device(dev, dma_handle, size, direction); -or: +or:: dma_sync_sg_for_device(dev, sglist, nents, direction); as appropriate. -PLEASE NOTE: The 'nents' argument to dma_sync_sg_for_cpu() and +.. note:: + + The 'nents' argument to dma_sync_sg_for_cpu() and dma_sync_sg_for_device() must be the same passed to dma_map_sg(). It is _NOT_ the count returned by dma_map_sg(). @@ -692,7 +685,7 @@ dma_map_*() call till dma_unmap_*(), then you don't have to call the dma_sync_*() routines at all. Here is pseudo code which shows a situation in which you would need -to use the dma_sync_*() interfaces. +to use the dma_sync_*() interfaces:: my_card_setup_receive_buffer(struct my_card *cp, char *buffer, int len) { @@ -768,7 +761,8 @@ is planned to completely remove virt_to_bus() and bus_to_virt() as they are entirely deprecated. Some ports already do not provide these as it is impossible to correctly support them. - Handling Errors +Handling Errors +=============== DMA address space is limited on some architectures and an allocation failure can be determined by: @@ -776,7 +770,7 @@ failure can be determined by: - checking if dma_alloc_coherent() returns NULL or dma_map_sg returns 0 - checking the dma_addr_t returned from dma_map_single() and dma_map_page() - by using dma_mapping_error(): + by using dma_mapping_error():: dma_addr_t dma_handle; @@ -794,7 +788,8 @@ failure can be determined by: of a multiple page mapping attempt. These example are applicable to dma_map_page() as well. -Example 1: +Example 1:: + dma_addr_t dma_handle1; dma_addr_t dma_handle2; @@ -823,8 +818,12 @@ Example 1: dma_unmap_single(dma_handle1); map_error_handling1: -Example 2: (if buffers are allocated in a loop, unmap all mapped buffers when - mapping error is detected in the middle) +Example 2:: + + /* + * if buffers are allocated in a loop, unmap all mapped buffers when + * mapping error is detected in the middle + */ dma_addr_t dma_addr; dma_addr_t array[DMA_BUFFERS]; @@ -867,7 +866,8 @@ SCSI drivers must return SCSI_MLQUEUE_HOST_BUSY if the DMA mapping fails in the queuecommand hook. This means that the SCSI subsystem passes the command to the driver again later. - Optimizing Unmap State Space Consumption +Optimizing Unmap State Space Consumption +======================================== On many platforms, dma_unmap_{single,page}() is simply a nop. Therefore, keeping track of the mapping address and length is a waste @@ -879,7 +879,7 @@ Actually, instead of describing the macros one by one, we'll transform some example code. 1) Use DEFINE_DMA_UNMAP_{ADDR,LEN} in state saving structures. - Example, before: + Example, before:: struct ring_state { struct sk_buff *skb; @@ -887,7 +887,7 @@ transform some example code. __u32 len; }; - after: + after:: struct ring_state { struct sk_buff *skb; @@ -896,23 +896,23 @@ transform some example code. }; 2) Use dma_unmap_{addr,len}_set() to set these values. - Example, before: + Example, before:: ringp->mapping = FOO; ringp->len = BAR; - after: + after:: dma_unmap_addr_set(ringp, mapping, FOO); dma_unmap_len_set(ringp, len, BAR); 3) Use dma_unmap_{addr,len}() to access these values. - Example, before: + Example, before:: dma_unmap_single(dev, ringp->mapping, ringp->len, DMA_FROM_DEVICE); - after: + after:: dma_unmap_single(dev, dma_unmap_addr(ringp, mapping), @@ -923,7 +923,8 @@ It really should be self-explanatory. We treat the ADDR and LEN separately, because it is possible for an implementation to only need the address in order to perform the unmap operation. - Platform Issues +Platform Issues +=============== If you are just writing drivers for Linux and do not maintain an architecture port for the kernel, you can safely skip down @@ -949,12 +950,13 @@ to "Closing". alignment constraints (e.g. the alignment constraints about 64-bit objects). - Closing +Closing +======= This document, and the API itself, would not be in its current form without the feedback and suggestions from numerous individuals. We would like to specifically mention, in no particular order, the -following people: +following people:: Russell King <rmk@arm.linux.org.uk> Leo Dagum <dagum@barrel.engr.sgi.com> diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index 6b20128fab8a..45b29326d719 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -1,7 +1,8 @@ - Dynamic DMA mapping using the generic device - ============================================ +============================================ +Dynamic DMA mapping using the generic device +============================================ - James E.J. Bottomley <James.Bottomley@HansenPartnership.com> +:Author: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> This document describes the DMA API. For a more gentle introduction of the API (and actual examples), see Documentation/DMA-API-HOWTO.txt. @@ -12,10 +13,10 @@ machines. Unless you know that your driver absolutely has to support non-consistent platforms (this is usually only legacy platforms) you should only use the API described in part I. -Part I - dma_ API -------------------------------------- +Part I - dma_API +---------------- -To get the dma_ API, you must #include <linux/dma-mapping.h>. This +To get the dma_API, you must #include <linux/dma-mapping.h>. This provides dma_addr_t and the interfaces described below. A dma_addr_t can hold any valid DMA address for the platform. It can be @@ -26,9 +27,11 @@ address space and the DMA address space. Part Ia - Using large DMA-coherent buffers ------------------------------------------ -void * -dma_alloc_coherent(struct device *dev, size_t size, - dma_addr_t *dma_handle, gfp_t flag) +:: + + void * + dma_alloc_coherent(struct device *dev, size_t size, + dma_addr_t *dma_handle, gfp_t flag) Consistent memory is memory for which a write by either the device or the processor can immediately be read by the processor or device @@ -51,20 +54,24 @@ consolidate your requests for consistent memory as much as possible. The simplest way to do that is to use the dma_pool calls (see below). The flag parameter (dma_alloc_coherent() only) allows the caller to -specify the GFP_ flags (see kmalloc()) for the allocation (the +specify the ``GFP_`` flags (see kmalloc()) for the allocation (the implementation may choose to ignore flags that affect the location of the returned memory, like GFP_DMA). -void * -dma_zalloc_coherent(struct device *dev, size_t size, - dma_addr_t *dma_handle, gfp_t flag) +:: + + void * + dma_zalloc_coherent(struct device *dev, size_t size, + dma_addr_t *dma_handle, gfp_t flag) Wraps dma_alloc_coherent() and also zeroes the returned memory if the allocation attempt succeeded. -void -dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, - dma_addr_t dma_handle) +:: + + void + dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, + dma_addr_t dma_handle) Free a region of consistent memory you previously allocated. dev, size and dma_handle must all be the same as those passed into @@ -78,7 +85,7 @@ may only be called with IRQs enabled. Part Ib - Using small DMA-coherent buffers ------------------------------------------ -To get this part of the dma_ API, you must #include <linux/dmapool.h> +To get this part of the dma_API, you must #include <linux/dmapool.h> Many drivers need lots of small DMA-coherent memory regions for DMA descriptors or I/O buffers. Rather than allocating in units of a page @@ -88,6 +95,8 @@ not __get_free_pages(). Also, they understand common hardware constraints for alignment, like queue heads needing to be aligned on N-byte boundaries. +:: + struct dma_pool * dma_pool_create(const char *name, struct device *dev, size_t size, size_t align, size_t alloc); @@ -103,16 +112,21 @@ in bytes, and must be a power of two). If your device has no boundary crossing restrictions, pass 0 for alloc; passing 4096 says memory allocated from this pool must not cross 4KByte boundaries. +:: - void *dma_pool_zalloc(struct dma_pool *pool, gfp_t mem_flags, - dma_addr_t *handle) + void * + dma_pool_zalloc(struct dma_pool *pool, gfp_t mem_flags, + dma_addr_t *handle) Wraps dma_pool_alloc() and also zeroes the returned memory if the allocation attempt succeeded. - void *dma_pool_alloc(struct dma_pool *pool, gfp_t gfp_flags, - dma_addr_t *dma_handle); +:: + + void * + dma_pool_alloc(struct dma_pool *pool, gfp_t gfp_flags, + dma_addr_t *dma_handle); This allocates memory from the pool; the returned memory will meet the size and alignment requirements specified at creation time. Pass @@ -122,16 +136,20 @@ blocking. Like dma_alloc_coherent(), this returns two values: an address usable by the CPU, and the DMA address usable by the pool's device. +:: - void dma_pool_free(struct dma_pool *pool, void *vaddr, - dma_addr_t addr); + void + dma_pool_free(struct dma_pool *pool, void *vaddr, + dma_addr_t addr); This puts memory back into the pool. The pool is what was passed to dma_pool_alloc(); the CPU (vaddr) and DMA addresses are what were returned when that routine allocated the memory being freed. +:: - void dma_pool_destroy(struct dma_pool *pool); + void + dma_pool_destroy(struct dma_pool *pool); dma_pool_destroy() frees the resources of the pool. It must be called in a context which can sleep. Make sure you've freed all allocated @@ -141,32 +159,40 @@ memory back to the pool before you destroy it. Part Ic - DMA addressing limitations ------------------------------------ -int -dma_set_mask_and_coherent(struct device *dev, u64 mask) +:: + + int + dma_set_mask_and_coherent(struct device *dev, u64 mask) Checks to see if the mask is possible and updates the device streaming and coherent DMA mask parameters if it is. Returns: 0 if successful and a negative error if not. -int -dma_set_mask(struct device *dev, u64 mask) +:: + + int + dma_set_mask(struct device *dev, u64 mask) Checks to see if the mask is possible and updates the device parameters if it is. Returns: 0 if successful and a negative error if not. -int -dma_set_coherent_mask(struct device *dev, u64 mask) +:: + + int + dma_set_coherent_mask(struct device *dev, u64 mask) Checks to see if the mask is possible and updates the device parameters if it is. Returns: 0 if successful and a negative error if not. -u64 -dma_get_required_mask(struct device *dev) +:: + + u64 + dma_get_required_mask(struct device *dev) This API returns the mask that the platform requires to operate efficiently. Usually this means the returned mask @@ -182,94 +208,107 @@ call to set the mask to the value returned. Part Id - Streaming DMA mappings -------------------------------- -dma_addr_t -dma_map_single(struct device *dev, void *cpu_addr, size_t size, - enum dma_data_direction direction) +:: + + dma_addr_t + dma_map_single(struct device *dev, void *cpu_addr, size_t size, + enum dma_data_direction direction) Maps a piece of processor virtual memory so it can be accessed by the device and returns the DMA address of the memory. The direction for both APIs may be converted freely by casting. -However the dma_ API uses a strongly typed enumerator for its +However the dma_API uses a strongly typed enumerator for its direction: +======================= ============================================= DMA_NONE no direction (used for debugging) DMA_TO_DEVICE data is going from the memory to the device DMA_FROM_DEVICE data is coming from the device to the memory DMA_BIDIRECTIONAL direction isn't known +======================= ============================================= + +.. note:: + + Not all memory regions in a machine can be mapped by this API. + Further, contiguous kernel virtual space may not be contiguous as + physical memory. Since this API does not provide any scatter/gather + capability, it will fail if the user tries to map a non-physically + contiguous piece of memory. For this reason, memory to be mapped by + this API should be obtained from sources which guarantee it to be + physically contiguous (like kmalloc). + + Further, the DMA address of the memory must be within the + dma_mask of the device (the dma_mask is a bit mask of the + addressable region for the device, i.e., if the DMA address of + the memory ANDed with the dma_mask is still equal to the DMA + address, then the device can perform DMA to the memory). To + ensure that the memory allocated by kmalloc is within the dma_mask, + the driver may specify various platform-dependent flags to restrict + the DMA address range of the allocation (e.g., on x86, GFP_DMA + guarantees to be within the first 16MB of available DMA addresses, + as required by ISA devices). + + Note also that the above constraints on physical contiguity and + dma_mask may not apply if the platform has an IOMMU (a device which + maps an I/O DMA address to a physical memory address). However, to be + portable, device driver writers may *not* assume that such an IOMMU + exists. + +.. warning:: + + Memory coherency operates at a granularity called the cache + line width. In order for memory mapped by this API to operate + correctly, the mapped region must begin exactly on a cache line + boundary and end exactly on one (to prevent two separately mapped + regions from sharing a single cache line). Since the cache line size + may not be known at compile time, the API will not enforce this + requirement. Therefore, it is recommended that driver writers who + don't take special care to determine the cache line size at run time + only map virtual regions that begin and end on page boundaries (which + are guaranteed also to be cache line boundaries). + + DMA_TO_DEVICE synchronisation must be done after the last modification + of the memory region by the software and before it is handed off to + the device. Once this primitive is used, memory covered by this + primitive should be treated as read-only by the device. If the device + may write to it at any point, it should be DMA_BIDIRECTIONAL (see + below). + + DMA_FROM_DEVICE synchronisation must be done before the driver + accesses data that may be changed by the device. This memory should + be treated as read-only by the driver. If the driver needs to write + to it at any point, it should be DMA_BIDIRECTIONAL (see below). + + DMA_BIDIRECTIONAL requires special handling: it means that the driver + isn't sure if the memory was modified before being handed off to the + device and also isn't sure if the device will also modify it. Thus, + you must always sync bidirectional memory twice: once before the + memory is handed off to the device (to make sure all memory changes + are flushed from the processor) and once before the data may be + accessed after being used by the device (to make sure any processor + cache lines are updated with data that the device may have changed). + +:: -Notes: Not all memory regions in a machine can be mapped by this API. -Further, contiguous kernel virtual space may not be contiguous as -physical memory. Since this API does not provide any scatter/gather -capability, it will fail if the user tries to map a non-physically -contiguous piece of memory. For this reason, memory to be mapped by -this API should be obtained from sources which guarantee it to be -physically contiguous (like kmalloc). - -Further, the DMA address of the memory must be within the -dma_mask of the device (the dma_mask is a bit mask of the -addressable region for the device, i.e., if the DMA address of -the memory ANDed with the dma_mask is still equal to the DMA -address, then the device can perform DMA to the memory). To -ensure that the memory allocated by kmalloc is within the dma_mask, -the driver may specify various platform-dependent flags to restrict -the DMA address range of the allocation (e.g., on x86, GFP_DMA -guarantees to be within the first 16MB of available DMA addresses, -as required by ISA devices). - -Note also that the above constraints on physical contiguity and -dma_mask may not apply if the platform has an IOMMU (a device which -maps an I/O DMA address to a physical memory address). However, to be -portable, device driver writers may *not* assume that such an IOMMU -exists. - -Warnings: Memory coherency operates at a granularity called the cache -line width. In order for memory mapped by this API to operate -correctly, the mapped region must begin exactly on a cache line -boundary and end exactly on one (to prevent two separately mapped -regions from sharing a single cache line). Since the cache line size -may not be known at compile time, the API will not enforce this -requirement. Therefore, it is recommended that driver writers who -don't take special care to determine the cache line size at run time -only map virtual regions that begin and end on page boundaries (which -are guaranteed also to be cache line boundaries). - -DMA_TO_DEVICE synchronisation must be done after the last modification -of the memory region by the software and before it is handed off to -the device. Once this primitive is used, memory covered by this -primitive should be treated as read-only by the device. If the device -may write to it at any point, it should be DMA_BIDIRECTIONAL (see -below). - -DMA_FROM_DEVICE synchronisation must be done before the driver -accesses data that may be changed by the device. This memory should -be treated as read-only by the driver. If the driver needs to write -to it at any point, it should be DMA_BIDIRECTIONAL (see below). - -DMA_BIDIRECTIONAL requires special handling: it means that the driver -isn't sure if the memory was modified before being handed off to the -device and also isn't sure if the device will also modify it. Thus, -you must always sync bidirectional memory twice: once before the -memory is handed off to the device (to make sure all memory changes -are flushed from the processor) and once before the data may be -accessed after being used by the device (to make sure any processor -cache lines are updated with data that the device may have changed). - -void -dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size, - enum dma_data_direction direction) + void + dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size, + enum dma_data_direction direction) Unmaps the region previously mapped. All the parameters passed in must be identical to those passed in (and returned) by the mapping API. -dma_addr_t -dma_map_page(struct device *dev, struct page *page, - unsigned long offset, size_t size, - enum dma_data_direction direction) -void -dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size, - enum dma_data_direction direction) +:: + + dma_addr_t + dma_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, + enum dma_data_direction direction) + + void + dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size, + enum dma_data_direction direction) API for mapping and unmapping for pages. All the notes and warnings for the other mapping APIs apply here. Also, although the <offset> @@ -277,20 +316,24 @@ and <size> parameters are provided to do partial page mapping, it is recommended that you never use these unless you really know what the cache width is. -dma_addr_t -dma_map_resource(struct device *dev, phys_addr_t phys_addr, size_t size, - enum dma_data_direction dir, unsigned long attrs) +:: -void -dma_unmap_resource(struct device *dev, dma_addr_t addr, size_t size, - enum dma_data_direction dir, unsigned long attrs) + dma_addr_t + dma_map_resource(struct device *dev, phys_addr_t phys_addr, size_t size, + enum dma_data_direction dir, unsigned long attrs) + + void + dma_unmap_resource(struct device *dev, dma_addr_t addr, size_t size, + enum dma_data_direction dir, unsigned long attrs) API for mapping and unmapping for MMIO resources. All the notes and warnings for the other mapping APIs apply here. The API should only be used to map device MMIO resources, mapping of RAM is not permitted. -int -dma_mapping_error(struct device *dev, dma_addr_t dma_addr) +:: + + int + dma_mapping_error(struct device *dev, dma_addr_t dma_addr) In some circumstances dma_map_single(), dma_map_page() and dma_map_resource() will fail to create a mapping. A driver can check for these errors by testing @@ -298,9 +341,11 @@ the returned DMA address with dma_mapping_error(). A non-zero return value means the mapping could not be created and the driver should take appropriate action (e.g. reduce current DMA mapping usage or delay and try again later). +:: + int dma_map_sg(struct device *dev, struct scatterlist *sg, - int nents, enum dma_data_direction direction) + int nents, enum dma_data_direction direction) Returns: the number of DMA address segments mapped (this may be shorter than <nents> passed in if some elements of the scatter/gather list are @@ -316,7 +361,7 @@ critical that the driver do something, in the case of a block driver aborting the request or even oopsing is better than doing nothing and corrupting the filesystem. -With scatterlists, you use the resulting mapping like this: +With scatterlists, you use the resulting mapping like this:: int i, count = dma_map_sg(dev, sglist, nents, direction); struct scatterlist *sg; @@ -337,9 +382,11 @@ Then you should loop count times (note: this can be less than nents times) and use sg_dma_address() and sg_dma_len() macros where you previously accessed sg->address and sg->length as shown above. +:: + void dma_unmap_sg(struct device *dev, struct scatterlist *sg, - int nents, enum dma_data_direction direction) + int nents, enum dma_data_direction direction) Unmap the previously mapped scatter/gather list. All the parameters must be the same as those and passed in to the scatter/gather mapping @@ -348,18 +395,27 @@ API. Note: <nents> must be the number you passed in, *not* the number of DMA address entries returned. -void -dma_sync_single_for_cpu(struct device *dev, dma_addr_t dma_handle, size_t size, - enum dma_data_direction direction) -void -dma_sync_single_for_device(struct device *dev, dma_addr_t dma_handle, size_t size, - enum dma_data_direction direction) -void -dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, int nents, - enum dma_data_direction direction) -void -dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, int nents, - enum dma_data_direction direction) +:: + + void + dma_sync_single_for_cpu(struct device *dev, dma_addr_t dma_handle, + size_t size, + enum dma_data_direction direction) + + void + dma_sync_single_for_device(struct device *dev, dma_addr_t dma_handle, + size_t size, + enum dma_data_direction direction) + + void + dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg, + int nents, + enum dma_data_direction direction) + + void + dma_sync_sg_for_device(struct device *dev, struct scatterlist *sg, + int nents, + enum dma_data_direction direction) Synchronise a single contiguous or scatter/gather mapping for the CPU and device. With the sync_sg API, all the parameters must be the same @@ -367,36 +423,41 @@ as those passed into the single mapping API. With the sync_single API, you can use dma_handle and size parameters that aren't identical to those passed into the single mapping API to do a partial sync. -Notes: You must do this: -- Before reading values that have been written by DMA from the device - (use the DMA_FROM_DEVICE direction) -- After writing values that will be written to the device using DMA - (use the DMA_TO_DEVICE) direction -- before *and* after handing memory to the device if the memory is - DMA_BIDIRECTIONAL +.. note:: + + You must do this: + + - Before reading values that have been written by DMA from the device + (use the DMA_FROM_DEVICE direction) + - After writing values that will be written to the device using DMA + (use the DMA_TO_DEVICE) direction + - before *and* after handing memory to the device if the memory is + DMA_BIDIRECTIONAL See also dma_map_single(). -dma_addr_t -dma_map_single_attrs(struct device *dev, void *cpu_addr, size_t size, - enum dma_data_direction dir, - unsigned long attrs) +:: + + dma_addr_t + dma_map_single_attrs(struct device *dev, void *cpu_addr, size_t size, + enum dma_data_direction dir, + unsigned long attrs) -void -dma_unmap_single_attrs(struct device *dev, dma_addr_t dma_addr, - size_t size, enum dma_data_direction dir, - unsigned long attrs) + void + dma_unmap_single_attrs(struct device *dev, dma_addr_t dma_addr, + size_t size, enum dma_data_direction dir, + unsigned long attrs) -int -dma_map_sg_attrs(struct device *dev, struct scatterlist *sgl, - int nents, enum dma_data_direction dir, - unsigned long attrs) + int + dma_map_sg_attrs(struct device *dev, struct scatterlist *sgl, + int nents, enum dma_data_direction dir, + unsigned long attrs) -void -dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sgl, - int nents, enum dma_data_direction dir, - unsigned long attrs) + void + dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sgl, + int nents, enum dma_data_direction dir, + unsigned long attrs) The four functions above are just like the counterpart functions without the _attrs suffixes, except that they pass an optional @@ -410,37 +471,38 @@ is identical to those of the corresponding function without the _attrs suffix. As a result dma_map_single_attrs() can generally replace dma_map_single(), etc. -As an example of the use of the *_attrs functions, here's how +As an example of the use of the ``*_attrs`` functions, here's how you could pass an attribute DMA_ATTR_FOO when mapping memory -for DMA: +for DMA:: -#include <linux/dma-mapping.h> -/* DMA_ATTR_FOO should be defined in linux/dma-mapping.h and - * documented in Documentation/DMA-attributes.txt */ -... + #include <linux/dma-mapping.h> + /* DMA_ATTR_FOO should be defined in linux/dma-mapping.h and + * documented in Documentation/DMA-attributes.txt */ + ... - unsigned long attr; - attr |= DMA_ATTR_FOO; - .... - n = dma_map_sg_attrs(dev, sg, nents, DMA_TO_DEVICE, attr); - .... + unsigned long attr; + attr |= DMA_ATTR_FOO; + .... + n = dma_map_sg_attrs(dev, sg, nents, DMA_TO_DEVICE, attr); + .... Architectures that care about DMA_ATTR_FOO would check for its presence in their implementations of the mapping and unmapping -routines, e.g.: - -void whizco_dma_map_sg_attrs(struct device *dev, dma_addr_t dma_addr, - size_t size, enum dma_data_direction dir, - unsigned long attrs) -{ - .... - if (attrs & DMA_ATTR_FOO) - /* twizzle the frobnozzle */ - .... +routines, e.g.::: + + void whizco_dma_map_sg_attrs(struct device *dev, dma_addr_t dma_addr, + size_t size, enum dma_data_direction dir, + unsigned long attrs) + { + .... + if (attrs & DMA_ATTR_FOO) + /* twizzle the frobnozzle */ + .... + } -Part II - Advanced dma_ usage ------------------------------ +Part II - Advanced dma usage +---------------------------- Warning: These pieces of the DMA API should not be used in the majority of cases, since they cater for unlikely corner cases that @@ -450,9 +512,11 @@ If you don't understand how cache line coherency works between a processor and an I/O device, you should not be using this part of the API at all. -void * -dma_alloc_noncoherent(struct device *dev, size_t size, - dma_addr_t *dma_handle, gfp_t flag) +:: + + void * + dma_alloc_noncoherent(struct device *dev, size_t size, + dma_addr_t *dma_handle, gfp_t flag) Identical to dma_alloc_coherent() except that the platform will choose to return either consistent or non-consistent memory as it sees @@ -468,39 +532,49 @@ only use this API if you positively know your driver will be required to work on one of the rare (usually non-PCI) architectures that simply cannot make consistent memory. -void -dma_free_noncoherent(struct device *dev, size_t size, void *cpu_addr, - dma_addr_t dma_handle) +:: + + void + dma_free_noncoherent(struct device *dev, size_t size, void *cpu_addr, + dma_addr_t dma_handle) Free memory allocated by the nonconsistent API. All parameters must be identical to those passed in (and returned by dma_alloc_noncoherent()). -int -dma_get_cache_alignment(void) +:: + + int + dma_get_cache_alignment(void) Returns the processor cache alignment. This is the absolute minimum alignment *and* width that you must observe when either mapping memory or doing partial flushes. -Notes: This API may return a number *larger* than the actual cache -line, but it will guarantee that one or more cache lines fit exactly -into the width returned by this call. It will also always be a power -of two for easy alignment. +.. note:: -void -dma_cache_sync(struct device *dev, void *vaddr, size_t size, - enum dma_data_direction direction) + This API may return a number *larger* than the actual cache + line, but it will guarantee that one or more cache lines fit exactly + into the width returned by this call. It will also always be a power + of two for easy alignment. + +:: + + void + dma_cache_sync(struct device *dev, void *vaddr, size_t size, + enum dma_data_direction direction) Do a partial sync of memory that was allocated by dma_alloc_noncoherent(), starting at virtual address vaddr and continuing on for size. Again, you *must* observe the cache line boundaries when doing this. -int -dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr, - dma_addr_t device_addr, size_t size, int - flags) +:: + + int + dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr, + dma_addr_t device_addr, size_t size, int + flags) Declare region of memory to be handed out by dma_alloc_coherent() when it's asked for coherent memory for this device. @@ -516,21 +590,21 @@ size is the size of the area (must be multiples of PAGE_SIZE). flags can be ORed together and are: -DMA_MEMORY_MAP - request that the memory returned from -dma_alloc_coherent() be directly writable. +- DMA_MEMORY_MAP - request that the memory returned from + dma_alloc_coherent() be directly writable. -DMA_MEMORY_IO - request that the memory returned from -dma_alloc_coherent() be addressable using read()/write()/memcpy_toio() etc. +- DMA_MEMORY_IO - request that the memory returned from + dma_alloc_coherent() be addressable using read()/write()/memcpy_toio() etc. One or both of these flags must be present. -DMA_MEMORY_INCLUDES_CHILDREN - make the declared memory be allocated by -dma_alloc_coherent of any child devices of this one (for memory residing -on a bridge). +- DMA_MEMORY_INCLUDES_CHILDREN - make the declared memory be allocated by + dma_alloc_coherent of any child devices of this one (for memory residing + on a bridge). -DMA_MEMORY_EXCLUSIVE - only allocate memory from the declared regions. -Do not allow dma_alloc_coherent() to fall back to system memory when -it's out of memory in the declared region. +- DMA_MEMORY_EXCLUSIVE - only allocate memory from the declared regions. + Do not allow dma_alloc_coherent() to fall back to system memory when + it's out of memory in the declared region. The return value will be either DMA_MEMORY_MAP or DMA_MEMORY_IO and must correspond to a passed in flag (i.e. no returning DMA_MEMORY_IO @@ -543,15 +617,17 @@ must be accessed using the correct bus functions. If your driver isn't prepared to handle this contingency, it should not specify DMA_MEMORY_IO in the input flags. -As a simplification for the platforms, only *one* such region of +As a simplification for the platforms, only **one** such region of memory may be declared per device. For reasons of efficiency, most platforms choose to track the declared region only at the granularity of a page. For smaller allocations, you should use the dma_pool() API. -void -dma_release_declared_memory(struct device *dev) +:: + + void + dma_release_declared_memory(struct device *dev) Remove the memory region previously declared from the system. This API performs *no* in-use checking for this region and will return @@ -559,9 +635,11 @@ unconditionally having removed all the required structures. It is the driver's job to ensure that no parts of this memory region are currently in use. -void * -dma_mark_declared_memory_occupied(struct device *dev, - dma_addr_t device_addr, size_t size) +:: + + void * + dma_mark_declared_memory_occupied(struct device *dev, + dma_addr_t device_addr, size_t size) This is used to occupy specific regions of the declared space (dma_alloc_coherent() will hand out the first free region it finds). @@ -592,38 +670,37 @@ option has a performance impact. Do not enable it in production kernels. If you boot the resulting kernel will contain code which does some bookkeeping about what DMA memory was allocated for which device. If this code detects an error it prints a warning message with some details into your kernel log. An -example warning message may look like this: - -------------[ cut here ]------------ -WARNING: at /data2/repos/linux-2.6-iommu/lib/dma-debug.c:448 - check_unmap+0x203/0x490() -Hardware name: -forcedeth 0000:00:08.0: DMA-API: device driver frees DMA memory with wrong - function [device address=0x00000000640444be] [size=66 bytes] [mapped as -single] [unmapped as page] -Modules linked in: nfsd exportfs bridge stp llc r8169 -Pid: 0, comm: swapper Tainted: G W 2.6.28-dmatest-09289-g8bb99c0 #1 -Call Trace: - <IRQ> [<ffffffff80240b22>] warn_slowpath+0xf2/0x130 - [<ffffffff80647b70>] _spin_unlock+0x10/0x30 - [<ffffffff80537e75>] usb_hcd_link_urb_to_ep+0x75/0xc0 - [<ffffffff80647c22>] _spin_unlock_irqrestore+0x12/0x40 - [<ffffffff8055347f>] ohci_urb_enqueue+0x19f/0x7c0 - [<ffffffff80252f96>] queue_work+0x56/0x60 - [<ffffffff80237e10>] enqueue_task_fair+0x20/0x50 - [<ffffffff80539279>] usb_hcd_submit_urb+0x379/0xbc0 - [<ffffffff803b78c3>] cpumask_next_and+0x23/0x40 - [<ffffffff80235177>] find_busiest_group+0x207/0x8a0 - [<ffffffff8064784f>] _spin_lock_irqsave+0x1f/0x50 - [<ffffffff803c7ea3>] check_unmap+0x203/0x490 - [<ffffffff803c8259>] debug_dma_unmap_page+0x49/0x50 - [<ffffffff80485f26>] nv_tx_done_optimized+0xc6/0x2c0 - [<ffffffff80486c13>] nv_nic_irq_optimized+0x73/0x2b0 - [<ffffffff8026df84>] handle_IRQ_event+0x34/0x70 - [<ffffffff8026ffe9>] handle_edge_irq+0xc9/0x150 - [<ffffffff8020e3ab>] do_IRQ+0xcb/0x1c0 - [<ffffffff8020c093>] ret_from_intr+0x0/0xa - <EOI> <4>---[ end trace f6435a98e2a38c0e ]--- +example warning message may look like this:: + + WARNING: at /data2/repos/linux-2.6-iommu/lib/dma-debug.c:448 + check_unmap+0x203/0x490() + Hardware name: + forcedeth 0000:00:08.0: DMA-API: device driver frees DMA memory with wrong + function [device address=0x00000000640444be] [size=66 bytes] [mapped as + single] [unmapped as page] + Modules linked in: nfsd exportfs bridge stp llc r8169 + Pid: 0, comm: swapper Tainted: G W 2.6.28-dmatest-09289-g8bb99c0 #1 + Call Trace: + <IRQ> [<ffffffff80240b22>] warn_slowpath+0xf2/0x130 + [<ffffffff80647b70>] _spin_unlock+0x10/0x30 + [<ffffffff80537e75>] usb_hcd_link_urb_to_ep+0x75/0xc0 + [<ffffffff80647c22>] _spin_unlock_irqrestore+0x12/0x40 + [<ffffffff8055347f>] ohci_urb_enqueue+0x19f/0x7c0 + [<ffffffff80252f96>] queue_work+0x56/0x60 + [<ffffffff80237e10>] enqueue_task_fair+0x20/0x50 + [<ffffffff80539279>] usb_hcd_submit_urb+0x379/0xbc0 + [<ffffffff803b78c3>] cpumask_next_and+0x23/0x40 + [<ffffffff80235177>] find_busiest_group+0x207/0x8a0 + [<ffffffff8064784f>] _spin_lock_irqsave+0x1f/0x50 + [<ffffffff803c7ea3>] check_unmap+0x203/0x490 + [<ffffffff803c8259>] debug_dma_unmap_page+0x49/0x50 + [<ffffffff80485f26>] nv_tx_done_optimized+0xc6/0x2c0 + [<ffffffff80486c13>] nv_nic_irq_optimized+0x73/0x2b0 + [<ffffffff8026df84>] handle_IRQ_event+0x34/0x70 + [<ffffffff8026ffe9>] handle_edge_irq+0xc9/0x150 + [<ffffffff8020e3ab>] do_IRQ+0xcb/0x1c0 + [<ffffffff8020c093>] ret_from_intr+0x0/0xa + <EOI> <4>---[ end trace f6435a98e2a38c0e ]--- The driver developer can find the driver and the device including a stacktrace of the DMA-API call which caused this warning. @@ -637,43 +714,42 @@ details. The debugfs directory for the DMA-API debugging code is called dma-api/. In this directory the following files can currently be found: - dma-api/all_errors This file contains a numeric value. If this +=============================== =============================================== +dma-api/all_errors This file contains a numeric value. If this value is not equal to zero the debugging code will print a warning for every error it finds into the kernel log. Be careful with this option, as it can easily flood your logs. - dma-api/disabled This read-only file contains the character 'Y' +dma-api/disabled This read-only file contains the character 'Y' if the debugging code is disabled. This can happen when it runs out of memory or if it was disabled at boot time - dma-api/error_count This file is read-only and shows the total +dma-api/error_count This file is read-only and shows the total numbers of errors found. - dma-api/num_errors The number in this file shows how many +dma-api/num_errors The number in this file shows how many warnings will be printed to the kernel log before it stops. This number is initialized to one at system boot and be set by writing into this file - dma-api/min_free_entries - This read-only file can be read to get the +dma-api/min_free_entries This read-only file can be read to get the minimum number of free dma_debug_entries the allocator has ever seen. If this value goes down to zero the code will disable itself because it is not longer reliable. - dma-api/num_free_entries - The current number of free dma_debug_entries +dma-api/num_free_entries The current number of free dma_debug_entries in the allocator. - dma-api/driver-filter - You can write a name of a driver into this file +dma-api/driver-filter You can write a name of a driver into this file to limit the debug output to requests from that particular driver. Write an empty string to that file to disable the filter and see all errors again. +=============================== =============================================== If you have this code compiled into your kernel it will be enabled by default. If you want to boot without the bookkeeping anyway you can provide @@ -692,7 +768,10 @@ of preallocated entries is defined per architecture. If it is too low for you boot with 'dma_debug_entries=<your_desired_number>' to overwrite the architectural default. -void debug_dmap_mapping_error(struct device *dev, dma_addr_t dma_addr); +:: + + void + debug_dma_mapping_error(struct device *dev, dma_addr_t dma_addr); dma-debug interface debug_dma_mapping_error() to debug drivers that fail to check DMA mapping errors on addresses returned by dma_map_single() and @@ -702,4 +781,3 @@ the driver. When driver does unmap, debug_dma_unmap() checks the flag and if this flag is still set, prints warning message that includes call trace that leads up to the unmap. This interface can be called from dma_mapping_error() routines to enable DMA mapping error check debugging. - diff --git a/Documentation/DMA-ISA-LPC.txt b/Documentation/DMA-ISA-LPC.txt index c41331398752..8c2b8be6e45b 100644 --- a/Documentation/DMA-ISA-LPC.txt +++ b/Documentation/DMA-ISA-LPC.txt @@ -1,19 +1,20 @@ - DMA with ISA and LPC devices - ============================ +============================ +DMA with ISA and LPC devices +============================ - Pierre Ossman <drzeus@drzeus.cx> +:Author: Pierre Ossman <drzeus@drzeus.cx> This document describes how to do DMA transfers using the old ISA DMA controller. Even though ISA is more or less dead today the LPC bus uses the same DMA system so it will be around for quite some time. -Part I - Headers and dependencies ---------------------------------- +Headers and dependencies +------------------------ -To do ISA style DMA you need to include two headers: +To do ISA style DMA you need to include two headers:: -#include <linux/dma-mapping.h> -#include <asm/dma.h> + #include <linux/dma-mapping.h> + #include <asm/dma.h> The first is the generic DMA API used to convert virtual addresses to bus addresses (see Documentation/DMA-API.txt for details). @@ -23,8 +24,8 @@ this is not present on all platforms make sure you construct your Kconfig to be dependent on ISA_DMA_API (not ISA) so that nobody tries to build your driver on unsupported platforms. -Part II - Buffer allocation ---------------------------- +Buffer allocation +----------------- The ISA DMA controller has some very strict requirements on which memory it can access so extra care must be taken when allocating @@ -42,13 +43,13 @@ requirements you pass the flag GFP_DMA to kmalloc. Unfortunately the memory available for ISA DMA is scarce so unless you allocate the memory during boot-up it's a good idea to also pass -__GFP_REPEAT and __GFP_NOWARN to make the allocator try a bit harder. +__GFP_RETRY_MAYFAIL and __GFP_NOWARN to make the allocator try a bit harder. (This scarcity also means that you should allocate the buffer as early as possible and not release it until the driver is unloaded.) -Part III - Address translation ------------------------------- +Address translation +------------------- To translate the virtual address to a bus address, use the normal DMA API. Do _not_ use isa_virt_to_phys() even though it does the same @@ -61,8 +62,8 @@ Note: x86_64 had a broken DMA API when it came to ISA but has since been fixed. If your arch has problems then fix the DMA API instead of reverting to the ISA functions. -Part IV - Channels ------------------- +Channels +-------- A normal ISA DMA controller has 8 channels. The lower four are for 8-bit transfers and the upper four are for 16-bit transfers. @@ -80,8 +81,8 @@ The ability to use 16-bit or 8-bit transfers is _not_ up to you as a driver author but depends on what the hardware supports. Check your specs or test different channels. -Part V - Transfer data ----------------------- +Transfer data +------------- Now for the good stuff, the actual DMA transfer. :) @@ -112,37 +113,37 @@ Once the DMA transfer is finished (or timed out) you should disable the channel again. You should also check get_dma_residue() to make sure that all data has been transferred. -Example: +Example:: -int flags, residue; + int flags, residue; -flags = claim_dma_lock(); + flags = claim_dma_lock(); -clear_dma_ff(); + clear_dma_ff(); -set_dma_mode(channel, DMA_MODE_WRITE); -set_dma_addr(channel, phys_addr); -set_dma_count(channel, num_bytes); + set_dma_mode(channel, DMA_MODE_WRITE); + set_dma_addr(channel, phys_addr); + set_dma_count(channel, num_bytes); -dma_enable(channel); + dma_enable(channel); -release_dma_lock(flags); + release_dma_lock(flags); -while (!device_done()); + while (!device_done()); -flags = claim_dma_lock(); + flags = claim_dma_lock(); -dma_disable(channel); + dma_disable(channel); -residue = dma_get_residue(channel); -if (residue != 0) - printk(KERN_ERR "driver: Incomplete DMA transfer!" - " %d bytes left!\n", residue); + residue = dma_get_residue(channel); + if (residue != 0) + printk(KERN_ERR "driver: Incomplete DMA transfer!" + " %d bytes left!\n", residue); -release_dma_lock(flags); + release_dma_lock(flags); -Part VI - Suspend/resume ------------------------- +Suspend/resume +-------------- It is the driver's responsibility to make sure that the machine isn't suspended while a DMA transfer is in progress. Also, all DMA settings diff --git a/Documentation/DMA-attributes.txt b/Documentation/DMA-attributes.txt index 44c6bc496eee..8f8d97f65d73 100644 --- a/Documentation/DMA-attributes.txt +++ b/Documentation/DMA-attributes.txt @@ -1,5 +1,6 @@ - DMA attributes - ============== +============== +DMA attributes +============== This document describes the semantics of the DMA attributes that are defined in linux/dma-mapping.h. @@ -108,6 +109,7 @@ This is a hint to the DMA-mapping subsystem that it's probably not worth the time to try to allocate memory to in a way that gives better TLB efficiency (AKA it's not worth trying to build the mapping out of larger pages). You might want to specify this if: + - You know that the accesses to this memory won't thrash the TLB. You might know that the accesses are likely to be sequential or that they aren't sequential but it's unlikely you'll ping-pong @@ -121,11 +123,12 @@ pages). You might want to specify this if: the mapping to have a short lifetime then it may be worth it to optimize allocation (avoid coming up with large pages) instead of getting the slight performance win of larger pages. + Setting this hint doesn't guarantee that you won't get huge pages, but it means that we won't try quite as hard to get them. -NOTE: At the moment DMA_ATTR_ALLOC_SINGLE_PAGES is only implemented on ARM, -though ARM64 patches will likely be posted soon. +.. note:: At the moment DMA_ATTR_ALLOC_SINGLE_PAGES is only implemented on ARM, + though ARM64 patches will likely be posted soon. DMA_ATTR_NO_WARN ---------------- @@ -142,10 +145,10 @@ problem at all, depending on the implementation of the retry mechanism. So, this provides a way for drivers to avoid those error messages on calls where allocation failures are not a problem, and shouldn't bother the logs. -NOTE: At the moment DMA_ATTR_NO_WARN is only implemented on PowerPC. +.. note:: At the moment DMA_ATTR_NO_WARN is only implemented on PowerPC. DMA_ATTR_PRIVILEGED ------------------------------- +------------------- Some advanced peripherals such as remote processors and GPUs perform accesses to DMA buffers in both privileged "supervisor" and unprivileged diff --git a/Documentation/DocBook/.gitignore b/Documentation/DocBook/.gitignore deleted file mode 100644 index e05da3f7aa21..000000000000 --- a/Documentation/DocBook/.gitignore +++ /dev/null @@ -1,17 +0,0 @@ -*.xml -*.ps -*.pdf -*.html -*.9.gz -*.9 -*.aux -*.dvi -*.log -*.out -*.png -*.gif -*.svg -*.proc -*.db -media-indices.tmpl -media-entities.tmpl diff --git a/Documentation/DocBook/Makefile b/Documentation/DocBook/Makefile deleted file mode 100644 index 85916f13d330..000000000000 --- a/Documentation/DocBook/Makefile +++ /dev/null @@ -1,282 +0,0 @@ -### -# This makefile is used to generate the kernel documentation, -# primarily based on in-line comments in various source files. -# See Documentation/kernel-doc-nano-HOWTO.txt for instruction in how -# to document the SRC - and how to read it. -# To add a new book the only step required is to add the book to the -# list of DOCBOOKS. - -DOCBOOKS := z8530book.xml \ - kernel-hacking.xml kernel-locking.xml \ - networking.xml \ - filesystems.xml lsm.xml kgdb.xml \ - libata.xml mtdnand.xml librs.xml rapidio.xml \ - s390-drivers.xml scsi.xml \ - sh.xml w1.xml - -ifeq ($(DOCBOOKS),) - -# Skip DocBook build if the user explicitly requested no DOCBOOKS. -.DEFAULT: - @echo " SKIP DocBook $@ target (DOCBOOKS=\"\" specified)." -else -ifneq ($(SPHINXDIRS),) - -# Skip DocBook build if the user explicitly requested a sphinx dir -.DEFAULT: - @echo " SKIP DocBook $@ target (SPHINXDIRS specified)." -else - - -### -# The build process is as follows (targets): -# (xmldocs) [by docproc] -# file.tmpl --> file.xml +--> file.ps (psdocs) [by db2ps or xmlto] -# +--> file.pdf (pdfdocs) [by db2pdf or xmlto] -# +--> DIR=file (htmldocs) [by xmlto] -# +--> man/ (mandocs) [by xmlto] - - -# for PDF and PS output you can choose between xmlto and docbook-utils tools -PDF_METHOD = $(prefer-db2x) -PS_METHOD = $(prefer-db2x) - - -targets += $(DOCBOOKS) -BOOKS := $(addprefix $(obj)/,$(DOCBOOKS)) -xmldocs: $(BOOKS) -sgmldocs: xmldocs - -PS := $(patsubst %.xml, %.ps, $(BOOKS)) -psdocs: $(PS) - -PDF := $(patsubst %.xml, %.pdf, $(BOOKS)) -pdfdocs: $(PDF) - -HTML := $(sort $(patsubst %.xml, %.html, $(BOOKS))) -htmldocs: $(HTML) - $(call cmd,build_main_index) - -MAN := $(patsubst %.xml, %.9, $(BOOKS)) -mandocs: $(MAN) - find $(obj)/man -name '*.9' | xargs gzip -nf - -# Default location for installed man pages -export INSTALL_MAN_PATH = $(objtree)/usr - -installmandocs: mandocs - mkdir -p $(INSTALL_MAN_PATH)/man/man9/ - find $(obj)/man -name '*.9.gz' -printf '%h %f\n' | \ - sort -k 2 -k 1 | uniq -f 1 | sed -e 's: :/:' | \ - xargs install -m 644 -t $(INSTALL_MAN_PATH)/man/man9/ - -# no-op for the DocBook toolchain -epubdocs: -latexdocs: -linkcheckdocs: - -### -#External programs used -KERNELDOCXMLREF = $(srctree)/scripts/kernel-doc-xml-ref -KERNELDOC = $(srctree)/scripts/kernel-doc -DOCPROC = $(objtree)/scripts/docproc -CHECK_LC_CTYPE = $(objtree)/scripts/check-lc_ctype - -# Use a fixed encoding - UTF-8 if the C library has support built-in -# or ASCII if not -LC_CTYPE := $(call try-run, LC_CTYPE=C.UTF-8 $(CHECK_LC_CTYPE),C.UTF-8,C) -export LC_CTYPE - -XMLTOFLAGS = -m $(srctree)/$(src)/stylesheet.xsl -XMLTOFLAGS += --skip-validation - -### -# DOCPROC is used for two purposes: -# 1) To generate a dependency list for a .tmpl file -# 2) To preprocess a .tmpl file and call kernel-doc with -# appropriate parameters. -# The following rules are used to generate the .xml documentation -# required to generate the final targets. (ps, pdf, html). -quiet_cmd_docproc = DOCPROC $@ - cmd_docproc = SRCTREE=$(srctree)/ $(DOCPROC) doc $< >$@ -define rule_docproc - set -e; \ - $(if $($(quiet)cmd_$(1)),echo ' $($(quiet)cmd_$(1))';) \ - $(cmd_$(1)); \ - ( \ - echo 'cmd_$@ := $(cmd_$(1))'; \ - echo $@: `SRCTREE=$(srctree) $(DOCPROC) depend $<`; \ - ) > $(dir $@).$(notdir $@).cmd -endef - -%.xml: %.tmpl $(KERNELDOC) $(DOCPROC) $(KERNELDOCXMLREF) FORCE - $(call if_changed_rule,docproc) - -# Tell kbuild to always build the programs -always := $(hostprogs-y) - -notfoundtemplate = echo "*** You have to install docbook-utils or xmlto ***"; \ - exit 1 -db2xtemplate = db2TYPE -o $(dir $@) $< -xmltotemplate = xmlto TYPE $(XMLTOFLAGS) -o $(dir $@) $< - -# determine which methods are available -ifeq ($(shell which db2ps >/dev/null 2>&1 && echo found),found) - use-db2x = db2x - prefer-db2x = db2x -else - use-db2x = notfound - prefer-db2x = $(use-xmlto) -endif -ifeq ($(shell which xmlto >/dev/null 2>&1 && echo found),found) - use-xmlto = xmlto - prefer-xmlto = xmlto -else - use-xmlto = notfound - prefer-xmlto = $(use-db2x) -endif - -# the commands, generated from the chosen template -quiet_cmd_db2ps = PS $@ - cmd_db2ps = $(subst TYPE,ps, $($(PS_METHOD)template)) -%.ps : %.xml - $(call cmd,db2ps) - -quiet_cmd_db2pdf = PDF $@ - cmd_db2pdf = $(subst TYPE,pdf, $($(PDF_METHOD)template)) -%.pdf : %.xml - $(call cmd,db2pdf) - - -index = index.html -main_idx = $(obj)/$(index) -quiet_cmd_build_main_index = HTML $(main_idx) - cmd_build_main_index = rm -rf $(main_idx); \ - echo '<h1>Linux Kernel HTML Documentation</h1>' >> $(main_idx) && \ - echo '<h2>Kernel Version: $(KERNELVERSION)</h2>' >> $(main_idx) && \ - cat $(HTML) >> $(main_idx) - -quiet_cmd_db2html = HTML $@ - cmd_db2html = xmlto html $(XMLTOFLAGS) -o $(patsubst %.html,%,$@) $< && \ - echo '<a HREF="$(patsubst %.html,%,$(notdir $@))/index.html"> \ - $(patsubst %.html,%,$(notdir $@))</a><p>' > $@ - -### -# Rules to create an aux XML and .db, and use them to re-process the DocBook XML -# to fill internal hyperlinks - gen_aux_xml = : - quiet_gen_aux_xml = echo ' XMLREF $@' -silent_gen_aux_xml = : -%.aux.xml: %.xml - @$($(quiet)gen_aux_xml) - @rm -rf $@ - @(cat $< | egrep "^<refentry id" | egrep -o "\".*\"" | cut -f 2 -d \" > $<.db) - @$(KERNELDOCXMLREF) -db $<.db $< > $@ -.PRECIOUS: %.aux.xml - -%.html: %.aux.xml - @(which xmlto > /dev/null 2>&1) || \ - (echo "*** You need to install xmlto ***"; \ - exit 1) - @rm -rf $@ $(patsubst %.html,%,$@) - $(call cmd,db2html) - @if [ ! -z "$(PNG-$(basename $(notdir $@)))" ]; then \ - cp $(PNG-$(basename $(notdir $@))) $(patsubst %.html,%,$@); fi - -quiet_cmd_db2man = MAN $@ - cmd_db2man = if grep -q refentry $<; then xmlto man $(XMLTOFLAGS) -o $(obj)/man/$(*F) $< ; fi -%.9 : %.xml - @(which xmlto > /dev/null 2>&1) || \ - (echo "*** You need to install xmlto ***"; \ - exit 1) - $(Q)mkdir -p $(obj)/man/$(*F) - $(call cmd,db2man) - @touch $@ - -### -# Rules to generate postscripts and PNG images from .fig format files -quiet_cmd_fig2eps = FIG2EPS $@ - cmd_fig2eps = fig2dev -Leps $< $@ - -%.eps: %.fig - @(which fig2dev > /dev/null 2>&1) || \ - (echo "*** You need to install transfig ***"; \ - exit 1) - $(call cmd,fig2eps) - -quiet_cmd_fig2png = FIG2PNG $@ - cmd_fig2png = fig2dev -Lpng $< $@ - -%.png: %.fig - @(which fig2dev > /dev/null 2>&1) || \ - (echo "*** You need to install transfig ***"; \ - exit 1) - $(call cmd,fig2png) - -### -# Rule to convert a .c file to inline XML documentation - gen_xml = : - quiet_gen_xml = echo ' GEN $@' -silent_gen_xml = : -%.xml: %.c - @$($(quiet)gen_xml) - @( \ - echo "<programlisting>"; \ - expand --tabs=8 < $< | \ - sed -e "s/&/\\&/g" \ - -e "s/</\\</g" \ - -e "s/>/\\>/g"; \ - echo "</programlisting>") > $@ - -endif # DOCBOOKS="" -endif # SPHINDIR=... - -### -# Help targets as used by the top-level makefile -dochelp: - @echo ' Linux kernel internal documentation in different formats (DocBook):' - @echo ' htmldocs - HTML' - @echo ' pdfdocs - PDF' - @echo ' psdocs - Postscript' - @echo ' xmldocs - XML DocBook' - @echo ' mandocs - man pages' - @echo ' installmandocs - install man pages generated by mandocs to INSTALL_MAN_PATH'; \ - echo ' (default: $(INSTALL_MAN_PATH))'; \ - echo '' - @echo ' cleandocs - clean all generated DocBook files' - @echo - @echo ' make DOCBOOKS="s1.xml s2.xml" [target] Generate only docs s1.xml s2.xml' - @echo ' valid values for DOCBOOKS are: $(DOCBOOKS)' - @echo - @echo " make DOCBOOKS=\"\" [target] Don't generate docs from Docbook" - @echo ' This is useful to generate only the ReST docs (Sphinx)' - - -### -# Temporary files left by various tools -clean-files := $(DOCBOOKS) \ - $(patsubst %.xml, %.dvi, $(DOCBOOKS)) \ - $(patsubst %.xml, %.aux, $(DOCBOOKS)) \ - $(patsubst %.xml, %.tex, $(DOCBOOKS)) \ - $(patsubst %.xml, %.log, $(DOCBOOKS)) \ - $(patsubst %.xml, %.out, $(DOCBOOKS)) \ - $(patsubst %.xml, %.ps, $(DOCBOOKS)) \ - $(patsubst %.xml, %.pdf, $(DOCBOOKS)) \ - $(patsubst %.xml, %.html, $(DOCBOOKS)) \ - $(patsubst %.xml, %.9, $(DOCBOOKS)) \ - $(patsubst %.xml, %.aux.xml, $(DOCBOOKS)) \ - $(patsubst %.xml, %.xml.db, $(DOCBOOKS)) \ - $(patsubst %.xml, %.xml, $(DOCBOOKS)) \ - $(patsubst %.xml, .%.xml.cmd, $(DOCBOOKS)) \ - $(index) - -clean-dirs := $(patsubst %.xml,%,$(DOCBOOKS)) man - -cleandocs: - $(Q)rm -f $(call objectify, $(clean-files)) - $(Q)rm -rf $(call objectify, $(clean-dirs)) - -# Declare the contents of the .PHONY variable as phony. We keep that -# information in a variable so we can use it in if_changed and friends. - -.PHONY: $(PHONY) diff --git a/Documentation/DocBook/filesystems.tmpl b/Documentation/DocBook/filesystems.tmpl deleted file mode 100644 index 6006b6358c86..000000000000 --- a/Documentation/DocBook/filesystems.tmpl +++ /dev/null @@ -1,381 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" - "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> - -<book id="Linux-filesystems-API"> - <bookinfo> - <title>Linux Filesystems API</title> - - <legalnotice> - <para> - This documentation is free software; you can redistribute - it and/or modify it under the terms of the GNU General Public - License as published by the Free Software Foundation; either - version 2 of the License, or (at your option) any later - version. - </para> - - <para> - This program is distributed in the hope that it will be - useful, but WITHOUT ANY WARRANTY; without even the implied - warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. - See the GNU General Public License for more details. - </para> - - <para> - You should have received a copy of the GNU General Public - License along with this program; if not, write to the Free - Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, - MA 02111-1307 USA - </para> - - <para> - For more details see the file COPYING in the source - distribution of Linux. - </para> - </legalnotice> - </bookinfo> - -<toc></toc> - - <chapter id="vfs"> - <title>The Linux VFS</title> - <sect1 id="the_filesystem_types"><title>The Filesystem types</title> -!Iinclude/linux/fs.h - </sect1> - <sect1 id="the_directory_cache"><title>The Directory Cache</title> -!Efs/dcache.c -!Iinclude/linux/dcache.h - </sect1> - <sect1 id="inode_handling"><title>Inode Handling</title> -!Efs/inode.c -!Efs/bad_inode.c - </sect1> - <sect1 id="registration_and_superblocks"><title>Registration and Superblocks</title> -!Efs/super.c - </sect1> - <sect1 id="file_locks"><title>File Locks</title> -!Efs/locks.c -!Ifs/locks.c - </sect1> - <sect1 id="other_functions"><title>Other Functions</title> -!Efs/mpage.c -!Efs/namei.c -!Efs/buffer.c -!Eblock/bio.c -!Efs/seq_file.c -!Efs/filesystems.c -!Efs/fs-writeback.c -!Efs/block_dev.c - </sect1> - </chapter> - - <chapter id="proc"> - <title>The proc filesystem</title> - - <sect1 id="sysctl_interface"><title>sysctl interface</title> -!Ekernel/sysctl.c - </sect1> - - <sect1 id="proc_filesystem_interface"><title>proc filesystem interface</title> -!Ifs/proc/base.c - </sect1> - </chapter> - - <chapter id="fs_events"> - <title>Events based on file descriptors</title> -!Efs/eventfd.c - </chapter> - - <chapter id="sysfs"> - <title>The Filesystem for Exporting Kernel Objects</title> -!Efs/sysfs/file.c -!Efs/sysfs/symlink.c - </chapter> - - <chapter id="debugfs"> - <title>The debugfs filesystem</title> - - <sect1 id="debugfs_interface"><title>debugfs interface</title> -!Efs/debugfs/inode.c -!Efs/debugfs/file.c - </sect1> - </chapter> - - <chapter id="LinuxJDBAPI"> - <chapterinfo> - <title>The Linux Journalling API</title> - - <authorgroup> - <author> - <firstname>Roger</firstname> - <surname>Gammans</surname> - <affiliation> - <address> - <email>rgammans@computer-surgery.co.uk</email> - </address> - </affiliation> - </author> - </authorgroup> - - <authorgroup> - <author> - <firstname>Stephen</firstname> - <surname>Tweedie</surname> - <affiliation> - <address> - <email>sct@redhat.com</email> - </address> - </affiliation> - </author> - </authorgroup> - - <copyright> - <year>2002</year> - <holder>Roger Gammans</holder> - </copyright> - </chapterinfo> - - <title>The Linux Journalling API</title> - - <sect1 id="journaling_overview"> - <title>Overview</title> - <sect2 id="journaling_details"> - <title>Details</title> -<para> -The journalling layer is easy to use. You need to -first of all create a journal_t data structure. There are -two calls to do this dependent on how you decide to allocate the physical -media on which the journal resides. The jbd2_journal_init_inode() call -is for journals stored in filesystem inodes, or the jbd2_journal_init_dev() -call can be used for journal stored on a raw device (in a continuous range -of blocks). A journal_t is a typedef for a struct pointer, so when -you are finally finished make sure you call jbd2_journal_destroy() on it -to free up any used kernel memory. -</para> - -<para> -Once you have got your journal_t object you need to 'mount' or load the journal -file. The journalling layer expects the space for the journal was already -allocated and initialized properly by the userspace tools. When loading the -journal you must call jbd2_journal_load() to process journal contents. If the -client file system detects the journal contents does not need to be processed -(or even need not have valid contents), it may call jbd2_journal_wipe() to -clear the journal contents before calling jbd2_journal_load(). -</para> - -<para> -Note that jbd2_journal_wipe(..,0) calls jbd2_journal_skip_recovery() for you if -it detects any outstanding transactions in the journal and similarly -jbd2_journal_load() will call jbd2_journal_recover() if necessary. I would -advise reading ext4_load_journal() in fs/ext4/super.c for examples on this -stage. -</para> - -<para> -Now you can go ahead and start modifying the underlying -filesystem. Almost. -</para> - -<para> - -You still need to actually journal your filesystem changes, this -is done by wrapping them into transactions. Additionally you -also need to wrap the modification of each of the buffers -with calls to the journal layer, so it knows what the modifications -you are actually making are. To do this use jbd2_journal_start() which -returns a transaction handle. -</para> - -<para> -jbd2_journal_start() -and its counterpart jbd2_journal_stop(), which indicates the end of a -transaction are nestable calls, so you can reenter a transaction if necessary, -but remember you must call jbd2_journal_stop() the same number of times as -jbd2_journal_start() before the transaction is completed (or more accurately -leaves the update phase). Ext4/VFS makes use of this feature to simplify -handling of inode dirtying, quota support, etc. -</para> - -<para> -Inside each transaction you need to wrap the modifications to the -individual buffers (blocks). Before you start to modify a buffer you -need to call jbd2_journal_get_{create,write,undo}_access() as appropriate, -this allows the journalling layer to copy the unmodified data if it -needs to. After all the buffer may be part of a previously uncommitted -transaction. -At this point you are at last ready to modify a buffer, and once -you are have done so you need to call jbd2_journal_dirty_{meta,}data(). -Or if you've asked for access to a buffer you now know is now longer -required to be pushed back on the device you can call jbd2_journal_forget() -in much the same way as you might have used bforget() in the past. -</para> - -<para> -A jbd2_journal_flush() may be called at any time to commit and checkpoint -all your transactions. -</para> - -<para> -Then at umount time , in your put_super() you can then call jbd2_journal_destroy() -to clean up your in-core journal object. -</para> - -<para> -Unfortunately there a couple of ways the journal layer can cause a deadlock. -The first thing to note is that each task can only have -a single outstanding transaction at any one time, remember nothing -commits until the outermost jbd2_journal_stop(). This means -you must complete the transaction at the end of each file/inode/address -etc. operation you perform, so that the journalling system isn't re-entered -on another journal. Since transactions can't be nested/batched -across differing journals, and another filesystem other than -yours (say ext4) may be modified in a later syscall. -</para> - -<para> -The second case to bear in mind is that jbd2_journal_start() can -block if there isn't enough space in the journal for your transaction -(based on the passed nblocks param) - when it blocks it merely(!) needs to -wait for transactions to complete and be committed from other tasks, -so essentially we are waiting for jbd2_journal_stop(). So to avoid -deadlocks you must treat jbd2_journal_start/stop() as if they -were semaphores and include them in your semaphore ordering rules to prevent -deadlocks. Note that jbd2_journal_extend() has similar blocking behaviour to -jbd2_journal_start() so you can deadlock here just as easily as on -jbd2_journal_start(). -</para> - -<para> -Try to reserve the right number of blocks the first time. ;-). This will -be the maximum number of blocks you are going to touch in this transaction. -I advise having a look at at least ext4_jbd.h to see the basis on which -ext4 uses to make these decisions. -</para> - -<para> -Another wriggle to watch out for is your on-disk block allocation strategy. -Why? Because, if you do a delete, you need to ensure you haven't reused any -of the freed blocks until the transaction freeing these blocks commits. If you -reused these blocks and crash happens, there is no way to restore the contents -of the reallocated blocks at the end of the last fully committed transaction. - -One simple way of doing this is to mark blocks as free in internal in-memory -block allocation structures only after the transaction freeing them commits. -Ext4 uses journal commit callback for this purpose. -</para> - -<para> -With journal commit callbacks you can ask the journalling layer to call a -callback function when the transaction is finally committed to disk, so that -you can do some of your own management. You ask the journalling layer for -calling the callback by simply setting journal->j_commit_callback function -pointer and that function is called after each transaction commit. You can also -use transaction->t_private_list for attaching entries to a transaction that -need processing when the transaction commits. -</para> - -<para> -JBD2 also provides a way to block all transaction updates via -jbd2_journal_{un,}lock_updates(). Ext4 uses this when it wants a window with a -clean and stable fs for a moment. E.g. -</para> - -<programlisting> - - jbd2_journal_lock_updates() //stop new stuff happening.. - jbd2_journal_flush() // checkpoint everything. - ..do stuff on stable fs - jbd2_journal_unlock_updates() // carry on with filesystem use. -</programlisting> - -<para> -The opportunities for abuse and DOS attacks with this should be obvious, -if you allow unprivileged userspace to trigger codepaths containing these -calls. -</para> - - </sect2> - - <sect2 id="jbd_summary"> - <title>Summary</title> -<para> -Using the journal is a matter of wrapping the different context changes, -being each mount, each modification (transaction) and each changed buffer -to tell the journalling layer about them. -</para> - - </sect2> - - </sect1> - - <sect1 id="data_types"> - <title>Data Types</title> - <para> - The journalling layer uses typedefs to 'hide' the concrete definitions - of the structures used. As a client of the JBD2 layer you can - just rely on the using the pointer as a magic cookie of some sort. - - Obviously the hiding is not enforced as this is 'C'. - </para> - <sect2 id="structures"><title>Structures</title> -!Iinclude/linux/jbd2.h - </sect2> - </sect1> - - <sect1 id="functions"> - <title>Functions</title> - <para> - The functions here are split into two groups those that - affect a journal as a whole, and those which are used to - manage transactions - </para> - <sect2 id="journal_level"><title>Journal Level</title> -!Efs/jbd2/journal.c -!Ifs/jbd2/recovery.c - </sect2> - <sect2 id="transaction_level"><title>Transasction Level</title> -!Efs/jbd2/transaction.c - </sect2> - </sect1> - <sect1 id="see_also"> - <title>See also</title> - <para> - <citation> - <ulink url="http://kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz"> - Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen Tweedie - </ulink> - </citation> - </para> - <para> - <citation> - <ulink url="http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html"> - Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen Tweedie - </ulink> - </citation> - </para> - </sect1> - - </chapter> - - <chapter id="splice"> - <title>splice API</title> - <para> - splice is a method for moving blocks of data around inside the - kernel, without continually transferring them between the kernel - and user space. - </para> -!Ffs/splice.c - </chapter> - - <chapter id="pipes"> - <title>pipes API</title> - <para> - Pipe interfaces are all for in-kernel (builtin image) use. - They are not exported for use by modules. - </para> -!Iinclude/linux/pipe_fs_i.h -!Ffs/pipe.c - </chapter> - -</book> diff --git a/Documentation/DocBook/kernel-hacking.tmpl b/Documentation/DocBook/kernel-hacking.tmpl deleted file mode 100644 index da5c087462b1..000000000000 --- a/Documentation/DocBook/kernel-hacking.tmpl +++ /dev/null @@ -1,1312 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" - "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> - -<book id="lk-hacking-guide"> - <bookinfo> - <title>Unreliable Guide To Hacking The Linux Kernel</title> - - <authorgroup> - <author> - <firstname>Rusty</firstname> - <surname>Russell</surname> - <affiliation> - <address> - <email>rusty@rustcorp.com.au</email> - </address> - </affiliation> - </author> - </authorgroup> - - <copyright> - <year>2005</year> - <holder>Rusty Russell</holder> - </copyright> - - <legalnotice> - <para> - This documentation is free software; you can redistribute - it and/or modify it under the terms of the GNU General Public - License as published by the Free Software Foundation; either - version 2 of the License, or (at your option) any later - version. - </para> - - <para> - This program is distributed in the hope that it will be - useful, but WITHOUT ANY WARRANTY; without even the implied - warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. - See the GNU General Public License for more details. - </para> - - <para> - You should have received a copy of the GNU General Public - License along with this program; if not, write to the Free - Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, - MA 02111-1307 USA - </para> - - <para> - For more details see the file COPYING in the source - distribution of Linux. - </para> - </legalnotice> - - <releaseinfo> - This is the first release of this document as part of the kernel tarball. - </releaseinfo> - - </bookinfo> - - <toc></toc> - - <chapter id="introduction"> - <title>Introduction</title> - <para> - Welcome, gentle reader, to Rusty's Remarkably Unreliable Guide to Linux - Kernel Hacking. This document describes the common routines and - general requirements for kernel code: its goal is to serve as a - primer for Linux kernel development for experienced C - programmers. I avoid implementation details: that's what the - code is for, and I ignore whole tracts of useful routines. - </para> - <para> - Before you read this, please understand that I never wanted to - write this document, being grossly under-qualified, but I always - wanted to read it, and this was the only way. I hope it will - grow into a compendium of best practice, common starting points - and random information. - </para> - </chapter> - - <chapter id="basic-players"> - <title>The Players</title> - - <para> - At any time each of the CPUs in a system can be: - </para> - - <itemizedlist> - <listitem> - <para> - not associated with any process, serving a hardware interrupt; - </para> - </listitem> - - <listitem> - <para> - not associated with any process, serving a softirq or tasklet; - </para> - </listitem> - - <listitem> - <para> - running in kernel space, associated with a process (user context); - </para> - </listitem> - - <listitem> - <para> - running a process in user space. - </para> - </listitem> - </itemizedlist> - - <para> - There is an ordering between these. The bottom two can preempt - each other, but above that is a strict hierarchy: each can only be - preempted by the ones above it. For example, while a softirq is - running on a CPU, no other softirq will preempt it, but a hardware - interrupt can. However, any other CPUs in the system execute - independently. - </para> - - <para> - We'll see a number of ways that the user context can block - interrupts, to become truly non-preemptable. - </para> - - <sect1 id="basics-usercontext"> - <title>User Context</title> - - <para> - User context is when you are coming in from a system call or other - trap: like userspace, you can be preempted by more important tasks - and by interrupts. You can sleep, by calling - <function>schedule()</function>. - </para> - - <note> - <para> - You are always in user context on module load and unload, - and on operations on the block device layer. - </para> - </note> - - <para> - In user context, the <varname>current</varname> pointer (indicating - the task we are currently executing) is valid, and - <function>in_interrupt()</function> - (<filename>include/linux/interrupt.h</filename>) is <returnvalue>false - </returnvalue>. - </para> - - <caution> - <para> - Beware that if you have preemption or softirqs disabled - (see below), <function>in_interrupt()</function> will return a - false positive. - </para> - </caution> - </sect1> - - <sect1 id="basics-hardirqs"> - <title>Hardware Interrupts (Hard IRQs)</title> - - <para> - Timer ticks, <hardware>network cards</hardware> and - <hardware>keyboard</hardware> are examples of real - hardware which produce interrupts at any time. The kernel runs - interrupt handlers, which services the hardware. The kernel - guarantees that this handler is never re-entered: if the same - interrupt arrives, it is queued (or dropped). Because it - disables interrupts, this handler has to be fast: frequently it - simply acknowledges the interrupt, marks a 'software interrupt' - for execution and exits. - </para> - - <para> - You can tell you are in a hardware interrupt, because - <function>in_irq()</function> returns <returnvalue>true</returnvalue>. - </para> - <caution> - <para> - Beware that this will return a false positive if interrupts are disabled - (see below). - </para> - </caution> - </sect1> - - <sect1 id="basics-softirqs"> - <title>Software Interrupt Context: Softirqs and Tasklets</title> - - <para> - Whenever a system call is about to return to userspace, or a - hardware interrupt handler exits, any 'software interrupts' - which are marked pending (usually by hardware interrupts) are - run (<filename>kernel/softirq.c</filename>). - </para> - - <para> - Much of the real interrupt handling work is done here. Early in - the transition to <acronym>SMP</acronym>, there were only 'bottom - halves' (BHs), which didn't take advantage of multiple CPUs. Shortly - after we switched from wind-up computers made of match-sticks and snot, - we abandoned this limitation and switched to 'softirqs'. - </para> - - <para> - <filename class="headerfile">include/linux/interrupt.h</filename> lists the - different softirqs. A very important softirq is the - timer softirq (<filename - class="headerfile">include/linux/timer.h</filename>): you can - register to have it call functions for you in a given length of - time. - </para> - - <para> - Softirqs are often a pain to deal with, since the same softirq - will run simultaneously on more than one CPU. For this reason, - tasklets (<filename - class="headerfile">include/linux/interrupt.h</filename>) are more - often used: they are dynamically-registrable (meaning you can have - as many as you want), and they also guarantee that any tasklet - will only run on one CPU at any time, although different tasklets - can run simultaneously. - </para> - <caution> - <para> - The name 'tasklet' is misleading: they have nothing to do with 'tasks', - and probably more to do with some bad vodka Alexey Kuznetsov had at the - time. - </para> - </caution> - - <para> - You can tell you are in a softirq (or tasklet) - using the <function>in_softirq()</function> macro - (<filename class="headerfile">include/linux/interrupt.h</filename>). - </para> - <caution> - <para> - Beware that this will return a false positive if a bh lock (see below) - is held. - </para> - </caution> - </sect1> - </chapter> - - <chapter id="basic-rules"> - <title>Some Basic Rules</title> - - <variablelist> - <varlistentry> - <term>No memory protection</term> - <listitem> - <para> - If you corrupt memory, whether in user context or - interrupt context, the whole machine will crash. Are you - sure you can't do what you want in userspace? - </para> - </listitem> - </varlistentry> - - <varlistentry> - <term>No floating point or <acronym>MMX</acronym></term> - <listitem> - <para> - The <acronym>FPU</acronym> context is not saved; even in user - context the <acronym>FPU</acronym> state probably won't - correspond with the current process: you would mess with some - user process' <acronym>FPU</acronym> state. If you really want - to do this, you would have to explicitly save/restore the full - <acronym>FPU</acronym> state (and avoid context switches). It - is generally a bad idea; use fixed point arithmetic first. - </para> - </listitem> - </varlistentry> - - <varlistentry> - <term>A rigid stack limit</term> - <listitem> - <para> - Depending on configuration options the kernel stack is about 3K to 6K for most 32-bit architectures: it's - about 14K on most 64-bit archs, and often shared with interrupts - so you can't use it all. Avoid deep recursion and huge local - arrays on the stack (allocate them dynamically instead). - </para> - </listitem> - </varlistentry> - - <varlistentry> - <term>The Linux kernel is portable</term> - <listitem> - <para> - Let's keep it that way. Your code should be 64-bit clean, - and endian-independent. You should also minimize CPU - specific stuff, e.g. inline assembly should be cleanly - encapsulated and minimized to ease porting. Generally it - should be restricted to the architecture-dependent part of - the kernel tree. - </para> - </listitem> - </varlistentry> - </variablelist> - </chapter> - - <chapter id="ioctls"> - <title>ioctls: Not writing a new system call</title> - - <para> - A system call generally looks like this - </para> - - <programlisting> -asmlinkage long sys_mycall(int arg) -{ - return 0; -} - </programlisting> - - <para> - First, in most cases you don't want to create a new system call. - You create a character device and implement an appropriate ioctl - for it. This is much more flexible than system calls, doesn't have - to be entered in every architecture's - <filename class="headerfile">include/asm/unistd.h</filename> and - <filename>arch/kernel/entry.S</filename> file, and is much more - likely to be accepted by Linus. - </para> - - <para> - If all your routine does is read or write some parameter, consider - implementing a <function>sysfs</function> interface instead. - </para> - - <para> - Inside the ioctl you're in user context to a process. When a - error occurs you return a negated errno (see - <filename class="headerfile">include/linux/errno.h</filename>), - otherwise you return <returnvalue>0</returnvalue>. - </para> - - <para> - After you slept you should check if a signal occurred: the - Unix/Linux way of handling signals is to temporarily exit the - system call with the <constant>-ERESTARTSYS</constant> error. The - system call entry code will switch back to user context, process - the signal handler and then your system call will be restarted - (unless the user disabled that). So you should be prepared to - process the restart, e.g. if you're in the middle of manipulating - some data structure. - </para> - - <programlisting> -if (signal_pending(current)) - return -ERESTARTSYS; - </programlisting> - - <para> - If you're doing longer computations: first think userspace. If you - <emphasis>really</emphasis> want to do it in kernel you should - regularly check if you need to give up the CPU (remember there is - cooperative multitasking per CPU). Idiom: - </para> - - <programlisting> -cond_resched(); /* Will sleep */ - </programlisting> - - <para> - A short note on interface design: the UNIX system call motto is - "Provide mechanism not policy". - </para> - </chapter> - - <chapter id="deadlock-recipes"> - <title>Recipes for Deadlock</title> - - <para> - You cannot call any routines which may sleep, unless: - </para> - <itemizedlist> - <listitem> - <para> - You are in user context. - </para> - </listitem> - - <listitem> - <para> - You do not own any spinlocks. - </para> - </listitem> - - <listitem> - <para> - You have interrupts enabled (actually, Andi Kleen says - that the scheduling code will enable them for you, but - that's probably not what you wanted). - </para> - </listitem> - </itemizedlist> - - <para> - Note that some functions may sleep implicitly: common ones are - the user space access functions (*_user) and memory allocation - functions without <symbol>GFP_ATOMIC</symbol>. - </para> - - <para> - You should always compile your kernel - <symbol>CONFIG_DEBUG_ATOMIC_SLEEP</symbol> on, and it will warn - you if you break these rules. If you <emphasis>do</emphasis> break - the rules, you will eventually lock up your box. - </para> - - <para> - Really. - </para> - </chapter> - - <chapter id="common-routines"> - <title>Common Routines</title> - - <sect1 id="routines-printk"> - <title> - <function>printk()</function> - <filename class="headerfile">include/linux/kernel.h</filename> - </title> - - <para> - <function>printk()</function> feeds kernel messages to the - console, dmesg, and the syslog daemon. It is useful for debugging - and reporting errors, and can be used inside interrupt context, - but use with caution: a machine which has its console flooded with - printk messages is unusable. It uses a format string mostly - compatible with ANSI C printf, and C string concatenation to give - it a first "priority" argument: - </para> - - <programlisting> -printk(KERN_INFO "i = %u\n", i); - </programlisting> - - <para> - See <filename class="headerfile">include/linux/kernel.h</filename>; - for other KERN_ values; these are interpreted by syslog as the - level. Special case: for printing an IP address use - </para> - - <programlisting> -__be32 ipaddress; -printk(KERN_INFO "my ip: %pI4\n", &ipaddress); - </programlisting> - - <para> - <function>printk()</function> internally uses a 1K buffer and does - not catch overruns. Make sure that will be enough. - </para> - - <note> - <para> - You will know when you are a real kernel hacker - when you start typoing printf as printk in your user programs :) - </para> - </note> - - <!--- From the Lions book reader department --> - - <note> - <para> - Another sidenote: the original Unix Version 6 sources had a - comment on top of its printf function: "Printf should not be - used for chit-chat". You should follow that advice. - </para> - </note> - </sect1> - - <sect1 id="routines-copy"> - <title> - <function>copy_[to/from]_user()</function> - / - <function>get_user()</function> - / - <function>put_user()</function> - <filename class="headerfile">include/linux/uaccess.h</filename> - </title> - - <para> - <emphasis>[SLEEPS]</emphasis> - </para> - - <para> - <function>put_user()</function> and <function>get_user()</function> - are used to get and put single values (such as an int, char, or - long) from and to userspace. A pointer into userspace should - never be simply dereferenced: data should be copied using these - routines. Both return <constant>-EFAULT</constant> or 0. - </para> - <para> - <function>copy_to_user()</function> and - <function>copy_from_user()</function> are more general: they copy - an arbitrary amount of data to and from userspace. - <caution> - <para> - Unlike <function>put_user()</function> and - <function>get_user()</function>, they return the amount of - uncopied data (ie. <returnvalue>0</returnvalue> still means - success). - </para> - </caution> - [Yes, this moronic interface makes me cringe. The flamewar comes up every year or so. --RR.] - </para> - <para> - The functions may sleep implicitly. This should never be called - outside user context (it makes no sense), with interrupts - disabled, or a spinlock held. - </para> - </sect1> - - <sect1 id="routines-kmalloc"> - <title><function>kmalloc()</function>/<function>kfree()</function> - <filename class="headerfile">include/linux/slab.h</filename></title> - - <para> - <emphasis>[MAY SLEEP: SEE BELOW]</emphasis> - </para> - - <para> - These routines are used to dynamically request pointer-aligned - chunks of memory, like malloc and free do in userspace, but - <function>kmalloc()</function> takes an extra flag word. - Important values: - </para> - - <variablelist> - <varlistentry> - <term> - <constant> - GFP_KERNEL - </constant> - </term> - <listitem> - <para> - May sleep and swap to free memory. Only allowed in user - context, but is the most reliable way to allocate memory. - </para> - </listitem> - </varlistentry> - - <varlistentry> - <term> - <constant> - GFP_ATOMIC - </constant> - </term> - <listitem> - <para> - Don't sleep. Less reliable than <constant>GFP_KERNEL</constant>, - but may be called from interrupt context. You should - <emphasis>really</emphasis> have a good out-of-memory - error-handling strategy. - </para> - </listitem> - </varlistentry> - - <varlistentry> - <term> - <constant> - GFP_DMA - </constant> - </term> - <listitem> - <para> - Allocate ISA DMA lower than 16MB. If you don't know what that - is you don't need it. Very unreliable. - </para> - </listitem> - </varlistentry> - </variablelist> - - <para> - If you see a <errorname>sleeping function called from invalid - context</errorname> warning message, then maybe you called a - sleeping allocation function from interrupt context without - <constant>GFP_ATOMIC</constant>. You should really fix that. - Run, don't walk. - </para> - - <para> - If you are allocating at least <constant>PAGE_SIZE</constant> - (<filename class="headerfile">include/asm/page.h</filename>) bytes, - consider using <function>__get_free_pages()</function> - - (<filename class="headerfile">include/linux/mm.h</filename>). It - takes an order argument (0 for page sized, 1 for double page, 2 - for four pages etc.) and the same memory priority flag word as - above. - </para> - - <para> - If you are allocating more than a page worth of bytes you can use - <function>vmalloc()</function>. It'll allocate virtual memory in - the kernel map. This block is not contiguous in physical memory, - but the <acronym>MMU</acronym> makes it look like it is for you - (so it'll only look contiguous to the CPUs, not to external device - drivers). If you really need large physically contiguous memory - for some weird device, you have a problem: it is poorly supported - in Linux because after some time memory fragmentation in a running - kernel makes it hard. The best way is to allocate the block early - in the boot process via the <function>alloc_bootmem()</function> - routine. - </para> - - <para> - Before inventing your own cache of often-used objects consider - using a slab cache in - <filename class="headerfile">include/linux/slab.h</filename> - </para> - </sect1> - - <sect1 id="routines-current"> - <title><function>current</function> - <filename class="headerfile">include/asm/current.h</filename></title> - - <para> - This global variable (really a macro) contains a pointer to - the current task structure, so is only valid in user context. - For example, when a process makes a system call, this will - point to the task structure of the calling process. It is - <emphasis>not NULL</emphasis> in interrupt context. - </para> - </sect1> - - <sect1 id="routines-udelay"> - <title><function>mdelay()</function>/<function>udelay()</function> - <filename class="headerfile">include/asm/delay.h</filename> - <filename class="headerfile">include/linux/delay.h</filename> - </title> - - <para> - The <function>udelay()</function> and <function>ndelay()</function> functions can be used for small pauses. - Do not use large values with them as you risk - overflow - the helper function <function>mdelay()</function> is useful - here, or consider <function>msleep()</function>. - </para> - </sect1> - - <sect1 id="routines-endian"> - <title><function>cpu_to_be32()</function>/<function>be32_to_cpu()</function>/<function>cpu_to_le32()</function>/<function>le32_to_cpu()</function> - <filename class="headerfile">include/asm/byteorder.h</filename> - </title> - - <para> - The <function>cpu_to_be32()</function> family (where the "32" can - be replaced by 64 or 16, and the "be" can be replaced by "le") are - the general way to do endian conversions in the kernel: they - return the converted value. All variations supply the reverse as - well: <function>be32_to_cpu()</function>, etc. - </para> - - <para> - There are two major variations of these functions: the pointer - variation, such as <function>cpu_to_be32p()</function>, which take - a pointer to the given type, and return the converted value. The - other variation is the "in-situ" family, such as - <function>cpu_to_be32s()</function>, which convert value referred - to by the pointer, and return void. - </para> - </sect1> - - <sect1 id="routines-local-irqs"> - <title><function>local_irq_save()</function>/<function>local_irq_restore()</function> - <filename class="headerfile">include/linux/irqflags.h</filename> - </title> - - <para> - These routines disable hard interrupts on the local CPU, and - restore them. They are reentrant; saving the previous state in - their one <varname>unsigned long flags</varname> argument. If you - know that interrupts are enabled, you can simply use - <function>local_irq_disable()</function> and - <function>local_irq_enable()</function>. - </para> - </sect1> - - <sect1 id="routines-softirqs"> - <title><function>local_bh_disable()</function>/<function>local_bh_enable()</function> - <filename class="headerfile">include/linux/interrupt.h</filename></title> - - <para> - These routines disable soft interrupts on the local CPU, and - restore them. They are reentrant; if soft interrupts were - disabled before, they will still be disabled after this pair - of functions has been called. They prevent softirqs and tasklets - from running on the current CPU. - </para> - </sect1> - - <sect1 id="routines-processorids"> - <title><function>smp_processor_id</function>() - <filename class="headerfile">include/asm/smp.h</filename></title> - - <para> - <function>get_cpu()</function> disables preemption (so you won't - suddenly get moved to another CPU) and returns the current - processor number, between 0 and <symbol>NR_CPUS</symbol>. Note - that the CPU numbers are not necessarily continuous. You return - it again with <function>put_cpu()</function> when you are done. - </para> - <para> - If you know you cannot be preempted by another task (ie. you are - in interrupt context, or have preemption disabled) you can use - smp_processor_id(). - </para> - </sect1> - - <sect1 id="routines-init"> - <title><type>__init</type>/<type>__exit</type>/<type>__initdata</type> - <filename class="headerfile">include/linux/init.h</filename></title> - - <para> - After boot, the kernel frees up a special section; functions - marked with <type>__init</type> and data structures marked with - <type>__initdata</type> are dropped after boot is complete: similarly - modules discard this memory after initialization. <type>__exit</type> - is used to declare a function which is only required on exit: the - function will be dropped if this file is not compiled as a module. - See the header file for use. Note that it makes no sense for a function - marked with <type>__init</type> to be exported to modules with - <function>EXPORT_SYMBOL()</function> - this will break. - </para> - - </sect1> - - <sect1 id="routines-init-again"> - <title><function>__initcall()</function>/<function>module_init()</function> - <filename class="headerfile">include/linux/init.h</filename></title> - <para> - Many parts of the kernel are well served as a module - (dynamically-loadable parts of the kernel). Using the - <function>module_init()</function> and - <function>module_exit()</function> macros it is easy to write code - without #ifdefs which can operate both as a module or built into - the kernel. - </para> - - <para> - The <function>module_init()</function> macro defines which - function is to be called at module insertion time (if the file is - compiled as a module), or at boot time: if the file is not - compiled as a module the <function>module_init()</function> macro - becomes equivalent to <function>__initcall()</function>, which - through linker magic ensures that the function is called on boot. - </para> - - <para> - The function can return a negative error number to cause - module loading to fail (unfortunately, this has no effect if - the module is compiled into the kernel). This function is - called in user context with interrupts enabled, so it can sleep. - </para> - </sect1> - - <sect1 id="routines-moduleexit"> - <title> <function>module_exit()</function> - <filename class="headerfile">include/linux/init.h</filename> </title> - - <para> - This macro defines the function to be called at module removal - time (or never, in the case of the file compiled into the - kernel). It will only be called if the module usage count has - reached zero. This function can also sleep, but cannot fail: - everything must be cleaned up by the time it returns. - </para> - - <para> - Note that this macro is optional: if it is not present, your - module will not be removable (except for 'rmmod -f'). - </para> - </sect1> - - <sect1 id="routines-module-use-counters"> - <title> <function>try_module_get()</function>/<function>module_put()</function> - <filename class="headerfile">include/linux/module.h</filename></title> - - <para> - These manipulate the module usage count, to protect against - removal (a module also can't be removed if another module uses one - of its exported symbols: see below). Before calling into module - code, you should call <function>try_module_get()</function> on - that module: if it fails, then the module is being removed and you - should act as if it wasn't there. Otherwise, you can safely enter - the module, and call <function>module_put()</function> when you're - finished. - </para> - - <para> - Most registerable structures have an - <structfield>owner</structfield> field, such as in the - <structname>file_operations</structname> structure. Set this field - to the macro <symbol>THIS_MODULE</symbol>. - </para> - </sect1> - - <!-- add info on new-style module refcounting here --> - </chapter> - - <chapter id="queues"> - <title>Wait Queues - <filename class="headerfile">include/linux/wait.h</filename> - </title> - <para> - <emphasis>[SLEEPS]</emphasis> - </para> - - <para> - A wait queue is used to wait for someone to wake you up when a - certain condition is true. They must be used carefully to ensure - there is no race condition. You declare a - <type>wait_queue_head_t</type>, and then processes which want to - wait for that condition declare a <type>wait_queue_t</type> - referring to themselves, and place that in the queue. - </para> - - <sect1 id="queue-declaring"> - <title>Declaring</title> - - <para> - You declare a <type>wait_queue_head_t</type> using the - <function>DECLARE_WAIT_QUEUE_HEAD()</function> macro, or using the - <function>init_waitqueue_head()</function> routine in your - initialization code. - </para> - </sect1> - - <sect1 id="queue-waitqueue"> - <title>Queuing</title> - - <para> - Placing yourself in the waitqueue is fairly complex, because you - must put yourself in the queue before checking the condition. - There is a macro to do this: - <function>wait_event_interruptible()</function> - - <filename class="headerfile">include/linux/wait.h</filename> The - first argument is the wait queue head, and the second is an - expression which is evaluated; the macro returns - <returnvalue>0</returnvalue> when this expression is true, or - <returnvalue>-ERESTARTSYS</returnvalue> if a signal is received. - The <function>wait_event()</function> version ignores signals. - </para> - - </sect1> - - <sect1 id="queue-waking"> - <title>Waking Up Queued Tasks</title> - - <para> - Call <function>wake_up()</function> - - <filename class="headerfile">include/linux/wait.h</filename>;, - which will wake up every process in the queue. The exception is - if one has <constant>TASK_EXCLUSIVE</constant> set, in which case - the remainder of the queue will not be woken. There are other variants - of this basic function available in the same header. - </para> - </sect1> - </chapter> - - <chapter id="atomic-ops"> - <title>Atomic Operations</title> - - <para> - Certain operations are guaranteed atomic on all platforms. The - first class of operations work on <type>atomic_t</type> - - <filename class="headerfile">include/asm/atomic.h</filename>; this - contains a signed integer (at least 32 bits long), and you must use - these functions to manipulate or read atomic_t variables. - <function>atomic_read()</function> and - <function>atomic_set()</function> get and set the counter, - <function>atomic_add()</function>, - <function>atomic_sub()</function>, - <function>atomic_inc()</function>, - <function>atomic_dec()</function>, and - <function>atomic_dec_and_test()</function> (returns - <returnvalue>true</returnvalue> if it was decremented to zero). - </para> - - <para> - Yes. It returns <returnvalue>true</returnvalue> (i.e. != 0) if the - atomic variable is zero. - </para> - - <para> - Note that these functions are slower than normal arithmetic, and - so should not be used unnecessarily. - </para> - - <para> - The second class of atomic operations is atomic bit operations on an - <type>unsigned long</type>, defined in - - <filename class="headerfile">include/linux/bitops.h</filename>. These - operations generally take a pointer to the bit pattern, and a bit - number: 0 is the least significant bit. - <function>set_bit()</function>, <function>clear_bit()</function> - and <function>change_bit()</function> set, clear, and flip the - given bit. <function>test_and_set_bit()</function>, - <function>test_and_clear_bit()</function> and - <function>test_and_change_bit()</function> do the same thing, - except return true if the bit was previously set; these are - particularly useful for atomically setting flags. - </para> - - <para> - It is possible to call these operations with bit indices greater - than BITS_PER_LONG. The resulting behavior is strange on big-endian - platforms though so it is a good idea not to do this. - </para> - </chapter> - - <chapter id="symbols"> - <title>Symbols</title> - - <para> - Within the kernel proper, the normal linking rules apply - (ie. unless a symbol is declared to be file scope with the - <type>static</type> keyword, it can be used anywhere in the - kernel). However, for modules, a special exported symbol table is - kept which limits the entry points to the kernel proper. Modules - can also export symbols. - </para> - - <sect1 id="sym-exportsymbols"> - <title><function>EXPORT_SYMBOL()</function> - <filename class="headerfile">include/linux/export.h</filename></title> - - <para> - This is the classic method of exporting a symbol: dynamically - loaded modules will be able to use the symbol as normal. - </para> - </sect1> - - <sect1 id="sym-exportsymbols-gpl"> - <title><function>EXPORT_SYMBOL_GPL()</function> - <filename class="headerfile">include/linux/export.h</filename></title> - - <para> - Similar to <function>EXPORT_SYMBOL()</function> except that the - symbols exported by <function>EXPORT_SYMBOL_GPL()</function> can - only be seen by modules with a - <function>MODULE_LICENSE()</function> that specifies a GPL - compatible license. It implies that the function is considered - an internal implementation issue, and not really an interface. - Some maintainers and developers may however - require EXPORT_SYMBOL_GPL() when adding any new APIs or functionality. - </para> - </sect1> - </chapter> - - <chapter id="conventions"> - <title>Routines and Conventions</title> - - <sect1 id="conventions-doublelinkedlist"> - <title>Double-linked lists - <filename class="headerfile">include/linux/list.h</filename></title> - - <para> - There used to be three sets of linked-list routines in the kernel - headers, but this one is the winner. If you don't have some - particular pressing need for a single list, it's a good choice. - </para> - - <para> - In particular, <function>list_for_each_entry</function> is useful. - </para> - </sect1> - - <sect1 id="convention-returns"> - <title>Return Conventions</title> - - <para> - For code called in user context, it's very common to defy C - convention, and return <returnvalue>0</returnvalue> for success, - and a negative error number - (eg. <returnvalue>-EFAULT</returnvalue>) for failure. This can be - unintuitive at first, but it's fairly widespread in the kernel. - </para> - - <para> - Using <function>ERR_PTR()</function> - - <filename class="headerfile">include/linux/err.h</filename>; to - encode a negative error number into a pointer, and - <function>IS_ERR()</function> and <function>PTR_ERR()</function> - to get it back out again: avoids a separate pointer parameter for - the error number. Icky, but in a good way. - </para> - </sect1> - - <sect1 id="conventions-borkedcompile"> - <title>Breaking Compilation</title> - - <para> - Linus and the other developers sometimes change function or - structure names in development kernels; this is not done just to - keep everyone on their toes: it reflects a fundamental change - (eg. can no longer be called with interrupts on, or does extra - checks, or doesn't do checks which were caught before). Usually - this is accompanied by a fairly complete note to the linux-kernel - mailing list; search the archive. Simply doing a global replace - on the file usually makes things <emphasis>worse</emphasis>. - </para> - </sect1> - - <sect1 id="conventions-initialising"> - <title>Initializing structure members</title> - - <para> - The preferred method of initializing structures is to use - designated initialisers, as defined by ISO C99, eg: - </para> - <programlisting> -static struct block_device_operations opt_fops = { - .open = opt_open, - .release = opt_release, - .ioctl = opt_ioctl, - .check_media_change = opt_media_change, -}; - </programlisting> - <para> - This makes it easy to grep for, and makes it clear which - structure fields are set. You should do this because it looks - cool. - </para> - </sect1> - - <sect1 id="conventions-gnu-extns"> - <title>GNU Extensions</title> - - <para> - GNU Extensions are explicitly allowed in the Linux kernel. - Note that some of the more complex ones are not very well - supported, due to lack of general use, but the following are - considered standard (see the GCC info page section "C - Extensions" for more details - Yes, really the info page, the - man page is only a short summary of the stuff in info). - </para> - <itemizedlist> - <listitem> - <para> - Inline functions - </para> - </listitem> - <listitem> - <para> - Statement expressions (ie. the ({ and }) constructs). - </para> - </listitem> - <listitem> - <para> - Declaring attributes of a function / variable / type - (__attribute__) - </para> - </listitem> - <listitem> - <para> - typeof - </para> - </listitem> - <listitem> - <para> - Zero length arrays - </para> - </listitem> - <listitem> - <para> - Macro varargs - </para> - </listitem> - <listitem> - <para> - Arithmetic on void pointers - </para> - </listitem> - <listitem> - <para> - Non-Constant initializers - </para> - </listitem> - <listitem> - <para> - Assembler Instructions (not outside arch/ and include/asm/) - </para> - </listitem> - <listitem> - <para> - Function names as strings (__func__). - </para> - </listitem> - <listitem> - <para> - __builtin_constant_p() - </para> - </listitem> - </itemizedlist> - - <para> - Be wary when using long long in the kernel, the code gcc generates for - it is horrible and worse: division and multiplication does not work - on i386 because the GCC runtime functions for it are missing from - the kernel environment. - </para> - - <!-- FIXME: add a note about ANSI aliasing cleanness --> - </sect1> - - <sect1 id="conventions-cplusplus"> - <title>C++</title> - - <para> - Using C++ in the kernel is usually a bad idea, because the - kernel does not provide the necessary runtime environment - and the include files are not tested for it. It is still - possible, but not recommended. If you really want to do - this, forget about exceptions at least. - </para> - </sect1> - - <sect1 id="conventions-ifdef"> - <title>#if</title> - - <para> - It is generally considered cleaner to use macros in header files - (or at the top of .c files) to abstract away functions rather than - using `#if' pre-processor statements throughout the source code. - </para> - </sect1> - </chapter> - - <chapter id="submitting"> - <title>Putting Your Stuff in the Kernel</title> - - <para> - In order to get your stuff into shape for official inclusion, or - even to make a neat patch, there's administrative work to be - done: - </para> - <itemizedlist> - <listitem> - <para> - Figure out whose pond you've been pissing in. Look at the top of - the source files, inside the <filename>MAINTAINERS</filename> - file, and last of all in the <filename>CREDITS</filename> file. - You should coordinate with this person to make sure you're not - duplicating effort, or trying something that's already been - rejected. - </para> - - <para> - Make sure you put your name and EMail address at the top of - any files you create or mangle significantly. This is the - first place people will look when they find a bug, or when - <emphasis>they</emphasis> want to make a change. - </para> - </listitem> - - <listitem> - <para> - Usually you want a configuration option for your kernel hack. - Edit <filename>Kconfig</filename> in the appropriate directory. - The Config language is simple to use by cut and paste, and there's - complete documentation in - <filename>Documentation/kbuild/kconfig-language.txt</filename>. - </para> - - <para> - In your description of the option, make sure you address both the - expert user and the user who knows nothing about your feature. Mention - incompatibilities and issues here. <emphasis> Definitely - </emphasis> end your description with <quote> if in doubt, say N - </quote> (or, occasionally, `Y'); this is for people who have no - idea what you are talking about. - </para> - </listitem> - - <listitem> - <para> - Edit the <filename>Makefile</filename>: the CONFIG variables are - exported here so you can usually just add a "obj-$(CONFIG_xxx) += - xxx.o" line. The syntax is documented in - <filename>Documentation/kbuild/makefiles.txt</filename>. - </para> - </listitem> - - <listitem> - <para> - Put yourself in <filename>CREDITS</filename> if you've done - something noteworthy, usually beyond a single file (your name - should be at the top of the source files anyway). - <filename>MAINTAINERS</filename> means you want to be consulted - when changes are made to a subsystem, and hear about bugs; it - implies a more-than-passing commitment to some part of the code. - </para> - </listitem> - - <listitem> - <para> - Finally, don't forget to read <filename>Documentation/process/submitting-patches.rst</filename> - and possibly <filename>Documentation/process/submitting-drivers.rst</filename>. - </para> - </listitem> - </itemizedlist> - </chapter> - - <chapter id="cantrips"> - <title>Kernel Cantrips</title> - - <para> - Some favorites from browsing the source. Feel free to add to this - list. - </para> - - <para> - <filename>arch/x86/include/asm/delay.h:</filename> - </para> - <programlisting> -#define ndelay(n) (__builtin_constant_p(n) ? \ - ((n) > 20000 ? __bad_ndelay() : __const_udelay((n) * 5ul)) : \ - __ndelay(n)) - </programlisting> - - <para> - <filename>include/linux/fs.h</filename>: - </para> - <programlisting> -/* - * Kernel pointers have redundant information, so we can use a - * scheme where we can return either an error code or a dentry - * pointer with the same return value. - * - * This should be a per-architecture thing, to allow different - * error and pointer decisions. - */ - #define ERR_PTR(err) ((void *)((long)(err))) - #define PTR_ERR(ptr) ((long)(ptr)) - #define IS_ERR(ptr) ((unsigned long)(ptr) > (unsigned long)(-1000)) -</programlisting> - - <para> - <filename>arch/x86/include/asm/uaccess_32.h:</filename> - </para> - - <programlisting> -#define copy_to_user(to,from,n) \ - (__builtin_constant_p(n) ? \ - __constant_copy_to_user((to),(from),(n)) : \ - __generic_copy_to_user((to),(from),(n))) - </programlisting> - - <para> - <filename>arch/sparc/kernel/head.S:</filename> - </para> - - <programlisting> -/* - * Sun people can't spell worth damn. "compatability" indeed. - * At least we *know* we can't spell, and use a spell-checker. - */ - -/* Uh, actually Linus it is I who cannot spell. Too much murky - * Sparc assembly will do this to ya. - */ -C_LABEL(cputypvar): - .asciz "compatibility" - -/* Tested on SS-5, SS-10. Probably someone at Sun applied a spell-checker. */ - .align 4 -C_LABEL(cputypvar_sun4m): - .asciz "compatible" - </programlisting> - - <para> - <filename>arch/sparc/lib/checksum.S:</filename> - </para> - - <programlisting> - /* Sun, you just can't beat me, you just can't. Stop trying, - * give up. I'm serious, I am going to kick the living shit - * out of you, game over, lights out. - */ - </programlisting> - </chapter> - - <chapter id="credits"> - <title>Thanks</title> - - <para> - Thanks to Andi Kleen for the idea, answering my questions, fixing - my mistakes, filling content, etc. Philipp Rumpf for more spelling - and clarity fixes, and some excellent non-obvious points. Werner - Almesberger for giving me a great summary of - <function>disable_irq()</function>, and Jes Sorensen and Andrea - Arcangeli added caveats. Michael Elizabeth Chastain for checking - and adding to the Configure section. <!-- Rusty insisted on this - bit; I didn't do it! --> Telsa Gwynne for teaching me DocBook. - </para> - </chapter> -</book> - diff --git a/Documentation/DocBook/kernel-locking.tmpl b/Documentation/DocBook/kernel-locking.tmpl deleted file mode 100644 index 7c9cc4846cb6..000000000000 --- a/Documentation/DocBook/kernel-locking.tmpl +++ /dev/null @@ -1,2151 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" - "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> - -<book id="LKLockingGuide"> - <bookinfo> - <title>Unreliable Guide To Locking</title> - - <authorgroup> - <author> - <firstname>Rusty</firstname> - <surname>Russell</surname> - <affiliation> - <address> - <email>rusty@rustcorp.com.au</email> - </address> - </affiliation> - </author> - </authorgroup> - - <copyright> - <year>2003</year> - <holder>Rusty Russell</holder> - </copyright> - - <legalnotice> - <para> - This documentation is free software; you can redistribute - it and/or modify it under the terms of the GNU General Public - License as published by the Free Software Foundation; either - version 2 of the License, or (at your option) any later - version. - </para> - - <para> - This program is distributed in the hope that it will be - useful, but WITHOUT ANY WARRANTY; without even the implied - warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. - See the GNU General Public License for more details. - </para> - - <para> - You should have received a copy of the GNU General Public - License along with this program; if not, write to the Free - Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, - MA 02111-1307 USA - </para> - - <para> - For more details see the file COPYING in the source - distribution of Linux. - </para> - </legalnotice> - </bookinfo> - - <toc></toc> - <chapter id="intro"> - <title>Introduction</title> - <para> - Welcome, to Rusty's Remarkably Unreliable Guide to Kernel - Locking issues. This document describes the locking systems in - the Linux Kernel in 2.6. - </para> - <para> - With the wide availability of HyperThreading, and <firstterm - linkend="gloss-preemption">preemption </firstterm> in the Linux - Kernel, everyone hacking on the kernel needs to know the - fundamentals of concurrency and locking for - <firstterm linkend="gloss-smp"><acronym>SMP</acronym></firstterm>. - </para> - </chapter> - - <chapter id="races"> - <title>The Problem With Concurrency</title> - <para> - (Skip this if you know what a Race Condition is). - </para> - <para> - In a normal program, you can increment a counter like so: - </para> - <programlisting> - very_important_count++; - </programlisting> - - <para> - This is what they would expect to happen: - </para> - - <table> - <title>Expected Results</title> - - <tgroup cols="2" align="left"> - - <thead> - <row> - <entry>Instance 1</entry> - <entry>Instance 2</entry> - </row> - </thead> - - <tbody> - <row> - <entry>read very_important_count (5)</entry> - <entry></entry> - </row> - <row> - <entry>add 1 (6)</entry> - <entry></entry> - </row> - <row> - <entry>write very_important_count (6)</entry> - <entry></entry> - </row> - <row> - <entry></entry> - <entry>read very_important_count (6)</entry> - </row> - <row> - <entry></entry> - <entry>add 1 (7)</entry> - </row> - <row> - <entry></entry> - <entry>write very_important_count (7)</entry> - </row> - </tbody> - - </tgroup> - </table> - - <para> - This is what might happen: - </para> - - <table> - <title>Possible Results</title> - - <tgroup cols="2" align="left"> - <thead> - <row> - <entry>Instance 1</entry> - <entry>Instance 2</entry> - </row> - </thead> - - <tbody> - <row> - <entry>read very_important_count (5)</entry> - <entry></entry> - </row> - <row> - <entry></entry> - <entry>read very_important_count (5)</entry> - </row> - <row> - <entry>add 1 (6)</entry> - <entry></entry> - </row> - <row> - <entry></entry> - <entry>add 1 (6)</entry> - </row> - <row> - <entry>write very_important_count (6)</entry> - <entry></entry> - </row> - <row> - <entry></entry> - <entry>write very_important_count (6)</entry> - </row> - </tbody> - </tgroup> - </table> - - <sect1 id="race-condition"> - <title>Race Conditions and Critical Regions</title> - <para> - This overlap, where the result depends on the - relative timing of multiple tasks, is called a <firstterm>race condition</firstterm>. - The piece of code containing the concurrency issue is called a - <firstterm>critical region</firstterm>. And especially since Linux starting running - on SMP machines, they became one of the major issues in kernel - design and implementation. - </para> - <para> - Preemption can have the same effect, even if there is only one - CPU: by preempting one task during the critical region, we have - exactly the same race condition. In this case the thread which - preempts might run the critical region itself. - </para> - <para> - The solution is to recognize when these simultaneous accesses - occur, and use locks to make sure that only one instance can - enter the critical region at any time. There are many - friendly primitives in the Linux kernel to help you do this. - And then there are the unfriendly primitives, but I'll pretend - they don't exist. - </para> - </sect1> - </chapter> - - <chapter id="locks"> - <title>Locking in the Linux Kernel</title> - - <para> - If I could give you one piece of advice: never sleep with anyone - crazier than yourself. But if I had to give you advice on - locking: <emphasis>keep it simple</emphasis>. - </para> - - <para> - Be reluctant to introduce new locks. - </para> - - <para> - Strangely enough, this last one is the exact reverse of my advice when - you <emphasis>have</emphasis> slept with someone crazier than yourself. - And you should think about getting a big dog. - </para> - - <sect1 id="lock-intro"> - <title>Two Main Types of Kernel Locks: Spinlocks and Mutexes</title> - - <para> - There are two main types of kernel locks. The fundamental type - is the spinlock - (<filename class="headerfile">include/asm/spinlock.h</filename>), - which is a very simple single-holder lock: if you can't get the - spinlock, you keep trying (spinning) until you can. Spinlocks are - very small and fast, and can be used anywhere. - </para> - <para> - The second type is a mutex - (<filename class="headerfile">include/linux/mutex.h</filename>): it - is like a spinlock, but you may block holding a mutex. - If you can't lock a mutex, your task will suspend itself, and be woken - up when the mutex is released. This means the CPU can do something - else while you are waiting. There are many cases when you simply - can't sleep (see <xref linkend="sleeping-things"/>), and so have to - use a spinlock instead. - </para> - <para> - Neither type of lock is recursive: see - <xref linkend="deadlock"/>. - </para> - </sect1> - - <sect1 id="uniprocessor"> - <title>Locks and Uniprocessor Kernels</title> - - <para> - For kernels compiled without <symbol>CONFIG_SMP</symbol>, and - without <symbol>CONFIG_PREEMPT</symbol> spinlocks do not exist at - all. This is an excellent design decision: when no-one else can - run at the same time, there is no reason to have a lock. - </para> - - <para> - If the kernel is compiled without <symbol>CONFIG_SMP</symbol>, - but <symbol>CONFIG_PREEMPT</symbol> is set, then spinlocks - simply disable preemption, which is sufficient to prevent any - races. For most purposes, we can think of preemption as - equivalent to SMP, and not worry about it separately. - </para> - - <para> - You should always test your locking code with <symbol>CONFIG_SMP</symbol> - and <symbol>CONFIG_PREEMPT</symbol> enabled, even if you don't have an SMP test box, because it - will still catch some kinds of locking bugs. - </para> - - <para> - Mutexes still exist, because they are required for - synchronization between <firstterm linkend="gloss-usercontext">user - contexts</firstterm>, as we will see below. - </para> - </sect1> - - <sect1 id="usercontextlocking"> - <title>Locking Only In User Context</title> - - <para> - If you have a data structure which is only ever accessed from - user context, then you can use a simple mutex - (<filename>include/linux/mutex.h</filename>) to protect it. This - is the most trivial case: you initialize the mutex. Then you can - call <function>mutex_lock_interruptible()</function> to grab the mutex, - and <function>mutex_unlock()</function> to release it. There is also a - <function>mutex_lock()</function>, which should be avoided, because it - will not return if a signal is received. - </para> - - <para> - Example: <filename>net/netfilter/nf_sockopt.c</filename> allows - registration of new <function>setsockopt()</function> and - <function>getsockopt()</function> calls, with - <function>nf_register_sockopt()</function>. Registration and - de-registration are only done on module load and unload (and boot - time, where there is no concurrency), and the list of registrations - is only consulted for an unknown <function>setsockopt()</function> - or <function>getsockopt()</function> system call. The - <varname>nf_sockopt_mutex</varname> is perfect to protect this, - especially since the setsockopt and getsockopt calls may well - sleep. - </para> - </sect1> - - <sect1 id="lock-user-bh"> - <title>Locking Between User Context and Softirqs</title> - - <para> - If a <firstterm linkend="gloss-softirq">softirq</firstterm> shares - data with user context, you have two problems. Firstly, the current - user context can be interrupted by a softirq, and secondly, the - critical region could be entered from another CPU. This is where - <function>spin_lock_bh()</function> - (<filename class="headerfile">include/linux/spinlock.h</filename>) is - used. It disables softirqs on that CPU, then grabs the lock. - <function>spin_unlock_bh()</function> does the reverse. (The - '_bh' suffix is a historical reference to "Bottom Halves", the - old name for software interrupts. It should really be - called spin_lock_softirq()' in a perfect world). - </para> - - <para> - Note that you can also use <function>spin_lock_irq()</function> - or <function>spin_lock_irqsave()</function> here, which stop - hardware interrupts as well: see <xref linkend="hardirq-context"/>. - </para> - - <para> - This works perfectly for <firstterm linkend="gloss-up"><acronym>UP - </acronym></firstterm> as well: the spin lock vanishes, and this macro - simply becomes <function>local_bh_disable()</function> - (<filename class="headerfile">include/linux/interrupt.h</filename>), which - protects you from the softirq being run. - </para> - </sect1> - - <sect1 id="lock-user-tasklet"> - <title>Locking Between User Context and Tasklets</title> - - <para> - This is exactly the same as above, because <firstterm - linkend="gloss-tasklet">tasklets</firstterm> are actually run - from a softirq. - </para> - </sect1> - - <sect1 id="lock-user-timers"> - <title>Locking Between User Context and Timers</title> - - <para> - This, too, is exactly the same as above, because <firstterm - linkend="gloss-timers">timers</firstterm> are actually run from - a softirq. From a locking point of view, tasklets and timers - are identical. - </para> - </sect1> - - <sect1 id="lock-tasklets"> - <title>Locking Between Tasklets/Timers</title> - - <para> - Sometimes a tasklet or timer might want to share data with - another tasklet or timer. - </para> - - <sect2 id="lock-tasklets-same"> - <title>The Same Tasklet/Timer</title> - <para> - Since a tasklet is never run on two CPUs at once, you don't - need to worry about your tasklet being reentrant (running - twice at once), even on SMP. - </para> - </sect2> - - <sect2 id="lock-tasklets-different"> - <title>Different Tasklets/Timers</title> - <para> - If another tasklet/timer wants - to share data with your tasklet or timer , you will both need to use - <function>spin_lock()</function> and - <function>spin_unlock()</function> calls. - <function>spin_lock_bh()</function> is - unnecessary here, as you are already in a tasklet, and - none will be run on the same CPU. - </para> - </sect2> - </sect1> - - <sect1 id="lock-softirqs"> - <title>Locking Between Softirqs</title> - - <para> - Often a softirq might - want to share data with itself or a tasklet/timer. - </para> - - <sect2 id="lock-softirqs-same"> - <title>The Same Softirq</title> - - <para> - The same softirq can run on the other CPUs: you can use a - per-CPU array (see <xref linkend="per-cpu"/>) for better - performance. If you're going so far as to use a softirq, - you probably care about scalable performance enough - to justify the extra complexity. - </para> - - <para> - You'll need to use <function>spin_lock()</function> and - <function>spin_unlock()</function> for shared data. - </para> - </sect2> - - <sect2 id="lock-softirqs-different"> - <title>Different Softirqs</title> - - <para> - You'll need to use <function>spin_lock()</function> and - <function>spin_unlock()</function> for shared data, whether it - be a timer, tasklet, different softirq or the same or another - softirq: any of them could be running on a different CPU. - </para> - </sect2> - </sect1> - </chapter> - - <chapter id="hardirq-context"> - <title>Hard IRQ Context</title> - - <para> - Hardware interrupts usually communicate with a - tasklet or softirq. Frequently this involves putting work in a - queue, which the softirq will take out. - </para> - - <sect1 id="hardirq-softirq"> - <title>Locking Between Hard IRQ and Softirqs/Tasklets</title> - - <para> - If a hardware irq handler shares data with a softirq, you have - two concerns. Firstly, the softirq processing can be - interrupted by a hardware interrupt, and secondly, the - critical region could be entered by a hardware interrupt on - another CPU. This is where <function>spin_lock_irq()</function> is - used. It is defined to disable interrupts on that cpu, then grab - the lock. <function>spin_unlock_irq()</function> does the reverse. - </para> - - <para> - The irq handler does not to use - <function>spin_lock_irq()</function>, because the softirq cannot - run while the irq handler is running: it can use - <function>spin_lock()</function>, which is slightly faster. The - only exception would be if a different hardware irq handler uses - the same lock: <function>spin_lock_irq()</function> will stop - that from interrupting us. - </para> - - <para> - This works perfectly for UP as well: the spin lock vanishes, - and this macro simply becomes <function>local_irq_disable()</function> - (<filename class="headerfile">include/asm/smp.h</filename>), which - protects you from the softirq/tasklet/BH being run. - </para> - - <para> - <function>spin_lock_irqsave()</function> - (<filename>include/linux/spinlock.h</filename>) is a variant - which saves whether interrupts were on or off in a flags word, - which is passed to <function>spin_unlock_irqrestore()</function>. This - means that the same code can be used inside an hard irq handler (where - interrupts are already off) and in softirqs (where the irq - disabling is required). - </para> - - <para> - Note that softirqs (and hence tasklets and timers) are run on - return from hardware interrupts, so - <function>spin_lock_irq()</function> also stops these. In that - sense, <function>spin_lock_irqsave()</function> is the most - general and powerful locking function. - </para> - - </sect1> - <sect1 id="hardirq-hardirq"> - <title>Locking Between Two Hard IRQ Handlers</title> - <para> - It is rare to have to share data between two IRQ handlers, but - if you do, <function>spin_lock_irqsave()</function> should be - used: it is architecture-specific whether all interrupts are - disabled inside irq handlers themselves. - </para> - </sect1> - - </chapter> - - <chapter id="cheatsheet"> - <title>Cheat Sheet For Locking</title> - <para> - Pete Zaitcev gives the following summary: - </para> - <itemizedlist> - <listitem> - <para> - If you are in a process context (any syscall) and want to - lock other process out, use a mutex. You can take a mutex - and sleep (<function>copy_from_user*(</function> or - <function>kmalloc(x,GFP_KERNEL)</function>). - </para> - </listitem> - <listitem> - <para> - Otherwise (== data can be touched in an interrupt), use - <function>spin_lock_irqsave()</function> and - <function>spin_unlock_irqrestore()</function>. - </para> - </listitem> - <listitem> - <para> - Avoid holding spinlock for more than 5 lines of code and - across any function call (except accessors like - <function>readb</function>). - </para> - </listitem> - </itemizedlist> - - <sect1 id="minimum-lock-reqirements"> - <title>Table of Minimum Requirements</title> - - <para> The following table lists the <emphasis>minimum</emphasis> - locking requirements between various contexts. In some cases, - the same context can only be running on one CPU at a time, so - no locking is required for that context (eg. a particular - thread can only run on one CPU at a time, but if it needs - shares data with another thread, locking is required). - </para> - <para> - Remember the advice above: you can always use - <function>spin_lock_irqsave()</function>, which is a superset - of all other spinlock primitives. - </para> - - <table> -<title>Table of Locking Requirements</title> -<tgroup cols="11"> -<tbody> - -<row> -<entry></entry> -<entry>IRQ Handler A</entry> -<entry>IRQ Handler B</entry> -<entry>Softirq A</entry> -<entry>Softirq B</entry> -<entry>Tasklet A</entry> -<entry>Tasklet B</entry> -<entry>Timer A</entry> -<entry>Timer B</entry> -<entry>User Context A</entry> -<entry>User Context B</entry> -</row> - -<row> -<entry>IRQ Handler A</entry> -<entry>None</entry> -</row> - -<row> -<entry>IRQ Handler B</entry> -<entry>SLIS</entry> -<entry>None</entry> -</row> - -<row> -<entry>Softirq A</entry> -<entry>SLI</entry> -<entry>SLI</entry> -<entry>SL</entry> -</row> - -<row> -<entry>Softirq B</entry> -<entry>SLI</entry> -<entry>SLI</entry> -<entry>SL</entry> -<entry>SL</entry> -</row> - -<row> -<entry>Tasklet A</entry> -<entry>SLI</entry> -<entry>SLI</entry> -<entry>SL</entry> -<entry>SL</entry> -<entry>None</entry> -</row> - -<row> -<entry>Tasklet B</entry> -<entry>SLI</entry> -<entry>SLI</entry> -<entry>SL</entry> -<entry>SL</entry> -<entry>SL</entry> -<entry>None</entry> -</row> - -<row> -<entry>Timer A</entry> -<entry>SLI</entry> -<entry>SLI</entry> -<entry>SL</entry> -<entry>SL</entry> -<entry>SL</entry> -<entry>SL</entry> -<entry>None</entry> -</row> - -<row> -<entry>Timer B</entry> -<entry>SLI</entry> -<entry>SLI</entry> -<entry>SL</entry> -<entry>SL</entry> -<entry>SL</entry> -<entry>SL</entry> -<entry>SL</entry> -<entry>None</entry> -</row> - -<row> -<entry>User Context A</entry> -<entry>SLI</entry> -<entry>SLI</entry> -<entry>SLBH</entry> -<entry>SLBH</entry> -<entry>SLBH</entry> -<entry>SLBH</entry> -<entry>SLBH</entry> -<entry>SLBH</entry> -<entry>None</entry> -</row> - -<row> -<entry>User Context B</entry> -<entry>SLI</entry> -<entry>SLI</entry> -<entry>SLBH</entry> -<entry>SLBH</entry> -<entry>SLBH</entry> -<entry>SLBH</entry> -<entry>SLBH</entry> -<entry>SLBH</entry> -<entry>MLI</entry> -<entry>None</entry> -</row> - -</tbody> -</tgroup> -</table> - - <table> -<title>Legend for Locking Requirements Table</title> -<tgroup cols="2"> -<tbody> - -<row> -<entry>SLIS</entry> -<entry>spin_lock_irqsave</entry> -</row> -<row> -<entry>SLI</entry> -<entry>spin_lock_irq</entry> -</row> -<row> -<entry>SL</entry> -<entry>spin_lock</entry> -</row> -<row> -<entry>SLBH</entry> -<entry>spin_lock_bh</entry> -</row> -<row> -<entry>MLI</entry> -<entry>mutex_lock_interruptible</entry> -</row> - -</tbody> -</tgroup> -</table> - -</sect1> -</chapter> - -<chapter id="trylock-functions"> - <title>The trylock Functions</title> - <para> - There are functions that try to acquire a lock only once and immediately - return a value telling about success or failure to acquire the lock. - They can be used if you need no access to the data protected with the lock - when some other thread is holding the lock. You should acquire the lock - later if you then need access to the data protected with the lock. - </para> - - <para> - <function>spin_trylock()</function> does not spin but returns non-zero if - it acquires the spinlock on the first try or 0 if not. This function can - be used in all contexts like <function>spin_lock</function>: you must have - disabled the contexts that might interrupt you and acquire the spin lock. - </para> - - <para> - <function>mutex_trylock()</function> does not suspend your task - but returns non-zero if it could lock the mutex on the first try - or 0 if not. This function cannot be safely used in hardware or software - interrupt contexts despite not sleeping. - </para> -</chapter> - - <chapter id="Examples"> - <title>Common Examples</title> - <para> -Let's step through a simple example: a cache of number to name -mappings. The cache keeps a count of how often each of the objects is -used, and when it gets full, throws out the least used one. - - </para> - - <sect1 id="examples-usercontext"> - <title>All In User Context</title> - <para> -For our first example, we assume that all operations are in user -context (ie. from system calls), so we can sleep. This means we can -use a mutex to protect the cache and all the objects within -it. Here's the code: - </para> - - <programlisting> -#include <linux/list.h> -#include <linux/slab.h> -#include <linux/string.h> -#include <linux/mutex.h> -#include <asm/errno.h> - -struct object -{ - struct list_head list; - int id; - char name[32]; - int popularity; -}; - -/* Protects the cache, cache_num, and the objects within it */ -static DEFINE_MUTEX(cache_lock); -static LIST_HEAD(cache); -static unsigned int cache_num = 0; -#define MAX_CACHE_SIZE 10 - -/* Must be holding cache_lock */ -static struct object *__cache_find(int id) -{ - struct object *i; - - list_for_each_entry(i, &cache, list) - if (i->id == id) { - i->popularity++; - return i; - } - return NULL; -} - -/* Must be holding cache_lock */ -static void __cache_delete(struct object *obj) -{ - BUG_ON(!obj); - list_del(&obj->list); - kfree(obj); - cache_num--; -} - -/* Must be holding cache_lock */ -static void __cache_add(struct object *obj) -{ - list_add(&obj->list, &cache); - if (++cache_num > MAX_CACHE_SIZE) { - struct object *i, *outcast = NULL; - list_for_each_entry(i, &cache, list) { - if (!outcast || i->popularity < outcast->popularity) - outcast = i; - } - __cache_delete(outcast); - } -} - -int cache_add(int id, const char *name) -{ - struct object *obj; - - if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL) - return -ENOMEM; - - strlcpy(obj->name, name, sizeof(obj->name)); - obj->id = id; - obj->popularity = 0; - - mutex_lock(&cache_lock); - __cache_add(obj); - mutex_unlock(&cache_lock); - return 0; -} - -void cache_delete(int id) -{ - mutex_lock(&cache_lock); - __cache_delete(__cache_find(id)); - mutex_unlock(&cache_lock); -} - -int cache_find(int id, char *name) -{ - struct object *obj; - int ret = -ENOENT; - - mutex_lock(&cache_lock); - obj = __cache_find(id); - if (obj) { - ret = 0; - strcpy(name, obj->name); - } - mutex_unlock(&cache_lock); - return ret; -} -</programlisting> - - <para> -Note that we always make sure we have the cache_lock when we add, -delete, or look up the cache: both the cache infrastructure itself and -the contents of the objects are protected by the lock. In this case -it's easy, since we copy the data for the user, and never let them -access the objects directly. - </para> - <para> -There is a slight (and common) optimization here: in -<function>cache_add</function> we set up the fields of the object -before grabbing the lock. This is safe, as no-one else can access it -until we put it in cache. - </para> - </sect1> - - <sect1 id="examples-interrupt"> - <title>Accessing From Interrupt Context</title> - <para> -Now consider the case where <function>cache_find</function> can be -called from interrupt context: either a hardware interrupt or a -softirq. An example would be a timer which deletes object from the -cache. - </para> - <para> -The change is shown below, in standard patch format: the -<symbol>-</symbol> are lines which are taken away, and the -<symbol>+</symbol> are lines which are added. - </para> -<programlisting> ---- cache.c.usercontext 2003-12-09 13:58:54.000000000 +1100 -+++ cache.c.interrupt 2003-12-09 14:07:49.000000000 +1100 -@@ -12,7 +12,7 @@ - int popularity; - }; - --static DEFINE_MUTEX(cache_lock); -+static DEFINE_SPINLOCK(cache_lock); - static LIST_HEAD(cache); - static unsigned int cache_num = 0; - #define MAX_CACHE_SIZE 10 -@@ -55,6 +55,7 @@ - int cache_add(int id, const char *name) - { - struct object *obj; -+ unsigned long flags; - - if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL) - return -ENOMEM; -@@ -63,30 +64,33 @@ - obj->id = id; - obj->popularity = 0; - -- mutex_lock(&cache_lock); -+ spin_lock_irqsave(&cache_lock, flags); - __cache_add(obj); -- mutex_unlock(&cache_lock); -+ spin_unlock_irqrestore(&cache_lock, flags); - return 0; - } - - void cache_delete(int id) - { -- mutex_lock(&cache_lock); -+ unsigned long flags; -+ -+ spin_lock_irqsave(&cache_lock, flags); - __cache_delete(__cache_find(id)); -- mutex_unlock(&cache_lock); -+ spin_unlock_irqrestore(&cache_lock, flags); - } - - int cache_find(int id, char *name) - { - struct object *obj; - int ret = -ENOENT; -+ unsigned long flags; - -- mutex_lock(&cache_lock); -+ spin_lock_irqsave(&cache_lock, flags); - obj = __cache_find(id); - if (obj) { - ret = 0; - strcpy(name, obj->name); - } -- mutex_unlock(&cache_lock); -+ spin_unlock_irqrestore(&cache_lock, flags); - return ret; - } -</programlisting> - - <para> -Note that the <function>spin_lock_irqsave</function> will turn off -interrupts if they are on, otherwise does nothing (if we are already -in an interrupt handler), hence these functions are safe to call from -any context. - </para> - <para> -Unfortunately, <function>cache_add</function> calls -<function>kmalloc</function> with the <symbol>GFP_KERNEL</symbol> -flag, which is only legal in user context. I have assumed that -<function>cache_add</function> is still only called in user context, -otherwise this should become a parameter to -<function>cache_add</function>. - </para> - </sect1> - <sect1 id="examples-refcnt"> - <title>Exposing Objects Outside This File</title> - <para> -If our objects contained more information, it might not be sufficient -to copy the information in and out: other parts of the code might want -to keep pointers to these objects, for example, rather than looking up -the id every time. This produces two problems. - </para> - <para> -The first problem is that we use the <symbol>cache_lock</symbol> to -protect objects: we'd need to make this non-static so the rest of the -code can use it. This makes locking trickier, as it is no longer all -in one place. - </para> - <para> -The second problem is the lifetime problem: if another structure keeps -a pointer to an object, it presumably expects that pointer to remain -valid. Unfortunately, this is only guaranteed while you hold the -lock, otherwise someone might call <function>cache_delete</function> -and even worse, add another object, re-using the same address. - </para> - <para> -As there is only one lock, you can't hold it forever: no-one else would -get any work done. - </para> - <para> -The solution to this problem is to use a reference count: everyone who -has a pointer to the object increases it when they first get the -object, and drops the reference count when they're finished with it. -Whoever drops it to zero knows it is unused, and can actually delete it. - </para> - <para> -Here is the code: - </para> - -<programlisting> ---- cache.c.interrupt 2003-12-09 14:25:43.000000000 +1100 -+++ cache.c.refcnt 2003-12-09 14:33:05.000000000 +1100 -@@ -7,6 +7,7 @@ - struct object - { - struct list_head list; -+ unsigned int refcnt; - int id; - char name[32]; - int popularity; -@@ -17,6 +18,35 @@ - static unsigned int cache_num = 0; - #define MAX_CACHE_SIZE 10 - -+static void __object_put(struct object *obj) -+{ -+ if (--obj->refcnt == 0) -+ kfree(obj); -+} -+ -+static void __object_get(struct object *obj) -+{ -+ obj->refcnt++; -+} -+ -+void object_put(struct object *obj) -+{ -+ unsigned long flags; -+ -+ spin_lock_irqsave(&cache_lock, flags); -+ __object_put(obj); -+ spin_unlock_irqrestore(&cache_lock, flags); -+} -+ -+void object_get(struct object *obj) -+{ -+ unsigned long flags; -+ -+ spin_lock_irqsave(&cache_lock, flags); -+ __object_get(obj); -+ spin_unlock_irqrestore(&cache_lock, flags); -+} -+ - /* Must be holding cache_lock */ - static struct object *__cache_find(int id) - { -@@ -35,6 +65,7 @@ - { - BUG_ON(!obj); - list_del(&obj->list); -+ __object_put(obj); - cache_num--; - } - -@@ -63,6 +94,7 @@ - strlcpy(obj->name, name, sizeof(obj->name)); - obj->id = id; - obj->popularity = 0; -+ obj->refcnt = 1; /* The cache holds a reference */ - - spin_lock_irqsave(&cache_lock, flags); - __cache_add(obj); -@@ -79,18 +111,15 @@ - spin_unlock_irqrestore(&cache_lock, flags); - } - --int cache_find(int id, char *name) -+struct object *cache_find(int id) - { - struct object *obj; -- int ret = -ENOENT; - unsigned long flags; - - spin_lock_irqsave(&cache_lock, flags); - obj = __cache_find(id); -- if (obj) { -- ret = 0; -- strcpy(name, obj->name); -- } -+ if (obj) -+ __object_get(obj); - spin_unlock_irqrestore(&cache_lock, flags); -- return ret; -+ return obj; - } -</programlisting> - -<para> -We encapsulate the reference counting in the standard 'get' and 'put' -functions. Now we can return the object itself from -<function>cache_find</function> which has the advantage that the user -can now sleep holding the object (eg. to -<function>copy_to_user</function> to name to userspace). -</para> -<para> -The other point to note is that I said a reference should be held for -every pointer to the object: thus the reference count is 1 when first -inserted into the cache. In some versions the framework does not hold -a reference count, but they are more complicated. -</para> - - <sect2 id="examples-refcnt-atomic"> - <title>Using Atomic Operations For The Reference Count</title> -<para> -In practice, <type>atomic_t</type> would usually be used for -<structfield>refcnt</structfield>. There are a number of atomic -operations defined in - -<filename class="headerfile">include/asm/atomic.h</filename>: these are -guaranteed to be seen atomically from all CPUs in the system, so no -lock is required. In this case, it is simpler than using spinlocks, -although for anything non-trivial using spinlocks is clearer. The -<function>atomic_inc</function> and -<function>atomic_dec_and_test</function> are used instead of the -standard increment and decrement operators, and the lock is no longer -used to protect the reference count itself. -</para> - -<programlisting> ---- cache.c.refcnt 2003-12-09 15:00:35.000000000 +1100 -+++ cache.c.refcnt-atomic 2003-12-11 15:49:42.000000000 +1100 -@@ -7,7 +7,7 @@ - struct object - { - struct list_head list; -- unsigned int refcnt; -+ atomic_t refcnt; - int id; - char name[32]; - int popularity; -@@ -18,33 +18,15 @@ - static unsigned int cache_num = 0; - #define MAX_CACHE_SIZE 10 - --static void __object_put(struct object *obj) --{ -- if (--obj->refcnt == 0) -- kfree(obj); --} -- --static void __object_get(struct object *obj) --{ -- obj->refcnt++; --} -- - void object_put(struct object *obj) - { -- unsigned long flags; -- -- spin_lock_irqsave(&cache_lock, flags); -- __object_put(obj); -- spin_unlock_irqrestore(&cache_lock, flags); -+ if (atomic_dec_and_test(&obj->refcnt)) -+ kfree(obj); - } - - void object_get(struct object *obj) - { -- unsigned long flags; -- -- spin_lock_irqsave(&cache_lock, flags); -- __object_get(obj); -- spin_unlock_irqrestore(&cache_lock, flags); -+ atomic_inc(&obj->refcnt); - } - - /* Must be holding cache_lock */ -@@ -65,7 +47,7 @@ - { - BUG_ON(!obj); - list_del(&obj->list); -- __object_put(obj); -+ object_put(obj); - cache_num--; - } - -@@ -94,7 +76,7 @@ - strlcpy(obj->name, name, sizeof(obj->name)); - obj->id = id; - obj->popularity = 0; -- obj->refcnt = 1; /* The cache holds a reference */ -+ atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ - - spin_lock_irqsave(&cache_lock, flags); - __cache_add(obj); -@@ -119,7 +101,7 @@ - spin_lock_irqsave(&cache_lock, flags); - obj = __cache_find(id); - if (obj) -- __object_get(obj); -+ object_get(obj); - spin_unlock_irqrestore(&cache_lock, flags); - return obj; - } -</programlisting> -</sect2> -</sect1> - - <sect1 id="examples-lock-per-obj"> - <title>Protecting The Objects Themselves</title> - <para> -In these examples, we assumed that the objects (except the reference -counts) never changed once they are created. If we wanted to allow -the name to change, there are three possibilities: - </para> - <itemizedlist> - <listitem> - <para> -You can make <symbol>cache_lock</symbol> non-static, and tell people -to grab that lock before changing the name in any object. - </para> - </listitem> - <listitem> - <para> -You can provide a <function>cache_obj_rename</function> which grabs -this lock and changes the name for the caller, and tell everyone to -use that function. - </para> - </listitem> - <listitem> - <para> -You can make the <symbol>cache_lock</symbol> protect only the cache -itself, and use another lock to protect the name. - </para> - </listitem> - </itemizedlist> - - <para> -Theoretically, you can make the locks as fine-grained as one lock for -every field, for every object. In practice, the most common variants -are: -</para> - <itemizedlist> - <listitem> - <para> -One lock which protects the infrastructure (the <symbol>cache</symbol> -list in this example) and all the objects. This is what we have done -so far. - </para> - </listitem> - <listitem> - <para> -One lock which protects the infrastructure (including the list -pointers inside the objects), and one lock inside the object which -protects the rest of that object. - </para> - </listitem> - <listitem> - <para> -Multiple locks to protect the infrastructure (eg. one lock per hash -chain), possibly with a separate per-object lock. - </para> - </listitem> - </itemizedlist> - -<para> -Here is the "lock-per-object" implementation: -</para> -<programlisting> ---- cache.c.refcnt-atomic 2003-12-11 15:50:54.000000000 +1100 -+++ cache.c.perobjectlock 2003-12-11 17:15:03.000000000 +1100 -@@ -6,11 +6,17 @@ - - struct object - { -+ /* These two protected by cache_lock. */ - struct list_head list; -+ int popularity; -+ - atomic_t refcnt; -+ -+ /* Doesn't change once created. */ - int id; -+ -+ spinlock_t lock; /* Protects the name */ - char name[32]; -- int popularity; - }; - - static DEFINE_SPINLOCK(cache_lock); -@@ -77,6 +84,7 @@ - obj->id = id; - obj->popularity = 0; - atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ -+ spin_lock_init(&obj->lock); - - spin_lock_irqsave(&cache_lock, flags); - __cache_add(obj); -</programlisting> - -<para> -Note that I decide that the <structfield>popularity</structfield> -count should be protected by the <symbol>cache_lock</symbol> rather -than the per-object lock: this is because it (like the -<structname>struct list_head</structname> inside the object) is -logically part of the infrastructure. This way, I don't need to grab -the lock of every object in <function>__cache_add</function> when -seeking the least popular. -</para> - -<para> -I also decided that the <structfield>id</structfield> member is -unchangeable, so I don't need to grab each object lock in -<function>__cache_find()</function> to examine the -<structfield>id</structfield>: the object lock is only used by a -caller who wants to read or write the <structfield>name</structfield> -field. -</para> - -<para> -Note also that I added a comment describing what data was protected by -which locks. This is extremely important, as it describes the runtime -behavior of the code, and can be hard to gain from just reading. And -as Alan Cox says, <quote>Lock data, not code</quote>. -</para> -</sect1> -</chapter> - - <chapter id="common-problems"> - <title>Common Problems</title> - <sect1 id="deadlock"> - <title>Deadlock: Simple and Advanced</title> - - <para> - There is a coding bug where a piece of code tries to grab a - spinlock twice: it will spin forever, waiting for the lock to - be released (spinlocks, rwlocks and mutexes are not - recursive in Linux). This is trivial to diagnose: not a - stay-up-five-nights-talk-to-fluffy-code-bunnies kind of - problem. - </para> - - <para> - For a slightly more complex case, imagine you have a region - shared by a softirq and user context. If you use a - <function>spin_lock()</function> call to protect it, it is - possible that the user context will be interrupted by the softirq - while it holds the lock, and the softirq will then spin - forever trying to get the same lock. - </para> - - <para> - Both of these are called deadlock, and as shown above, it can - occur even with a single CPU (although not on UP compiles, - since spinlocks vanish on kernel compiles with - <symbol>CONFIG_SMP</symbol>=n. You'll still get data corruption - in the second example). - </para> - - <para> - This complete lockup is easy to diagnose: on SMP boxes the - watchdog timer or compiling with <symbol>DEBUG_SPINLOCK</symbol> set - (<filename>include/linux/spinlock.h</filename>) will show this up - immediately when it happens. - </para> - - <para> - A more complex problem is the so-called 'deadly embrace', - involving two or more locks. Say you have a hash table: each - entry in the table is a spinlock, and a chain of hashed - objects. Inside a softirq handler, you sometimes want to - alter an object from one place in the hash to another: you - grab the spinlock of the old hash chain and the spinlock of - the new hash chain, and delete the object from the old one, - and insert it in the new one. - </para> - - <para> - There are two problems here. First, if your code ever - tries to move the object to the same chain, it will deadlock - with itself as it tries to lock it twice. Secondly, if the - same softirq on another CPU is trying to move another object - in the reverse direction, the following could happen: - </para> - - <table> - <title>Consequences</title> - - <tgroup cols="2" align="left"> - - <thead> - <row> - <entry>CPU 1</entry> - <entry>CPU 2</entry> - </row> - </thead> - - <tbody> - <row> - <entry>Grab lock A -> OK</entry> - <entry>Grab lock B -> OK</entry> - </row> - <row> - <entry>Grab lock B -> spin</entry> - <entry>Grab lock A -> spin</entry> - </row> - </tbody> - </tgroup> - </table> - - <para> - The two CPUs will spin forever, waiting for the other to give up - their lock. It will look, smell, and feel like a crash. - </para> - </sect1> - - <sect1 id="techs-deadlock-prevent"> - <title>Preventing Deadlock</title> - - <para> - Textbooks will tell you that if you always lock in the same - order, you will never get this kind of deadlock. Practice - will tell you that this approach doesn't scale: when I - create a new lock, I don't understand enough of the kernel - to figure out where in the 5000 lock hierarchy it will fit. - </para> - - <para> - The best locks are encapsulated: they never get exposed in - headers, and are never held around calls to non-trivial - functions outside the same file. You can read through this - code and see that it will never deadlock, because it never - tries to grab another lock while it has that one. People - using your code don't even need to know you are using a - lock. - </para> - - <para> - A classic problem here is when you provide callbacks or - hooks: if you call these with the lock held, you risk simple - deadlock, or a deadly embrace (who knows what the callback - will do?). Remember, the other programmers are out to get - you, so don't do this. - </para> - - <sect2 id="techs-deadlock-overprevent"> - <title>Overzealous Prevention Of Deadlocks</title> - - <para> - Deadlocks are problematic, but not as bad as data - corruption. Code which grabs a read lock, searches a list, - fails to find what it wants, drops the read lock, grabs a - write lock and inserts the object has a race condition. - </para> - - <para> - If you don't see why, please stay the fuck away from my code. - </para> - </sect2> - </sect1> - - <sect1 id="racing-timers"> - <title>Racing Timers: A Kernel Pastime</title> - - <para> - Timers can produce their own special problems with races. - Consider a collection of objects (list, hash, etc) where each - object has a timer which is due to destroy it. - </para> - - <para> - If you want to destroy the entire collection (say on module - removal), you might do the following: - </para> - - <programlisting> - /* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE - HUNGARIAN NOTATION */ - spin_lock_bh(&list_lock); - - while (list) { - struct foo *next = list->next; - del_timer(&list->timer); - kfree(list); - list = next; - } - - spin_unlock_bh(&list_lock); - </programlisting> - - <para> - Sooner or later, this will crash on SMP, because a timer can - have just gone off before the <function>spin_lock_bh()</function>, - and it will only get the lock after we - <function>spin_unlock_bh()</function>, and then try to free - the element (which has already been freed!). - </para> - - <para> - This can be avoided by checking the result of - <function>del_timer()</function>: if it returns - <returnvalue>1</returnvalue>, the timer has been deleted. - If <returnvalue>0</returnvalue>, it means (in this - case) that it is currently running, so we can do: - </para> - - <programlisting> - retry: - spin_lock_bh(&list_lock); - - while (list) { - struct foo *next = list->next; - if (!del_timer(&list->timer)) { - /* Give timer a chance to delete this */ - spin_unlock_bh(&list_lock); - goto retry; - } - kfree(list); - list = next; - } - - spin_unlock_bh(&list_lock); - </programlisting> - - <para> - Another common problem is deleting timers which restart - themselves (by calling <function>add_timer()</function> at the end - of their timer function). Because this is a fairly common case - which is prone to races, you should use <function>del_timer_sync()</function> - (<filename class="headerfile">include/linux/timer.h</filename>) - to handle this case. It returns the number of times the timer - had to be deleted before we finally stopped it from adding itself back - in. - </para> - </sect1> - - </chapter> - - <chapter id="Efficiency"> - <title>Locking Speed</title> - - <para> -There are three main things to worry about when considering speed of -some code which does locking. First is concurrency: how many things -are going to be waiting while someone else is holding a lock. Second -is the time taken to actually acquire and release an uncontended lock. -Third is using fewer, or smarter locks. I'm assuming that the lock is -used fairly often: otherwise, you wouldn't be concerned about -efficiency. -</para> - <para> -Concurrency depends on how long the lock is usually held: you should -hold the lock for as long as needed, but no longer. In the cache -example, we always create the object without the lock held, and then -grab the lock only when we are ready to insert it in the list. -</para> - <para> -Acquisition times depend on how much damage the lock operations do to -the pipeline (pipeline stalls) and how likely it is that this CPU was -the last one to grab the lock (ie. is the lock cache-hot for this -CPU): on a machine with more CPUs, this likelihood drops fast. -Consider a 700MHz Intel Pentium III: an instruction takes about 0.7ns, -an atomic increment takes about 58ns, a lock which is cache-hot on -this CPU takes 160ns, and a cacheline transfer from another CPU takes -an additional 170 to 360ns. (These figures from Paul McKenney's -<ulink url="http://www.linuxjournal.com/article.php?sid=6993"> Linux -Journal RCU article</ulink>). -</para> - <para> -These two aims conflict: holding a lock for a short time might be done -by splitting locks into parts (such as in our final per-object-lock -example), but this increases the number of lock acquisitions, and the -results are often slower than having a single lock. This is another -reason to advocate locking simplicity. -</para> - <para> -The third concern is addressed below: there are some methods to reduce -the amount of locking which needs to be done. -</para> - - <sect1 id="efficiency-rwlocks"> - <title>Read/Write Lock Variants</title> - - <para> - Both spinlocks and mutexes have read/write variants: - <type>rwlock_t</type> and <structname>struct rw_semaphore</structname>. - These divide users into two classes: the readers and the writers. If - you are only reading the data, you can get a read lock, but to write to - the data you need the write lock. Many people can hold a read lock, - but a writer must be sole holder. - </para> - - <para> - If your code divides neatly along reader/writer lines (as our - cache code does), and the lock is held by readers for - significant lengths of time, using these locks can help. They - are slightly slower than the normal locks though, so in practice - <type>rwlock_t</type> is not usually worthwhile. - </para> - </sect1> - - <sect1 id="efficiency-read-copy-update"> - <title>Avoiding Locks: Read Copy Update</title> - - <para> - There is a special method of read/write locking called Read Copy - Update. Using RCU, the readers can avoid taking a lock - altogether: as we expect our cache to be read more often than - updated (otherwise the cache is a waste of time), it is a - candidate for this optimization. - </para> - - <para> - How do we get rid of read locks? Getting rid of read locks - means that writers may be changing the list underneath the - readers. That is actually quite simple: we can read a linked - list while an element is being added if the writer adds the - element very carefully. For example, adding - <symbol>new</symbol> to a single linked list called - <symbol>list</symbol>: - </para> - - <programlisting> - new->next = list->next; - wmb(); - list->next = new; - </programlisting> - - <para> - The <function>wmb()</function> is a write memory barrier. It - ensures that the first operation (setting the new element's - <symbol>next</symbol> pointer) is complete and will be seen by - all CPUs, before the second operation is (putting the new - element into the list). This is important, since modern - compilers and modern CPUs can both reorder instructions unless - told otherwise: we want a reader to either not see the new - element at all, or see the new element with the - <symbol>next</symbol> pointer correctly pointing at the rest of - the list. - </para> - <para> - Fortunately, there is a function to do this for standard - <structname>struct list_head</structname> lists: - <function>list_add_rcu()</function> - (<filename>include/linux/list.h</filename>). - </para> - <para> - Removing an element from the list is even simpler: we replace - the pointer to the old element with a pointer to its successor, - and readers will either see it, or skip over it. - </para> - <programlisting> - list->next = old->next; - </programlisting> - <para> - There is <function>list_del_rcu()</function> - (<filename>include/linux/list.h</filename>) which does this (the - normal version poisons the old object, which we don't want). - </para> - <para> - The reader must also be careful: some CPUs can look through the - <symbol>next</symbol> pointer to start reading the contents of - the next element early, but don't realize that the pre-fetched - contents is wrong when the <symbol>next</symbol> pointer changes - underneath them. Once again, there is a - <function>list_for_each_entry_rcu()</function> - (<filename>include/linux/list.h</filename>) to help you. Of - course, writers can just use - <function>list_for_each_entry()</function>, since there cannot - be two simultaneous writers. - </para> - <para> - Our final dilemma is this: when can we actually destroy the - removed element? Remember, a reader might be stepping through - this element in the list right now: if we free this element and - the <symbol>next</symbol> pointer changes, the reader will jump - off into garbage and crash. We need to wait until we know that - all the readers who were traversing the list when we deleted the - element are finished. We use <function>call_rcu()</function> to - register a callback which will actually destroy the object once - all pre-existing readers are finished. Alternatively, - <function>synchronize_rcu()</function> may be used to block until - all pre-existing are finished. - </para> - <para> - But how does Read Copy Update know when the readers are - finished? The method is this: firstly, the readers always - traverse the list inside - <function>rcu_read_lock()</function>/<function>rcu_read_unlock()</function> - pairs: these simply disable preemption so the reader won't go to - sleep while reading the list. - </para> - <para> - RCU then waits until every other CPU has slept at least once: - since readers cannot sleep, we know that any readers which were - traversing the list during the deletion are finished, and the - callback is triggered. The real Read Copy Update code is a - little more optimized than this, but this is the fundamental - idea. - </para> - -<programlisting> ---- cache.c.perobjectlock 2003-12-11 17:15:03.000000000 +1100 -+++ cache.c.rcupdate 2003-12-11 17:55:14.000000000 +1100 -@@ -1,15 +1,18 @@ - #include <linux/list.h> - #include <linux/slab.h> - #include <linux/string.h> -+#include <linux/rcupdate.h> - #include <linux/mutex.h> - #include <asm/errno.h> - - struct object - { -- /* These two protected by cache_lock. */ -+ /* This is protected by RCU */ - struct list_head list; - int popularity; - -+ struct rcu_head rcu; -+ - atomic_t refcnt; - - /* Doesn't change once created. */ -@@ -40,7 +43,7 @@ - { - struct object *i; - -- list_for_each_entry(i, &cache, list) { -+ list_for_each_entry_rcu(i, &cache, list) { - if (i->id == id) { - i->popularity++; - return i; -@@ -49,19 +52,25 @@ - return NULL; - } - -+/* Final discard done once we know no readers are looking. */ -+static void cache_delete_rcu(void *arg) -+{ -+ object_put(arg); -+} -+ - /* Must be holding cache_lock */ - static void __cache_delete(struct object *obj) - { - BUG_ON(!obj); -- list_del(&obj->list); -- object_put(obj); -+ list_del_rcu(&obj->list); - cache_num--; -+ call_rcu(&obj->rcu, cache_delete_rcu); - } - - /* Must be holding cache_lock */ - static void __cache_add(struct object *obj) - { -- list_add(&obj->list, &cache); -+ list_add_rcu(&obj->list, &cache); - if (++cache_num > MAX_CACHE_SIZE) { - struct object *i, *outcast = NULL; - list_for_each_entry(i, &cache, list) { -@@ -104,12 +114,11 @@ - struct object *cache_find(int id) - { - struct object *obj; -- unsigned long flags; - -- spin_lock_irqsave(&cache_lock, flags); -+ rcu_read_lock(); - obj = __cache_find(id); - if (obj) - object_get(obj); -- spin_unlock_irqrestore(&cache_lock, flags); -+ rcu_read_unlock(); - return obj; - } -</programlisting> - -<para> -Note that the reader will alter the -<structfield>popularity</structfield> member in -<function>__cache_find()</function>, and now it doesn't hold a lock. -One solution would be to make it an <type>atomic_t</type>, but for -this usage, we don't really care about races: an approximate result is -good enough, so I didn't change it. -</para> - -<para> -The result is that <function>cache_find()</function> requires no -synchronization with any other functions, so is almost as fast on SMP -as it would be on UP. -</para> - -<para> -There is a further optimization possible here: remember our original -cache code, where there were no reference counts and the caller simply -held the lock whenever using the object? This is still possible: if -you hold the lock, no one can delete the object, so you don't need to -get and put the reference count. -</para> - -<para> -Now, because the 'read lock' in RCU is simply disabling preemption, a -caller which always has preemption disabled between calling -<function>cache_find()</function> and -<function>object_put()</function> does not need to actually get and -put the reference count: we could expose -<function>__cache_find()</function> by making it non-static, and -such callers could simply call that. -</para> -<para> -The benefit here is that the reference count is not written to: the -object is not altered in any way, which is much faster on SMP -machines due to caching. -</para> - </sect1> - - <sect1 id="per-cpu"> - <title>Per-CPU Data</title> - - <para> - Another technique for avoiding locking which is used fairly - widely is to duplicate information for each CPU. For example, - if you wanted to keep a count of a common condition, you could - use a spin lock and a single counter. Nice and simple. - </para> - - <para> - If that was too slow (it's usually not, but if you've got a - really big machine to test on and can show that it is), you - could instead use a counter for each CPU, then none of them need - an exclusive lock. See <function>DEFINE_PER_CPU()</function>, - <function>get_cpu_var()</function> and - <function>put_cpu_var()</function> - (<filename class="headerfile">include/linux/percpu.h</filename>). - </para> - - <para> - Of particular use for simple per-cpu counters is the - <type>local_t</type> type, and the - <function>cpu_local_inc()</function> and related functions, - which are more efficient than simple code on some architectures - (<filename class="headerfile">include/asm/local.h</filename>). - </para> - - <para> - Note that there is no simple, reliable way of getting an exact - value of such a counter, without introducing more locks. This - is not a problem for some uses. - </para> - </sect1> - - <sect1 id="mostly-hardirq"> - <title>Data Which Mostly Used By An IRQ Handler</title> - - <para> - If data is always accessed from within the same IRQ handler, you - don't need a lock at all: the kernel already guarantees that the - irq handler will not run simultaneously on multiple CPUs. - </para> - <para> - Manfred Spraul points out that you can still do this, even if - the data is very occasionally accessed in user context or - softirqs/tasklets. The irq handler doesn't use a lock, and - all other accesses are done as so: - </para> - -<programlisting> - spin_lock(&lock); - disable_irq(irq); - ... - enable_irq(irq); - spin_unlock(&lock); -</programlisting> - <para> - The <function>disable_irq()</function> prevents the irq handler - from running (and waits for it to finish if it's currently - running on other CPUs). The spinlock prevents any other - accesses happening at the same time. Naturally, this is slower - than just a <function>spin_lock_irq()</function> call, so it - only makes sense if this type of access happens extremely - rarely. - </para> - </sect1> - </chapter> - - <chapter id="sleeping-things"> - <title>What Functions Are Safe To Call From Interrupts?</title> - - <para> - Many functions in the kernel sleep (ie. call schedule()) - directly or indirectly: you can never call them while holding a - spinlock, or with preemption disabled. This also means you need - to be in user context: calling them from an interrupt is illegal. - </para> - - <sect1 id="sleeping"> - <title>Some Functions Which Sleep</title> - - <para> - The most common ones are listed below, but you usually have to - read the code to find out if other calls are safe. If everyone - else who calls it can sleep, you probably need to be able to - sleep, too. In particular, registration and deregistration - functions usually expect to be called from user context, and can - sleep. - </para> - - <itemizedlist> - <listitem> - <para> - Accesses to - <firstterm linkend="gloss-userspace">userspace</firstterm>: - </para> - <itemizedlist> - <listitem> - <para> - <function>copy_from_user()</function> - </para> - </listitem> - <listitem> - <para> - <function>copy_to_user()</function> - </para> - </listitem> - <listitem> - <para> - <function>get_user()</function> - </para> - </listitem> - <listitem> - <para> - <function>put_user()</function> - </para> - </listitem> - </itemizedlist> - </listitem> - - <listitem> - <para> - <function>kmalloc(GFP_KERNEL)</function> - </para> - </listitem> - - <listitem> - <para> - <function>mutex_lock_interruptible()</function> and - <function>mutex_lock()</function> - </para> - <para> - There is a <function>mutex_trylock()</function> which does not - sleep. Still, it must not be used inside interrupt context since - its implementation is not safe for that. - <function>mutex_unlock()</function> will also never sleep. - It cannot be used in interrupt context either since a mutex - must be released by the same task that acquired it. - </para> - </listitem> - </itemizedlist> - </sect1> - - <sect1 id="dont-sleep"> - <title>Some Functions Which Don't Sleep</title> - - <para> - Some functions are safe to call from any context, or holding - almost any lock. - </para> - - <itemizedlist> - <listitem> - <para> - <function>printk()</function> - </para> - </listitem> - <listitem> - <para> - <function>kfree()</function> - </para> - </listitem> - <listitem> - <para> - <function>add_timer()</function> and <function>del_timer()</function> - </para> - </listitem> - </itemizedlist> - </sect1> - </chapter> - - <chapter id="apiref-mutex"> - <title>Mutex API reference</title> -!Iinclude/linux/mutex.h -!Ekernel/locking/mutex.c - </chapter> - - <chapter id="apiref-futex"> - <title>Futex API reference</title> -!Ikernel/futex.c - </chapter> - - <chapter id="references"> - <title>Further reading</title> - - <itemizedlist> - <listitem> - <para> - <filename>Documentation/locking/spinlocks.txt</filename>: - Linus Torvalds' spinlocking tutorial in the kernel sources. - </para> - </listitem> - - <listitem> - <para> - Unix Systems for Modern Architectures: Symmetric - Multiprocessing and Caching for Kernel Programmers: - </para> - - <para> - Curt Schimmel's very good introduction to kernel level - locking (not written for Linux, but nearly everything - applies). The book is expensive, but really worth every - penny to understand SMP locking. [ISBN: 0201633388] - </para> - </listitem> - </itemizedlist> - </chapter> - - <chapter id="thanks"> - <title>Thanks</title> - - <para> - Thanks to Telsa Gwynne for DocBooking, neatening and adding - style. - </para> - - <para> - Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul - Mackerras, Ruedi Aschwanden, Alan Cox, Manfred Spraul, Tim - Waugh, Pete Zaitcev, James Morris, Robert Love, Paul McKenney, - John Ashby for proofreading, correcting, flaming, commenting. - </para> - - <para> - Thanks to the cabal for having no influence on this document. - </para> - </chapter> - - <glossary id="glossary"> - <title>Glossary</title> - - <glossentry id="gloss-preemption"> - <glossterm>preemption</glossterm> - <glossdef> - <para> - Prior to 2.5, or when <symbol>CONFIG_PREEMPT</symbol> is - unset, processes in user context inside the kernel would not - preempt each other (ie. you had that CPU until you gave it up, - except for interrupts). With the addition of - <symbol>CONFIG_PREEMPT</symbol> in 2.5.4, this changed: when - in user context, higher priority tasks can "cut in": spinlocks - were changed to disable preemption, even on UP. - </para> - </glossdef> - </glossentry> - - <glossentry id="gloss-bh"> - <glossterm>bh</glossterm> - <glossdef> - <para> - Bottom Half: for historical reasons, functions with - '_bh' in them often now refer to any software interrupt, e.g. - <function>spin_lock_bh()</function> blocks any software interrupt - on the current CPU. Bottom halves are deprecated, and will - eventually be replaced by tasklets. Only one bottom half will be - running at any time. - </para> - </glossdef> - </glossentry> - - <glossentry id="gloss-hwinterrupt"> - <glossterm>Hardware Interrupt / Hardware IRQ</glossterm> - <glossdef> - <para> - Hardware interrupt request. <function>in_irq()</function> returns - <returnvalue>true</returnvalue> in a hardware interrupt handler. - </para> - </glossdef> - </glossentry> - - <glossentry id="gloss-interruptcontext"> - <glossterm>Interrupt Context</glossterm> - <glossdef> - <para> - Not user context: processing a hardware irq or software irq. - Indicated by the <function>in_interrupt()</function> macro - returning <returnvalue>true</returnvalue>. - </para> - </glossdef> - </glossentry> - - <glossentry id="gloss-smp"> - <glossterm><acronym>SMP</acronym></glossterm> - <glossdef> - <para> - Symmetric Multi-Processor: kernels compiled for multiple-CPU - machines. (CONFIG_SMP=y). - </para> - </glossdef> - </glossentry> - - <glossentry id="gloss-softirq"> - <glossterm>Software Interrupt / softirq</glossterm> - <glossdef> - <para> - Software interrupt handler. <function>in_irq()</function> returns - <returnvalue>false</returnvalue>; <function>in_softirq()</function> - returns <returnvalue>true</returnvalue>. Tasklets and softirqs - both fall into the category of 'software interrupts'. - </para> - <para> - Strictly speaking a softirq is one of up to 32 enumerated software - interrupts which can run on multiple CPUs at once. - Sometimes used to refer to tasklets as - well (ie. all software interrupts). - </para> - </glossdef> - </glossentry> - - <glossentry id="gloss-tasklet"> - <glossterm>tasklet</glossterm> - <glossdef> - <para> - A dynamically-registrable software interrupt, - which is guaranteed to only run on one CPU at a time. - </para> - </glossdef> - </glossentry> - - <glossentry id="gloss-timers"> - <glossterm>timer</glossterm> - <glossdef> - <para> - A dynamically-registrable software interrupt, which is run at - (or close to) a given time. When running, it is just like a - tasklet (in fact, they are called from the TIMER_SOFTIRQ). - </para> - </glossdef> - </glossentry> - - <glossentry id="gloss-up"> - <glossterm><acronym>UP</acronym></glossterm> - <glossdef> - <para> - Uni-Processor: Non-SMP. (CONFIG_SMP=n). - </para> - </glossdef> - </glossentry> - - <glossentry id="gloss-usercontext"> - <glossterm>User Context</glossterm> - <glossdef> - <para> - The kernel executing on behalf of a particular process (ie. a - system call or trap) or kernel thread. You can tell which - process with the <symbol>current</symbol> macro.) Not to - be confused with userspace. Can be interrupted by software or - hardware interrupts. - </para> - </glossdef> - </glossentry> - - <glossentry id="gloss-userspace"> - <glossterm>Userspace</glossterm> - <glossdef> - <para> - A process executing its own code outside the kernel. - </para> - </glossdef> - </glossentry> - - </glossary> -</book> - diff --git a/Documentation/DocBook/kgdb.tmpl b/Documentation/DocBook/kgdb.tmpl deleted file mode 100644 index 856ac20bf367..000000000000 --- a/Documentation/DocBook/kgdb.tmpl +++ /dev/null @@ -1,918 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" - "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> - -<book id="kgdbOnLinux"> - <bookinfo> - <title>Using kgdb, kdb and the kernel debugger internals</title> - - <authorgroup> - <author> - <firstname>Jason</firstname> - <surname>Wessel</surname> - <affiliation> - <address> - <email>jason.wessel@windriver.com</email> - </address> - </affiliation> - </author> - </authorgroup> - <copyright> - <year>2008,2010</year> - <holder>Wind River Systems, Inc.</holder> - </copyright> - <copyright> - <year>2004-2005</year> - <holder>MontaVista Software, Inc.</holder> - </copyright> - <copyright> - <year>2004</year> - <holder>Amit S. Kale</holder> - </copyright> - - <legalnotice> - <para> - This file is licensed under the terms of the GNU General Public License - version 2. This program is licensed "as is" without any warranty of any - kind, whether express or implied. - </para> - - </legalnotice> - </bookinfo> - -<toc></toc> - <chapter id="Introduction"> - <title>Introduction</title> - <para> - The kernel has two different debugger front ends (kdb and kgdb) - which interface to the debug core. It is possible to use either - of the debugger front ends and dynamically transition between them - if you configure the kernel properly at compile and runtime. - </para> - <para> - Kdb is simplistic shell-style interface which you can use on a - system console with a keyboard or serial console. You can use it - to inspect memory, registers, process lists, dmesg, and even set - breakpoints to stop in a certain location. Kdb is not a source - level debugger, although you can set breakpoints and execute some - basic kernel run control. Kdb is mainly aimed at doing some - analysis to aid in development or diagnosing kernel problems. You - can access some symbols by name in kernel built-ins or in kernel - modules if the code was built - with <symbol>CONFIG_KALLSYMS</symbol>. - </para> - <para> - Kgdb is intended to be used as a source level debugger for the - Linux kernel. It is used along with gdb to debug a Linux kernel. - The expectation is that gdb can be used to "break in" to the - kernel to inspect memory, variables and look through call stack - information similar to the way an application developer would use - gdb to debug an application. It is possible to place breakpoints - in kernel code and perform some limited execution stepping. - </para> - <para> - Two machines are required for using kgdb. One of these machines is - a development machine and the other is the target machine. The - kernel to be debugged runs on the target machine. The development - machine runs an instance of gdb against the vmlinux file which - contains the symbols (not a boot image such as bzImage, zImage, - uImage...). In gdb the developer specifies the connection - parameters and connects to kgdb. The type of connection a - developer makes with gdb depends on the availability of kgdb I/O - modules compiled as built-ins or loadable kernel modules in the test - machine's kernel. - </para> - </chapter> - <chapter id="CompilingAKernel"> - <title>Compiling a kernel</title> - <para> - <itemizedlist> - <listitem><para>In order to enable compilation of kdb, you must first enable kgdb.</para></listitem> - <listitem><para>The kgdb test compile options are described in the kgdb test suite chapter.</para></listitem> - </itemizedlist> - </para> - <sect1 id="CompileKGDB"> - <title>Kernel config options for kgdb</title> - <para> - To enable <symbol>CONFIG_KGDB</symbol> you should look under - "Kernel hacking" / "Kernel debugging" and select "KGDB: kernel debugger". - </para> - <para> - While it is not a hard requirement that you have symbols in your - vmlinux file, gdb tends not to be very useful without the symbolic - data, so you will want to turn - on <symbol>CONFIG_DEBUG_INFO</symbol> which is called "Compile the - kernel with debug info" in the config menu. - </para> - <para> - It is advised, but not required, that you turn on the - <symbol>CONFIG_FRAME_POINTER</symbol> kernel option which is called "Compile the - kernel with frame pointers" in the config menu. This option - inserts code to into the compiled executable which saves the frame - information in registers or on the stack at different points which - allows a debugger such as gdb to more accurately construct - stack back traces while debugging the kernel. - </para> - <para> - If the architecture that you are using supports the kernel option - CONFIG_STRICT_KERNEL_RWX, you should consider turning it off. This - option will prevent the use of software breakpoints because it - marks certain regions of the kernel's memory space as read-only. - If kgdb supports it for the architecture you are using, you can - use hardware breakpoints if you desire to run with the - CONFIG_STRICT_KERNEL_RWX option turned on, else you need to turn off - this option. - </para> - <para> - Next you should choose one of more I/O drivers to interconnect - debugging host and debugged target. Early boot debugging requires - a KGDB I/O driver that supports early debugging and the driver - must be built into the kernel directly. Kgdb I/O driver - configuration takes place via kernel or module parameters which - you can learn more about in the in the section that describes the - parameter "kgdboc". - </para> - <para>Here is an example set of .config symbols to enable or - disable for kgdb: - <itemizedlist> - <listitem><para># CONFIG_STRICT_KERNEL_RWX is not set</para></listitem> - <listitem><para>CONFIG_FRAME_POINTER=y</para></listitem> - <listitem><para>CONFIG_KGDB=y</para></listitem> - <listitem><para>CONFIG_KGDB_SERIAL_CONSOLE=y</para></listitem> - </itemizedlist> - </para> - </sect1> - <sect1 id="CompileKDB"> - <title>Kernel config options for kdb</title> - <para>Kdb is quite a bit more complex than the simple gdbstub - sitting on top of the kernel's debug core. Kdb must implement a - shell, and also adds some helper functions in other parts of the - kernel, responsible for printing out interesting data such as what - you would see if you ran "lsmod", or "ps". In order to build kdb - into the kernel you follow the same steps as you would for kgdb. - </para> - <para>The main config option for kdb - is <symbol>CONFIG_KGDB_KDB</symbol> which is called "KGDB_KDB: - include kdb frontend for kgdb" in the config menu. In theory you - would have already also selected an I/O driver such as the - CONFIG_KGDB_SERIAL_CONSOLE interface if you plan on using kdb on a - serial port, when you were configuring kgdb. - </para> - <para>If you want to use a PS/2-style keyboard with kdb, you would - select CONFIG_KDB_KEYBOARD which is called "KGDB_KDB: keyboard as - input device" in the config menu. The CONFIG_KDB_KEYBOARD option - is not used for anything in the gdb interface to kgdb. The - CONFIG_KDB_KEYBOARD option only works with kdb. - </para> - <para>Here is an example set of .config symbols to enable/disable kdb: - <itemizedlist> - <listitem><para># CONFIG_STRICT_KERNEL_RWX is not set</para></listitem> - <listitem><para>CONFIG_FRAME_POINTER=y</para></listitem> - <listitem><para>CONFIG_KGDB=y</para></listitem> - <listitem><para>CONFIG_KGDB_SERIAL_CONSOLE=y</para></listitem> - <listitem><para>CONFIG_KGDB_KDB=y</para></listitem> - <listitem><para>CONFIG_KDB_KEYBOARD=y</para></listitem> - </itemizedlist> - </para> - </sect1> - </chapter> - <chapter id="kgdbKernelArgs"> - <title>Kernel Debugger Boot Arguments</title> - <para>This section describes the various runtime kernel - parameters that affect the configuration of the kernel debugger. - The following chapter covers using kdb and kgdb as well as - providing some examples of the configuration parameters.</para> - <sect1 id="kgdboc"> - <title>Kernel parameter: kgdboc</title> - <para>The kgdboc driver was originally an abbreviation meant to - stand for "kgdb over console". Today it is the primary mechanism - to configure how to communicate from gdb to kgdb as well as the - devices you want to use to interact with the kdb shell. - </para> - <para>For kgdb/gdb, kgdboc is designed to work with a single serial - port. It is intended to cover the circumstance where you want to - use a serial console as your primary console as well as using it to - perform kernel debugging. It is also possible to use kgdb on a - serial port which is not designated as a system console. Kgdboc - may be configured as a kernel built-in or a kernel loadable module. - You can only make use of <constant>kgdbwait</constant> and early - debugging if you build kgdboc into the kernel as a built-in. - </para> - <para>Optionally you can elect to activate kms (Kernel Mode - Setting) integration. When you use kms with kgdboc and you have a - video driver that has atomic mode setting hooks, it is possible to - enter the debugger on the graphics console. When the kernel - execution is resumed, the previous graphics mode will be restored. - This integration can serve as a useful tool to aid in diagnosing - crashes or doing analysis of memory with kdb while allowing the - full graphics console applications to run. - </para> - <sect2 id="kgdbocArgs"> - <title>kgdboc arguments</title> - <para>Usage: <constant>kgdboc=[kms][[,]kbd][[,]serial_device][,baud]</constant></para> - <para>The order listed above must be observed if you use any of the - optional configurations together. - </para> - <para>Abbreviations: - <itemizedlist> - <listitem><para>kms = Kernel Mode Setting</para></listitem> - <listitem><para>kbd = Keyboard</para></listitem> - </itemizedlist> - </para> - <para>You can configure kgdboc to use the keyboard, and/or a serial - device depending on if you are using kdb and/or kgdb, in one of the - following scenarios. The order listed above must be observed if - you use any of the optional configurations together. Using kms + - only gdb is generally not a useful combination.</para> - <sect3 id="kgdbocArgs1"> - <title>Using loadable module or built-in</title> - <para> - <orderedlist> - <listitem><para>As a kernel built-in:</para> - <para>Use the kernel boot argument: <constant>kgdboc=<tty-device>,[baud]</constant></para></listitem> - <listitem> - <para>As a kernel loadable module:</para> - <para>Use the command: <constant>modprobe kgdboc kgdboc=<tty-device>,[baud]</constant></para> - <para>Here are two examples of how you might format the kgdboc - string. The first is for an x86 target using the first serial port. - The second example is for the ARM Versatile AB using the second - serial port. - <orderedlist> - <listitem><para><constant>kgdboc=ttyS0,115200</constant></para></listitem> - <listitem><para><constant>kgdboc=ttyAMA1,115200</constant></para></listitem> - </orderedlist> - </para> - </listitem> - </orderedlist></para> - </sect3> - <sect3 id="kgdbocArgs2"> - <title>Configure kgdboc at runtime with sysfs</title> - <para>At run time you can enable or disable kgdboc by echoing a - parameters into the sysfs. Here are two examples:</para> - <orderedlist> - <listitem><para>Enable kgdboc on ttyS0</para> - <para><constant>echo ttyS0 > /sys/module/kgdboc/parameters/kgdboc</constant></para></listitem> - <listitem><para>Disable kgdboc</para> - <para><constant>echo "" > /sys/module/kgdboc/parameters/kgdboc</constant></para></listitem> - </orderedlist> - <para>NOTE: You do not need to specify the baud if you are - configuring the console on tty which is already configured or - open.</para> - </sect3> - <sect3 id="kgdbocArgs3"> - <title>More examples</title> - <para>You can configure kgdboc to use the keyboard, and/or a serial device - depending on if you are using kdb and/or kgdb, in one of the - following scenarios. - <orderedlist> - <listitem><para>kdb and kgdb over only a serial port</para> - <para><constant>kgdboc=<serial_device>[,baud]</constant></para> - <para>Example: <constant>kgdboc=ttyS0,115200</constant></para> - </listitem> - <listitem><para>kdb and kgdb with keyboard and a serial port</para> - <para><constant>kgdboc=kbd,<serial_device>[,baud]</constant></para> - <para>Example: <constant>kgdboc=kbd,ttyS0,115200</constant></para> - </listitem> - <listitem><para>kdb with a keyboard</para> - <para><constant>kgdboc=kbd</constant></para> - </listitem> - <listitem><para>kdb with kernel mode setting</para> - <para><constant>kgdboc=kms,kbd</constant></para> - </listitem> - <listitem><para>kdb with kernel mode setting and kgdb over a serial port</para> - <para><constant>kgdboc=kms,kbd,ttyS0,115200</constant></para> - </listitem> - </orderedlist> - </para> - <para>NOTE: Kgdboc does not support interrupting the target via the - gdb remote protocol. You must manually send a sysrq-g unless you - have a proxy that splits console output to a terminal program. - A console proxy has a separate TCP port for the debugger and a separate - TCP port for the "human" console. The proxy can take care of sending - the sysrq-g for you. - </para> - <para>When using kgdboc with no debugger proxy, you can end up - connecting the debugger at one of two entry points. If an - exception occurs after you have loaded kgdboc, a message should - print on the console stating it is waiting for the debugger. In - this case you disconnect your terminal program and then connect the - debugger in its place. If you want to interrupt the target system - and forcibly enter a debug session you have to issue a Sysrq - sequence and then type the letter <constant>g</constant>. Then - you disconnect the terminal session and connect gdb. Your options - if you don't like this are to hack gdb to send the sysrq-g for you - as well as on the initial connect, or to use a debugger proxy that - allows an unmodified gdb to do the debugging. - </para> - </sect3> - </sect2> - </sect1> - <sect1 id="kgdbwait"> - <title>Kernel parameter: kgdbwait</title> - <para> - The Kernel command line option <constant>kgdbwait</constant> makes - kgdb wait for a debugger connection during booting of a kernel. You - can only use this option if you compiled a kgdb I/O driver into the - kernel and you specified the I/O driver configuration as a kernel - command line option. The kgdbwait parameter should always follow the - configuration parameter for the kgdb I/O driver in the kernel - command line else the I/O driver will not be configured prior to - asking the kernel to use it to wait. - </para> - <para> - The kernel will stop and wait as early as the I/O driver and - architecture allows when you use this option. If you build the - kgdb I/O driver as a loadable kernel module kgdbwait will not do - anything. - </para> - </sect1> - <sect1 id="kgdbcon"> - <title>Kernel parameter: kgdbcon</title> - <para> The kgdbcon feature allows you to see printk() messages - inside gdb while gdb is connected to the kernel. Kdb does not make - use of the kgdbcon feature. - </para> - <para>Kgdb supports using the gdb serial protocol to send console - messages to the debugger when the debugger is connected and running. - There are two ways to activate this feature. - <orderedlist> - <listitem><para>Activate with the kernel command line option:</para> - <para><constant>kgdbcon</constant></para> - </listitem> - <listitem><para>Use sysfs before configuring an I/O driver</para> - <para> - <constant>echo 1 > /sys/module/kgdb/parameters/kgdb_use_con</constant> - </para> - <para> - NOTE: If you do this after you configure the kgdb I/O driver, the - setting will not take effect until the next point the I/O is - reconfigured. - </para> - </listitem> - </orderedlist> - </para> - <para>IMPORTANT NOTE: You cannot use kgdboc + kgdbcon on a tty that is an - active system console. An example of incorrect usage is <constant>console=ttyS0,115200 kgdboc=ttyS0 kgdbcon</constant> - </para> - <para>It is possible to use this option with kgdboc on a tty that is not a system console. - </para> - </sect1> - <sect1 id="kgdbreboot"> - <title>Run time parameter: kgdbreboot</title> - <para> The kgdbreboot feature allows you to change how the debugger - deals with the reboot notification. You have 3 choices for the - behavior. The default behavior is always set to 0.</para> - <orderedlist> - <listitem><para>echo -1 > /sys/module/debug_core/parameters/kgdbreboot</para> - <para>Ignore the reboot notification entirely.</para> - </listitem> - <listitem><para>echo 0 > /sys/module/debug_core/parameters/kgdbreboot</para> - <para>Send the detach message to any attached debugger client.</para> - </listitem> - <listitem><para>echo 1 > /sys/module/debug_core/parameters/kgdbreboot</para> - <para>Enter the debugger on reboot notify.</para> - </listitem> - </orderedlist> - </sect1> - </chapter> - <chapter id="usingKDB"> - <title>Using kdb</title> - <para> - </para> - <sect1 id="quickKDBserial"> - <title>Quick start for kdb on a serial port</title> - <para>This is a quick example of how to use kdb.</para> - <para><orderedlist> - <listitem><para>Configure kgdboc at boot using kernel parameters: - <itemizedlist> - <listitem><para><constant>console=ttyS0,115200 kgdboc=ttyS0,115200</constant></para></listitem> - </itemizedlist></para> - <para>OR</para> - <para>Configure kgdboc after the kernel has booted; assuming you are using a serial port console: - <itemizedlist> - <listitem><para><constant>echo ttyS0 > /sys/module/kgdboc/parameters/kgdboc</constant></para></listitem> - </itemizedlist> - </para> - </listitem> - <listitem><para>Enter the kernel debugger manually or by waiting for an oops or fault. There are several ways you can enter the kernel debugger manually; all involve using the sysrq-g, which means you must have enabled CONFIG_MAGIC_SYSRQ=y in your kernel config.</para> - <itemizedlist> - <listitem><para>When logged in as root or with a super user session you can run:</para> - <para><constant>echo g > /proc/sysrq-trigger</constant></para></listitem> - <listitem><para>Example using minicom 2.2</para> - <para>Press: <constant>Control-a</constant></para> - <para>Press: <constant>f</constant></para> - <para>Press: <constant>g</constant></para> - </listitem> - <listitem><para>When you have telneted to a terminal server that supports sending a remote break</para> - <para>Press: <constant>Control-]</constant></para> - <para>Type in:<constant>send break</constant></para> - <para>Press: <constant>Enter</constant></para> - <para>Press: <constant>g</constant></para> - </listitem> - </itemizedlist> - </listitem> - <listitem><para>From the kdb prompt you can run the "help" command to see a complete list of the commands that are available.</para> - <para>Some useful commands in kdb include: - <itemizedlist> - <listitem><para>lsmod -- Shows where kernel modules are loaded</para></listitem> - <listitem><para>ps -- Displays only the active processes</para></listitem> - <listitem><para>ps A -- Shows all the processes</para></listitem> - <listitem><para>summary -- Shows kernel version info and memory usage</para></listitem> - <listitem><para>bt -- Get a backtrace of the current process using dump_stack()</para></listitem> - <listitem><para>dmesg -- View the kernel syslog buffer</para></listitem> - <listitem><para>go -- Continue the system</para></listitem> - </itemizedlist> - </para> - </listitem> - <listitem> - <para>When you are done using kdb you need to consider rebooting the - system or using the "go" command to resuming normal kernel - execution. If you have paused the kernel for a lengthy period of - time, applications that rely on timely networking or anything to do - with real wall clock time could be adversely affected, so you - should take this into consideration when using the kernel - debugger.</para> - </listitem> - </orderedlist></para> - </sect1> - <sect1 id="quickKDBkeyboard"> - <title>Quick start for kdb using a keyboard connected console</title> - <para>This is a quick example of how to use kdb with a keyboard.</para> - <para><orderedlist> - <listitem><para>Configure kgdboc at boot using kernel parameters: - <itemizedlist> - <listitem><para><constant>kgdboc=kbd</constant></para></listitem> - </itemizedlist></para> - <para>OR</para> - <para>Configure kgdboc after the kernel has booted: - <itemizedlist> - <listitem><para><constant>echo kbd > /sys/module/kgdboc/parameters/kgdboc</constant></para></listitem> - </itemizedlist> - </para> - </listitem> - <listitem><para>Enter the kernel debugger manually or by waiting for an oops or fault. There are several ways you can enter the kernel debugger manually; all involve using the sysrq-g, which means you must have enabled CONFIG_MAGIC_SYSRQ=y in your kernel config.</para> - <itemizedlist> - <listitem><para>When logged in as root or with a super user session you can run:</para> - <para><constant>echo g > /proc/sysrq-trigger</constant></para></listitem> - <listitem><para>Example using a laptop keyboard</para> - <para>Press and hold down: <constant>Alt</constant></para> - <para>Press and hold down: <constant>Fn</constant></para> - <para>Press and release the key with the label: <constant>SysRq</constant></para> - <para>Release: <constant>Fn</constant></para> - <para>Press and release: <constant>g</constant></para> - <para>Release: <constant>Alt</constant></para> - </listitem> - <listitem><para>Example using a PS/2 101-key keyboard</para> - <para>Press and hold down: <constant>Alt</constant></para> - <para>Press and release the key with the label: <constant>SysRq</constant></para> - <para>Press and release: <constant>g</constant></para> - <para>Release: <constant>Alt</constant></para> - </listitem> - </itemizedlist> - </listitem> - <listitem> - <para>Now type in a kdb command such as "help", "dmesg", "bt" or "go" to continue kernel execution.</para> - </listitem> - </orderedlist></para> - </sect1> - </chapter> - <chapter id="EnableKGDB"> - <title>Using kgdb / gdb</title> - <para>In order to use kgdb you must activate it by passing - configuration information to one of the kgdb I/O drivers. If you - do not pass any configuration information kgdb will not do anything - at all. Kgdb will only actively hook up to the kernel trap hooks - if a kgdb I/O driver is loaded and configured. If you unconfigure - a kgdb I/O driver, kgdb will unregister all the kernel hook points. - </para> - <para> All kgdb I/O drivers can be reconfigured at run time, if - <symbol>CONFIG_SYSFS</symbol> and <symbol>CONFIG_MODULES</symbol> - are enabled, by echo'ing a new config string to - <constant>/sys/module/<driver>/parameter/<option></constant>. - The driver can be unconfigured by passing an empty string. You cannot - change the configuration while the debugger is attached. Make sure - to detach the debugger with the <constant>detach</constant> command - prior to trying to unconfigure a kgdb I/O driver. - </para> - <sect1 id="ConnectingGDB"> - <title>Connecting with gdb to a serial port</title> - <orderedlist> - <listitem><para>Configure kgdboc</para> - <para>Configure kgdboc at boot using kernel parameters: - <itemizedlist> - <listitem><para><constant>kgdboc=ttyS0,115200</constant></para></listitem> - </itemizedlist></para> - <para>OR</para> - <para>Configure kgdboc after the kernel has booted: - <itemizedlist> - <listitem><para><constant>echo ttyS0 > /sys/module/kgdboc/parameters/kgdboc</constant></para></listitem> - </itemizedlist></para> - </listitem> - <listitem> - <para>Stop kernel execution (break into the debugger)</para> - <para>In order to connect to gdb via kgdboc, the kernel must - first be stopped. There are several ways to stop the kernel which - include using kgdbwait as a boot argument, via a sysrq-g, or running - the kernel until it takes an exception where it waits for the - debugger to attach. - <itemizedlist> - <listitem><para>When logged in as root or with a super user session you can run:</para> - <para><constant>echo g > /proc/sysrq-trigger</constant></para></listitem> - <listitem><para>Example using minicom 2.2</para> - <para>Press: <constant>Control-a</constant></para> - <para>Press: <constant>f</constant></para> - <para>Press: <constant>g</constant></para> - </listitem> - <listitem><para>When you have telneted to a terminal server that supports sending a remote break</para> - <para>Press: <constant>Control-]</constant></para> - <para>Type in:<constant>send break</constant></para> - <para>Press: <constant>Enter</constant></para> - <para>Press: <constant>g</constant></para> - </listitem> - </itemizedlist> - </para> - </listitem> - <listitem> - <para>Connect from gdb</para> - <para> - Example (using a directly connected port): - </para> - <programlisting> - % gdb ./vmlinux - (gdb) set remotebaud 115200 - (gdb) target remote /dev/ttyS0 - </programlisting> - <para> - Example (kgdb to a terminal server on TCP port 2012): - </para> - <programlisting> - % gdb ./vmlinux - (gdb) target remote 192.168.2.2:2012 - </programlisting> - <para> - Once connected, you can debug a kernel the way you would debug an - application program. - </para> - <para> - If you are having problems connecting or something is going - seriously wrong while debugging, it will most often be the case - that you want to enable gdb to be verbose about its target - communications. You do this prior to issuing the <constant>target - remote</constant> command by typing in: <constant>set debug remote 1</constant> - </para> - </listitem> - </orderedlist> - <para>Remember if you continue in gdb, and need to "break in" again, - you need to issue an other sysrq-g. It is easy to create a simple - entry point by putting a breakpoint at <constant>sys_sync</constant> - and then you can run "sync" from a shell or script to break into the - debugger.</para> - </sect1> - </chapter> - <chapter id="switchKdbKgdb"> - <title>kgdb and kdb interoperability</title> - <para>It is possible to transition between kdb and kgdb dynamically. - The debug core will remember which you used the last time and - automatically start in the same mode.</para> - <sect1> - <title>Switching between kdb and kgdb</title> - <sect2> - <title>Switching from kgdb to kdb</title> - <para> - There are two ways to switch from kgdb to kdb: you can use gdb to - issue a maintenance packet, or you can blindly type the command $3#33. - Whenever the kernel debugger stops in kgdb mode it will print the - message <constant>KGDB or $3#33 for KDB</constant>. It is important - to note that you have to type the sequence correctly in one pass. - You cannot type a backspace or delete because kgdb will interpret - that as part of the debug stream. - <orderedlist> - <listitem><para>Change from kgdb to kdb by blindly typing:</para> - <para><constant>$3#33</constant></para></listitem> - <listitem><para>Change from kgdb to kdb with gdb</para> - <para><constant>maintenance packet 3</constant></para> - <para>NOTE: Now you must kill gdb. Typically you press control-z and - issue the command: kill -9 %</para></listitem> - </orderedlist> - </para> - </sect2> - <sect2> - <title>Change from kdb to kgdb</title> - <para>There are two ways you can change from kdb to kgdb. You can - manually enter kgdb mode by issuing the kgdb command from the kdb - shell prompt, or you can connect gdb while the kdb shell prompt is - active. The kdb shell looks for the typical first commands that gdb - would issue with the gdb remote protocol and if it sees one of those - commands it automatically changes into kgdb mode.</para> - <orderedlist> - <listitem><para>From kdb issue the command:</para> - <para><constant>kgdb</constant></para> - <para>Now disconnect your terminal program and connect gdb in its place</para></listitem> - <listitem><para>At the kdb prompt, disconnect the terminal program and connect gdb in its place.</para></listitem> - </orderedlist> - </sect2> - </sect1> - <sect1> - <title>Running kdb commands from gdb</title> - <para>It is possible to run a limited set of kdb commands from gdb, - using the gdb monitor command. You don't want to execute any of the - run control or breakpoint operations, because it can disrupt the - state of the kernel debugger. You should be using gdb for - breakpoints and run control operations if you have gdb connected. - The more useful commands to run are things like lsmod, dmesg, ps or - possibly some of the memory information commands. To see all the kdb - commands you can run <constant>monitor help</constant>.</para> - <para>Example: - <informalexample><programlisting> -(gdb) monitor ps -1 idle process (state I) and -27 sleeping system daemon (state M) processes suppressed, -use 'ps A' to see all. -Task Addr Pid Parent [*] cpu State Thread Command - -0xc78291d0 1 0 0 0 S 0xc7829404 init -0xc7954150 942 1 0 0 S 0xc7954384 dropbear -0xc78789c0 944 1 0 0 S 0xc7878bf4 sh -(gdb) - </programlisting></informalexample> - </para> - </sect1> - </chapter> - <chapter id="KGDBTestSuite"> - <title>kgdb Test Suite</title> - <para> - When kgdb is enabled in the kernel config you can also elect to - enable the config parameter KGDB_TESTS. Turning this on will - enable a special kgdb I/O module which is designed to test the - kgdb internal functions. - </para> - <para> - The kgdb tests are mainly intended for developers to test the kgdb - internals as well as a tool for developing a new kgdb architecture - specific implementation. These tests are not really for end users - of the Linux kernel. The primary source of documentation would be - to look in the drivers/misc/kgdbts.c file. - </para> - <para> - The kgdb test suite can also be configured at compile time to run - the core set of tests by setting the kernel config parameter - KGDB_TESTS_ON_BOOT. This particular option is aimed at automated - regression testing and does not require modifying the kernel boot - config arguments. If this is turned on, the kgdb test suite can - be disabled by specifying "kgdbts=" as a kernel boot argument. - </para> - </chapter> - <chapter id="CommonBackEndReq"> - <title>Kernel Debugger Internals</title> - <sect1 id="kgdbArchitecture"> - <title>Architecture Specifics</title> - <para> - The kernel debugger is organized into a number of components: - <orderedlist> - <listitem><para>The debug core</para> - <para> - The debug core is found in kernel/debugger/debug_core.c. It contains: - <itemizedlist> - <listitem><para>A generic OS exception handler which includes - sync'ing the processors into a stopped state on an multi-CPU - system.</para></listitem> - <listitem><para>The API to talk to the kgdb I/O drivers</para></listitem> - <listitem><para>The API to make calls to the arch-specific kgdb implementation</para></listitem> - <listitem><para>The logic to perform safe memory reads and writes to memory while using the debugger</para></listitem> - <listitem><para>A full implementation for software breakpoints unless overridden by the arch</para></listitem> - <listitem><para>The API to invoke either the kdb or kgdb frontend to the debug core.</para></listitem> - <listitem><para>The structures and callback API for atomic kernel mode setting.</para> - <para>NOTE: kgdboc is where the kms callbacks are invoked.</para></listitem> - </itemizedlist> - </para> - </listitem> - <listitem><para>kgdb arch-specific implementation</para> - <para> - This implementation is generally found in arch/*/kernel/kgdb.c. - As an example, arch/x86/kernel/kgdb.c contains the specifics to - implement HW breakpoint as well as the initialization to - dynamically register and unregister for the trap handlers on - this architecture. The arch-specific portion implements: - <itemizedlist> - <listitem><para>contains an arch-specific trap catcher which - invokes kgdb_handle_exception() to start kgdb about doing its - work</para></listitem> - <listitem><para>translation to and from gdb specific packet format to pt_regs</para></listitem> - <listitem><para>Registration and unregistration of architecture specific trap hooks</para></listitem> - <listitem><para>Any special exception handling and cleanup</para></listitem> - <listitem><para>NMI exception handling and cleanup</para></listitem> - <listitem><para>(optional) HW breakpoints</para></listitem> - </itemizedlist> - </para> - </listitem> - <listitem><para>gdbstub frontend (aka kgdb)</para> - <para>The gdbstub is located in kernel/debug/gdbstub.c. It contains:</para> - <itemizedlist> - <listitem><para>All the logic to implement the gdb serial protocol</para></listitem> - </itemizedlist> - </listitem> - <listitem><para>kdb frontend</para> - <para>The kdb debugger shell is broken down into a number of - components. The kdb core is located in kernel/debug/kdb. There - are a number of helper functions in some of the other kernel - components to make it possible for kdb to examine and report - information about the kernel without taking locks that could - cause a kernel deadlock. The kdb core contains implements the following functionality.</para> - <itemizedlist> - <listitem><para>A simple shell</para></listitem> - <listitem><para>The kdb core command set</para></listitem> - <listitem><para>A registration API to register additional kdb shell commands.</para> - <itemizedlist> - <listitem><para>A good example of a self-contained kdb module - is the "ftdump" command for dumping the ftrace buffer. See: - kernel/trace/trace_kdb.c</para></listitem> - <listitem><para>For an example of how to dynamically register - a new kdb command you can build the kdb_hello.ko kernel module - from samples/kdb/kdb_hello.c. To build this example you can - set CONFIG_SAMPLES=y and CONFIG_SAMPLE_KDB=m in your kernel - config. Later run "modprobe kdb_hello" and the next time you - enter the kdb shell, you can run the "hello" - command.</para></listitem> - </itemizedlist></listitem> - <listitem><para>The implementation for kdb_printf() which - emits messages directly to I/O drivers, bypassing the kernel - log.</para></listitem> - <listitem><para>SW / HW breakpoint management for the kdb shell</para></listitem> - </itemizedlist> - </listitem> - <listitem><para>kgdb I/O driver</para> - <para> - Each kgdb I/O driver has to provide an implementation for the following: - <itemizedlist> - <listitem><para>configuration via built-in or module</para></listitem> - <listitem><para>dynamic configuration and kgdb hook registration calls</para></listitem> - <listitem><para>read and write character interface</para></listitem> - <listitem><para>A cleanup handler for unconfiguring from the kgdb core</para></listitem> - <listitem><para>(optional) Early debug methodology</para></listitem> - </itemizedlist> - Any given kgdb I/O driver has to operate very closely with the - hardware and must do it in such a way that does not enable - interrupts or change other parts of the system context without - completely restoring them. The kgdb core will repeatedly "poll" - a kgdb I/O driver for characters when it needs input. The I/O - driver is expected to return immediately if there is no data - available. Doing so allows for the future possibility to touch - watchdog hardware in such a way as to have a target system not - reset when these are enabled. - </para> - </listitem> - </orderedlist> - </para> - <para> - If you are intent on adding kgdb architecture specific support - for a new architecture, the architecture should define - <constant>HAVE_ARCH_KGDB</constant> in the architecture specific - Kconfig file. This will enable kgdb for the architecture, and - at that point you must create an architecture specific kgdb - implementation. - </para> - <para> - There are a few flags which must be set on every architecture in - their <asm/kgdb.h> file. These are: - <itemizedlist> - <listitem> - <para> - NUMREGBYTES: The size in bytes of all of the registers, so - that we can ensure they will all fit into a packet. - </para> - </listitem> - <listitem> - <para> - BUFMAX: The size in bytes of the buffer GDB will read into. - This must be larger than NUMREGBYTES. - </para> - </listitem> - <listitem> - <para> - CACHE_FLUSH_IS_SAFE: Set to 1 if it is always safe to call - flush_cache_range or flush_icache_range. On some architectures, - these functions may not be safe to call on SMP since we keep other - CPUs in a holding pattern. - </para> - </listitem> - </itemizedlist> - </para> - <para> - There are also the following functions for the common backend, - found in kernel/kgdb.c, that must be supplied by the - architecture-specific backend unless marked as (optional), in - which case a default function maybe used if the architecture - does not need to provide a specific implementation. - </para> -!Iinclude/linux/kgdb.h - </sect1> - <sect1 id="kgdbocDesign"> - <title>kgdboc internals</title> - <sect2> - <title>kgdboc and uarts</title> - <para> - The kgdboc driver is actually a very thin driver that relies on the - underlying low level to the hardware driver having "polling hooks" - to which the tty driver is attached. In the initial - implementation of kgdboc the serial_core was changed to expose a - low level UART hook for doing polled mode reading and writing of a - single character while in an atomic context. When kgdb makes an I/O - request to the debugger, kgdboc invokes a callback in the serial - core which in turn uses the callback in the UART driver.</para> - <para> - When using kgdboc with a UART, the UART driver must implement two callbacks in the <constant>struct uart_ops</constant>. Example from drivers/8250.c:<programlisting> -#ifdef CONFIG_CONSOLE_POLL - .poll_get_char = serial8250_get_poll_char, - .poll_put_char = serial8250_put_poll_char, -#endif - </programlisting> - Any implementation specifics around creating a polling driver use the - <constant>#ifdef CONFIG_CONSOLE_POLL</constant>, as shown above. - Keep in mind that polling hooks have to be implemented in such a way - that they can be called from an atomic context and have to restore - the state of the UART chip on return such that the system can return - to normal when the debugger detaches. You need to be very careful - with any kind of lock you consider, because failing here is most likely - going to mean pressing the reset button. - </para> - </sect2> - <sect2 id="kgdbocKbd"> - <title>kgdboc and keyboards</title> - <para>The kgdboc driver contains logic to configure communications - with an attached keyboard. The keyboard infrastructure is only - compiled into the kernel when CONFIG_KDB_KEYBOARD=y is set in the - kernel configuration.</para> - <para>The core polled keyboard driver driver for PS/2 type keyboards - is in drivers/char/kdb_keyboard.c. This driver is hooked into the - debug core when kgdboc populates the callback in the array - called <constant>kdb_poll_funcs[]</constant>. The - kdb_get_kbd_char() is the top-level function which polls hardware - for single character input. - </para> - </sect2> - <sect2 id="kgdbocKms"> - <title>kgdboc and kms</title> - <para>The kgdboc driver contains logic to request the graphics - display to switch to a text context when you are using - "kgdboc=kms,kbd", provided that you have a video driver which has a - frame buffer console and atomic kernel mode setting support.</para> - <para> - Every time the kernel - debugger is entered it calls kgdboc_pre_exp_handler() which in turn - calls con_debug_enter() in the virtual console layer. On resuming kernel - execution, the kernel debugger calls kgdboc_post_exp_handler() which - in turn calls con_debug_leave().</para> - <para>Any video driver that wants to be compatible with the kernel - debugger and the atomic kms callbacks must implement the - mode_set_base_atomic, fb_debug_enter and fb_debug_leave operations. - For the fb_debug_enter and fb_debug_leave the option exists to use - the generic drm fb helper functions or implement something custom for - the hardware. The following example shows the initialization of the - .mode_set_base_atomic operation in - drivers/gpu/drm/i915/intel_display.c: - <informalexample> - <programlisting> -static const struct drm_crtc_helper_funcs intel_helper_funcs = { -[...] - .mode_set_base_atomic = intel_pipe_set_base_atomic, -[...] -}; - </programlisting> - </informalexample> - </para> - <para>Here is an example of how the i915 driver initializes the fb_debug_enter and fb_debug_leave functions to use the generic drm helpers in - drivers/gpu/drm/i915/intel_fb.c: - <informalexample> - <programlisting> -static struct fb_ops intelfb_ops = { -[...] - .fb_debug_enter = drm_fb_helper_debug_enter, - .fb_debug_leave = drm_fb_helper_debug_leave, -[...] -}; - </programlisting> - </informalexample> - </para> - </sect2> - </sect1> - </chapter> - <chapter id="credits"> - <title>Credits</title> - <para> - The following people have contributed to this document: - <orderedlist> - <listitem><para>Amit Kale<email>amitkale@linsyssoft.com</email></para></listitem> - <listitem><para>Tom Rini<email>trini@kernel.crashing.org</email></para></listitem> - </orderedlist> - In March 2008 this document was completely rewritten by: - <itemizedlist> - <listitem><para>Jason Wessel<email>jason.wessel@windriver.com</email></para></listitem> - </itemizedlist> - In Jan 2010 this document was updated to include kdb. - <itemizedlist> - <listitem><para>Jason Wessel<email>jason.wessel@windriver.com</email></para></listitem> - </itemizedlist> - </para> - </chapter> -</book> - diff --git a/Documentation/DocBook/libata.tmpl b/Documentation/DocBook/libata.tmpl deleted file mode 100644 index 0320910b866d..000000000000 --- a/Documentation/DocBook/libata.tmpl +++ /dev/null @@ -1,1625 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" - "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> - -<book id="libataDevGuide"> - <bookinfo> - <title>libATA Developer's Guide</title> - - <authorgroup> - <author> - <firstname>Jeff</firstname> - <surname>Garzik</surname> - </author> - </authorgroup> - - <copyright> - <year>2003-2006</year> - <holder>Jeff Garzik</holder> - </copyright> - - <legalnotice> - <para> - The contents of this file are subject to the Open - Software License version 1.1 that can be found at - <ulink url="http://fedoraproject.org/wiki/Licensing:OSL1.1">http://fedoraproject.org/wiki/Licensing:OSL1.1</ulink> - and is included herein by reference. - </para> - - <para> - Alternatively, the contents of this file may be used under the terms - of the GNU General Public License version 2 (the "GPL") as distributed - in the kernel source COPYING file, in which case the provisions of - the GPL are applicable instead of the above. If you wish to allow - the use of your version of this file only under the terms of the - GPL and not to allow others to use your version of this file under - the OSL, indicate your decision by deleting the provisions above and - replace them with the notice and other provisions required by the GPL. - If you do not delete the provisions above, a recipient may use your - version of this file under either the OSL or the GPL. - </para> - - </legalnotice> - </bookinfo> - -<toc></toc> - - <chapter id="libataIntroduction"> - <title>Introduction</title> - <para> - libATA is a library used inside the Linux kernel to support ATA host - controllers and devices. libATA provides an ATA driver API, class - transports for ATA and ATAPI devices, and SCSI<->ATA translation - for ATA devices according to the T10 SAT specification. - </para> - <para> - This Guide documents the libATA driver API, library functions, library - internals, and a couple sample ATA low-level drivers. - </para> - </chapter> - - <chapter id="libataDriverApi"> - <title>libata Driver API</title> - <para> - struct ata_port_operations is defined for every low-level libata - hardware driver, and it controls how the low-level driver - interfaces with the ATA and SCSI layers. - </para> - <para> - FIS-based drivers will hook into the system with ->qc_prep() and - ->qc_issue() high-level hooks. Hardware which behaves in a manner - similar to PCI IDE hardware may utilize several generic helpers, - defining at a bare minimum the bus I/O addresses of the ATA shadow - register blocks. - </para> - <sect1> - <title>struct ata_port_operations</title> - - <sect2><title>Disable ATA port</title> - <programlisting> -void (*port_disable) (struct ata_port *); - </programlisting> - - <para> - Called from ata_bus_probe() error path, as well as when - unregistering from the SCSI module (rmmod, hot unplug). - This function should do whatever needs to be done to take the - port out of use. In most cases, ata_port_disable() can be used - as this hook. - </para> - <para> - Called from ata_bus_probe() on a failed probe. - Called from ata_scsi_release(). - </para> - - </sect2> - - <sect2><title>Post-IDENTIFY device configuration</title> - <programlisting> -void (*dev_config) (struct ata_port *, struct ata_device *); - </programlisting> - - <para> - Called after IDENTIFY [PACKET] DEVICE is issued to each device - found. Typically used to apply device-specific fixups prior to - issue of SET FEATURES - XFER MODE, and prior to operation. - </para> - <para> - This entry may be specified as NULL in ata_port_operations. - </para> - - </sect2> - - <sect2><title>Set PIO/DMA mode</title> - <programlisting> -void (*set_piomode) (struct ata_port *, struct ata_device *); -void (*set_dmamode) (struct ata_port *, struct ata_device *); -void (*post_set_mode) (struct ata_port *); -unsigned int (*mode_filter) (struct ata_port *, struct ata_device *, unsigned int); - </programlisting> - - <para> - Hooks called prior to the issue of SET FEATURES - XFER MODE - command. The optional ->mode_filter() hook is called when libata - has built a mask of the possible modes. This is passed to the - ->mode_filter() function which should return a mask of valid modes - after filtering those unsuitable due to hardware limits. It is not - valid to use this interface to add modes. - </para> - <para> - dev->pio_mode and dev->dma_mode are guaranteed to be valid when - ->set_piomode() and when ->set_dmamode() is called. The timings for - any other drive sharing the cable will also be valid at this point. - That is the library records the decisions for the modes of each - drive on a channel before it attempts to set any of them. - </para> - <para> - ->post_set_mode() is - called unconditionally, after the SET FEATURES - XFER MODE - command completes successfully. - </para> - - <para> - ->set_piomode() is always called (if present), but - ->set_dma_mode() is only called if DMA is possible. - </para> - - </sect2> - - <sect2><title>Taskfile read/write</title> - <programlisting> -void (*sff_tf_load) (struct ata_port *ap, struct ata_taskfile *tf); -void (*sff_tf_read) (struct ata_port *ap, struct ata_taskfile *tf); - </programlisting> - - <para> - ->tf_load() is called to load the given taskfile into hardware - registers / DMA buffers. ->tf_read() is called to read the - hardware registers / DMA buffers, to obtain the current set of - taskfile register values. - Most drivers for taskfile-based hardware (PIO or MMIO) use - ata_sff_tf_load() and ata_sff_tf_read() for these hooks. - </para> - - </sect2> - - <sect2><title>PIO data read/write</title> - <programlisting> -void (*sff_data_xfer) (struct ata_device *, unsigned char *, unsigned int, int); - </programlisting> - - <para> -All bmdma-style drivers must implement this hook. This is the low-level -operation that actually copies the data bytes during a PIO data -transfer. -Typically the driver will choose one of ata_sff_data_xfer_noirq(), -ata_sff_data_xfer(), or ata_sff_data_xfer32(). - </para> - - </sect2> - - <sect2><title>ATA command execute</title> - <programlisting> -void (*sff_exec_command)(struct ata_port *ap, struct ata_taskfile *tf); - </programlisting> - - <para> - causes an ATA command, previously loaded with - ->tf_load(), to be initiated in hardware. - Most drivers for taskfile-based hardware use ata_sff_exec_command() - for this hook. - </para> - - </sect2> - - <sect2><title>Per-cmd ATAPI DMA capabilities filter</title> - <programlisting> -int (*check_atapi_dma) (struct ata_queued_cmd *qc); - </programlisting> - - <para> -Allow low-level driver to filter ATA PACKET commands, returning a status -indicating whether or not it is OK to use DMA for the supplied PACKET -command. - </para> - <para> - This hook may be specified as NULL, in which case libata will - assume that atapi dma can be supported. - </para> - - </sect2> - - <sect2><title>Read specific ATA shadow registers</title> - <programlisting> -u8 (*sff_check_status)(struct ata_port *ap); -u8 (*sff_check_altstatus)(struct ata_port *ap); - </programlisting> - - <para> - Reads the Status/AltStatus ATA shadow register from - hardware. On some hardware, reading the Status register has - the side effect of clearing the interrupt condition. - Most drivers for taskfile-based hardware use - ata_sff_check_status() for this hook. - </para> - - </sect2> - - <sect2><title>Write specific ATA shadow register</title> - <programlisting> -void (*sff_set_devctl)(struct ata_port *ap, u8 ctl); - </programlisting> - - <para> - Write the device control ATA shadow register to the hardware. - Most drivers don't need to define this. - </para> - - </sect2> - - <sect2><title>Select ATA device on bus</title> - <programlisting> -void (*sff_dev_select)(struct ata_port *ap, unsigned int device); - </programlisting> - - <para> - Issues the low-level hardware command(s) that causes one of N - hardware devices to be considered 'selected' (active and - available for use) on the ATA bus. This generally has no - meaning on FIS-based devices. - </para> - <para> - Most drivers for taskfile-based hardware use - ata_sff_dev_select() for this hook. - </para> - - </sect2> - - <sect2><title>Private tuning method</title> - <programlisting> -void (*set_mode) (struct ata_port *ap); - </programlisting> - - <para> - By default libata performs drive and controller tuning in - accordance with the ATA timing rules and also applies blacklists - and cable limits. Some controllers need special handling and have - custom tuning rules, typically raid controllers that use ATA - commands but do not actually do drive timing. - </para> - - <warning> - <para> - This hook should not be used to replace the standard controller - tuning logic when a controller has quirks. Replacing the default - tuning logic in that case would bypass handling for drive and - bridge quirks that may be important to data reliability. If a - controller needs to filter the mode selection it should use the - mode_filter hook instead. - </para> - </warning> - - </sect2> - - <sect2><title>Control PCI IDE BMDMA engine</title> - <programlisting> -void (*bmdma_setup) (struct ata_queued_cmd *qc); -void (*bmdma_start) (struct ata_queued_cmd *qc); -void (*bmdma_stop) (struct ata_port *ap); -u8 (*bmdma_status) (struct ata_port *ap); - </programlisting> - - <para> -When setting up an IDE BMDMA transaction, these hooks arm -(->bmdma_setup), fire (->bmdma_start), and halt (->bmdma_stop) -the hardware's DMA engine. ->bmdma_status is used to read the standard -PCI IDE DMA Status register. - </para> - - <para> -These hooks are typically either no-ops, or simply not implemented, in -FIS-based drivers. - </para> - <para> -Most legacy IDE drivers use ata_bmdma_setup() for the bmdma_setup() -hook. ata_bmdma_setup() will write the pointer to the PRD table to -the IDE PRD Table Address register, enable DMA in the DMA Command -register, and call exec_command() to begin the transfer. - </para> - <para> -Most legacy IDE drivers use ata_bmdma_start() for the bmdma_start() -hook. ata_bmdma_start() will write the ATA_DMA_START flag to the DMA -Command register. - </para> - <para> -Many legacy IDE drivers use ata_bmdma_stop() for the bmdma_stop() -hook. ata_bmdma_stop() clears the ATA_DMA_START flag in the DMA -command register. - </para> - <para> -Many legacy IDE drivers use ata_bmdma_status() as the bmdma_status() hook. - </para> - - </sect2> - - <sect2><title>High-level taskfile hooks</title> - <programlisting> -void (*qc_prep) (struct ata_queued_cmd *qc); -int (*qc_issue) (struct ata_queued_cmd *qc); - </programlisting> - - <para> - Higher-level hooks, these two hooks can potentially supercede - several of the above taskfile/DMA engine hooks. ->qc_prep is - called after the buffers have been DMA-mapped, and is typically - used to populate the hardware's DMA scatter-gather table. - Most drivers use the standard ata_qc_prep() helper function, but - more advanced drivers roll their own. - </para> - <para> - ->qc_issue is used to make a command active, once the hardware - and S/G tables have been prepared. IDE BMDMA drivers use the - helper function ata_qc_issue_prot() for taskfile protocol-based - dispatch. More advanced drivers implement their own ->qc_issue. - </para> - <para> - ata_qc_issue_prot() calls ->tf_load(), ->bmdma_setup(), and - ->bmdma_start() as necessary to initiate a transfer. - </para> - - </sect2> - - <sect2><title>Exception and probe handling (EH)</title> - <programlisting> -void (*eng_timeout) (struct ata_port *ap); -void (*phy_reset) (struct ata_port *ap); - </programlisting> - - <para> -Deprecated. Use ->error_handler() instead. - </para> - - <programlisting> -void (*freeze) (struct ata_port *ap); -void (*thaw) (struct ata_port *ap); - </programlisting> - - <para> -ata_port_freeze() is called when HSM violations or some other -condition disrupts normal operation of the port. A frozen port -is not allowed to perform any operation until the port is -thawed, which usually follows a successful reset. - </para> - - <para> -The optional ->freeze() callback can be used for freezing the port -hardware-wise (e.g. mask interrupt and stop DMA engine). If a -port cannot be frozen hardware-wise, the interrupt handler -must ack and clear interrupts unconditionally while the port -is frozen. - </para> - <para> -The optional ->thaw() callback is called to perform the opposite of ->freeze(): -prepare the port for normal operation once again. Unmask interrupts, -start DMA engine, etc. - </para> - - <programlisting> -void (*error_handler) (struct ata_port *ap); - </programlisting> - - <para> -->error_handler() is a driver's hook into probe, hotplug, and recovery -and other exceptional conditions. The primary responsibility of an -implementation is to call ata_do_eh() or ata_bmdma_drive_eh() with a set -of EH hooks as arguments: - </para> - - <para> -'prereset' hook (may be NULL) is called during an EH reset, before any other actions -are taken. - </para> - - <para> -'postreset' hook (may be NULL) is called after the EH reset is performed. Based on -existing conditions, severity of the problem, and hardware capabilities, - </para> - - <para> -Either 'softreset' (may be NULL) or 'hardreset' (may be NULL) will be -called to perform the low-level EH reset. - </para> - - <programlisting> -void (*post_internal_cmd) (struct ata_queued_cmd *qc); - </programlisting> - - <para> -Perform any hardware-specific actions necessary to finish processing -after executing a probe-time or EH-time command via ata_exec_internal(). - </para> - - </sect2> - - <sect2><title>Hardware interrupt handling</title> - <programlisting> -irqreturn_t (*irq_handler)(int, void *, struct pt_regs *); -void (*irq_clear) (struct ata_port *); - </programlisting> - - <para> - ->irq_handler is the interrupt handling routine registered with - the system, by libata. ->irq_clear is called during probe just - before the interrupt handler is registered, to be sure hardware - is quiet. - </para> - <para> - The second argument, dev_instance, should be cast to a pointer - to struct ata_host_set. - </para> - <para> - Most legacy IDE drivers use ata_sff_interrupt() for the - irq_handler hook, which scans all ports in the host_set, - determines which queued command was active (if any), and calls - ata_sff_host_intr(ap,qc). - </para> - <para> - Most legacy IDE drivers use ata_sff_irq_clear() for the - irq_clear() hook, which simply clears the interrupt and error - flags in the DMA status register. - </para> - - </sect2> - - <sect2><title>SATA phy read/write</title> - <programlisting> -int (*scr_read) (struct ata_port *ap, unsigned int sc_reg, - u32 *val); -int (*scr_write) (struct ata_port *ap, unsigned int sc_reg, - u32 val); - </programlisting> - - <para> - Read and write standard SATA phy registers. Currently only used - if ->phy_reset hook called the sata_phy_reset() helper function. - sc_reg is one of SCR_STATUS, SCR_CONTROL, SCR_ERROR, or SCR_ACTIVE. - </para> - - </sect2> - - <sect2><title>Init and shutdown</title> - <programlisting> -int (*port_start) (struct ata_port *ap); -void (*port_stop) (struct ata_port *ap); -void (*host_stop) (struct ata_host_set *host_set); - </programlisting> - - <para> - ->port_start() is called just after the data structures for each - port are initialized. Typically this is used to alloc per-port - DMA buffers / tables / rings, enable DMA engines, and similar - tasks. Some drivers also use this entry point as a chance to - allocate driver-private memory for ap->private_data. - </para> - <para> - Many drivers use ata_port_start() as this hook or call - it from their own port_start() hooks. ata_port_start() - allocates space for a legacy IDE PRD table and returns. - </para> - <para> - ->port_stop() is called after ->host_stop(). Its sole function - is to release DMA/memory resources, now that they are no longer - actively being used. Many drivers also free driver-private - data from port at this time. - </para> - <para> - ->host_stop() is called after all ->port_stop() calls -have completed. The hook must finalize hardware shutdown, release DMA -and other resources, etc. - This hook may be specified as NULL, in which case it is not called. - </para> - - </sect2> - - </sect1> - </chapter> - - <chapter id="libataEH"> - <title>Error handling</title> - - <para> - This chapter describes how errors are handled under libata. - Readers are advised to read SCSI EH - (Documentation/scsi/scsi_eh.txt) and ATA exceptions doc first. - </para> - - <sect1><title>Origins of commands</title> - <para> - In libata, a command is represented with struct ata_queued_cmd - or qc. qc's are preallocated during port initialization and - repetitively used for command executions. Currently only one - qc is allocated per port but yet-to-be-merged NCQ branch - allocates one for each tag and maps each qc to NCQ tag 1-to-1. - </para> - <para> - libata commands can originate from two sources - libata itself - and SCSI midlayer. libata internal commands are used for - initialization and error handling. All normal blk requests - and commands for SCSI emulation are passed as SCSI commands - through queuecommand callback of SCSI host template. - </para> - </sect1> - - <sect1><title>How commands are issued</title> - - <variablelist> - - <varlistentry><term>Internal commands</term> - <listitem> - <para> - First, qc is allocated and initialized using - ata_qc_new_init(). Although ata_qc_new_init() doesn't - implement any wait or retry mechanism when qc is not - available, internal commands are currently issued only during - initialization and error recovery, so no other command is - active and allocation is guaranteed to succeed. - </para> - <para> - Once allocated qc's taskfile is initialized for the command to - be executed. qc currently has two mechanisms to notify - completion. One is via qc->complete_fn() callback and the - other is completion qc->waiting. qc->complete_fn() callback - is the asynchronous path used by normal SCSI translated - commands and qc->waiting is the synchronous (issuer sleeps in - process context) path used by internal commands. - </para> - <para> - Once initialization is complete, host_set lock is acquired - and the qc is issued. - </para> - </listitem> - </varlistentry> - - <varlistentry><term>SCSI commands</term> - <listitem> - <para> - All libata drivers use ata_scsi_queuecmd() as - hostt->queuecommand callback. scmds can either be simulated - or translated. No qc is involved in processing a simulated - scmd. The result is computed right away and the scmd is - completed. - </para> - <para> - For a translated scmd, ata_qc_new_init() is invoked to - allocate a qc and the scmd is translated into the qc. SCSI - midlayer's completion notification function pointer is stored - into qc->scsidone. - </para> - <para> - qc->complete_fn() callback is used for completion - notification. ATA commands use ata_scsi_qc_complete() while - ATAPI commands use atapi_qc_complete(). Both functions end up - calling qc->scsidone to notify upper layer when the qc is - finished. After translation is completed, the qc is issued - with ata_qc_issue(). - </para> - <para> - Note that SCSI midlayer invokes hostt->queuecommand while - holding host_set lock, so all above occur while holding - host_set lock. - </para> - </listitem> - </varlistentry> - - </variablelist> - </sect1> - - <sect1><title>How commands are processed</title> - <para> - Depending on which protocol and which controller are used, - commands are processed differently. For the purpose of - discussion, a controller which uses taskfile interface and all - standard callbacks is assumed. - </para> - <para> - Currently 6 ATA command protocols are used. They can be - sorted into the following four categories according to how - they are processed. - </para> - - <variablelist> - <varlistentry><term>ATA NO DATA or DMA</term> - <listitem> - <para> - ATA_PROT_NODATA and ATA_PROT_DMA fall into this category. - These types of commands don't require any software - intervention once issued. Device will raise interrupt on - completion. - </para> - </listitem> - </varlistentry> - - <varlistentry><term>ATA PIO</term> - <listitem> - <para> - ATA_PROT_PIO is in this category. libata currently - implements PIO with polling. ATA_NIEN bit is set to turn - off interrupt and pio_task on ata_wq performs polling and - IO. - </para> - </listitem> - </varlistentry> - - <varlistentry><term>ATAPI NODATA or DMA</term> - <listitem> - <para> - ATA_PROT_ATAPI_NODATA and ATA_PROT_ATAPI_DMA are in this - category. packet_task is used to poll BSY bit after - issuing PACKET command. Once BSY is turned off by the - device, packet_task transfers CDB and hands off processing - to interrupt handler. - </para> - </listitem> - </varlistentry> - - <varlistentry><term>ATAPI PIO</term> - <listitem> - <para> - ATA_PROT_ATAPI is in this category. ATA_NIEN bit is set - and, as in ATAPI NODATA or DMA, packet_task submits cdb. - However, after submitting cdb, further processing (data - transfer) is handed off to pio_task. - </para> - </listitem> - </varlistentry> - </variablelist> - </sect1> - - <sect1><title>How commands are completed</title> - <para> - Once issued, all qc's are either completed with - ata_qc_complete() or time out. For commands which are handled - by interrupts, ata_host_intr() invokes ata_qc_complete(), and, - for PIO tasks, pio_task invokes ata_qc_complete(). In error - cases, packet_task may also complete commands. - </para> - <para> - ata_qc_complete() does the following. - </para> - - <orderedlist> - - <listitem> - <para> - DMA memory is unmapped. - </para> - </listitem> - - <listitem> - <para> - ATA_QCFLAG_ACTIVE is cleared from qc->flags. - </para> - </listitem> - - <listitem> - <para> - qc->complete_fn() callback is invoked. If the return value of - the callback is not zero. Completion is short circuited and - ata_qc_complete() returns. - </para> - </listitem> - - <listitem> - <para> - __ata_qc_complete() is called, which does - <orderedlist> - - <listitem> - <para> - qc->flags is cleared to zero. - </para> - </listitem> - - <listitem> - <para> - ap->active_tag and qc->tag are poisoned. - </para> - </listitem> - - <listitem> - <para> - qc->waiting is cleared & completed (in that order). - </para> - </listitem> - - <listitem> - <para> - qc is deallocated by clearing appropriate bit in ap->qactive. - </para> - </listitem> - - </orderedlist> - </para> - </listitem> - - </orderedlist> - - <para> - So, it basically notifies upper layer and deallocates qc. One - exception is short-circuit path in #3 which is used by - atapi_qc_complete(). - </para> - <para> - For all non-ATAPI commands, whether it fails or not, almost - the same code path is taken and very little error handling - takes place. A qc is completed with success status if it - succeeded, with failed status otherwise. - </para> - <para> - However, failed ATAPI commands require more handling as - REQUEST SENSE is needed to acquire sense data. If an ATAPI - command fails, ata_qc_complete() is invoked with error status, - which in turn invokes atapi_qc_complete() via - qc->complete_fn() callback. - </para> - <para> - This makes atapi_qc_complete() set scmd->result to - SAM_STAT_CHECK_CONDITION, complete the scmd and return 1. As - the sense data is empty but scmd->result is CHECK CONDITION, - SCSI midlayer will invoke EH for the scmd, and returning 1 - makes ata_qc_complete() to return without deallocating the qc. - This leads us to ata_scsi_error() with partially completed qc. - </para> - - </sect1> - - <sect1><title>ata_scsi_error()</title> - <para> - ata_scsi_error() is the current transportt->eh_strategy_handler() - for libata. As discussed above, this will be entered in two - cases - timeout and ATAPI error completion. This function - calls low level libata driver's eng_timeout() callback, the - standard callback for which is ata_eng_timeout(). It checks - if a qc is active and calls ata_qc_timeout() on the qc if so. - Actual error handling occurs in ata_qc_timeout(). - </para> - <para> - If EH is invoked for timeout, ata_qc_timeout() stops BMDMA and - completes the qc. Note that as we're currently in EH, we - cannot call scsi_done. As described in SCSI EH doc, a - recovered scmd should be either retried with - scsi_queue_insert() or finished with scsi_finish_command(). - Here, we override qc->scsidone with scsi_finish_command() and - calls ata_qc_complete(). - </para> - <para> - If EH is invoked due to a failed ATAPI qc, the qc here is - completed but not deallocated. The purpose of this - half-completion is to use the qc as place holder to make EH - code reach this place. This is a bit hackish, but it works. - </para> - <para> - Once control reaches here, the qc is deallocated by invoking - __ata_qc_complete() explicitly. Then, internal qc for REQUEST - SENSE is issued. Once sense data is acquired, scmd is - finished by directly invoking scsi_finish_command() on the - scmd. Note that as we already have completed and deallocated - the qc which was associated with the scmd, we don't need - to/cannot call ata_qc_complete() again. - </para> - - </sect1> - - <sect1><title>Problems with the current EH</title> - - <itemizedlist> - - <listitem> - <para> - Error representation is too crude. Currently any and all - error conditions are represented with ATA STATUS and ERROR - registers. Errors which aren't ATA device errors are treated - as ATA device errors by setting ATA_ERR bit. Better error - descriptor which can properly represent ATA and other - errors/exceptions is needed. - </para> - </listitem> - - <listitem> - <para> - When handling timeouts, no action is taken to make device - forget about the timed out command and ready for new commands. - </para> - </listitem> - - <listitem> - <para> - EH handling via ata_scsi_error() is not properly protected - from usual command processing. On EH entrance, the device is - not in quiescent state. Timed out commands may succeed or - fail any time. pio_task and atapi_task may still be running. - </para> - </listitem> - - <listitem> - <para> - Too weak error recovery. Devices / controllers causing HSM - mismatch errors and other errors quite often require reset to - return to known state. Also, advanced error handling is - necessary to support features like NCQ and hotplug. - </para> - </listitem> - - <listitem> - <para> - ATA errors are directly handled in the interrupt handler and - PIO errors in pio_task. This is problematic for advanced - error handling for the following reasons. - </para> - <para> - First, advanced error handling often requires context and - internal qc execution. - </para> - <para> - Second, even a simple failure (say, CRC error) needs - information gathering and could trigger complex error handling - (say, resetting & reconfiguring). Having multiple code - paths to gather information, enter EH and trigger actions - makes life painful. - </para> - <para> - Third, scattered EH code makes implementing low level drivers - difficult. Low level drivers override libata callbacks. If - EH is scattered over several places, each affected callbacks - should perform its part of error handling. This can be error - prone and painful. - </para> - </listitem> - - </itemizedlist> - </sect1> - </chapter> - - <chapter id="libataExt"> - <title>libata Library</title> -!Edrivers/ata/libata-core.c - </chapter> - - <chapter id="libataInt"> - <title>libata Core Internals</title> -!Idrivers/ata/libata-core.c - </chapter> - - <chapter id="libataScsiInt"> - <title>libata SCSI translation/emulation</title> -!Edrivers/ata/libata-scsi.c -!Idrivers/ata/libata-scsi.c - </chapter> - - <chapter id="ataExceptions"> - <title>ATA errors and exceptions</title> - - <para> - This chapter tries to identify what error/exception conditions exist - for ATA/ATAPI devices and describe how they should be handled in - implementation-neutral way. - </para> - - <para> - The term 'error' is used to describe conditions where either an - explicit error condition is reported from device or a command has - timed out. - </para> - - <para> - The term 'exception' is either used to describe exceptional - conditions which are not errors (say, power or hotplug events), or - to describe both errors and non-error exceptional conditions. Where - explicit distinction between error and exception is necessary, the - term 'non-error exception' is used. - </para> - - <sect1 id="excat"> - <title>Exception categories</title> - <para> - Exceptions are described primarily with respect to legacy - taskfile + bus master IDE interface. If a controller provides - other better mechanism for error reporting, mapping those into - categories described below shouldn't be difficult. - </para> - - <para> - In the following sections, two recovery actions - reset and - reconfiguring transport - are mentioned. These are described - further in <xref linkend="exrec"/>. - </para> - - <sect2 id="excatHSMviolation"> - <title>HSM violation</title> - <para> - This error is indicated when STATUS value doesn't match HSM - requirement during issuing or execution any ATA/ATAPI command. - </para> - - <itemizedlist> - <title>Examples</title> - - <listitem> - <para> - ATA_STATUS doesn't contain !BSY && DRDY && !DRQ while trying - to issue a command. - </para> - </listitem> - - <listitem> - <para> - !BSY && !DRQ during PIO data transfer. - </para> - </listitem> - - <listitem> - <para> - DRQ on command completion. - </para> - </listitem> - - <listitem> - <para> - !BSY && ERR after CDB transfer starts but before the - last byte of CDB is transferred. ATA/ATAPI standard states - that "The device shall not terminate the PACKET command - with an error before the last byte of the command packet has - been written" in the error outputs description of PACKET - command and the state diagram doesn't include such - transitions. - </para> - </listitem> - - </itemizedlist> - - <para> - In these cases, HSM is violated and not much information - regarding the error can be acquired from STATUS or ERROR - register. IOW, this error can be anything - driver bug, - faulty device, controller and/or cable. - </para> - - <para> - As HSM is violated, reset is necessary to restore known state. - Reconfiguring transport for lower speed might be helpful too - as transmission errors sometimes cause this kind of errors. - </para> - </sect2> - - <sect2 id="excatDevErr"> - <title>ATA/ATAPI device error (non-NCQ / non-CHECK CONDITION)</title> - - <para> - These are errors detected and reported by ATA/ATAPI devices - indicating device problems. For this type of errors, STATUS - and ERROR register values are valid and describe error - condition. Note that some of ATA bus errors are detected by - ATA/ATAPI devices and reported using the same mechanism as - device errors. Those cases are described later in this - section. - </para> - - <para> - For ATA commands, this type of errors are indicated by !BSY - && ERR during command execution and on completion. - </para> - - <para>For ATAPI commands,</para> - - <itemizedlist> - - <listitem> - <para> - !BSY && ERR && ABRT right after issuing PACKET - indicates that PACKET command is not supported and falls in - this category. - </para> - </listitem> - - <listitem> - <para> - !BSY && ERR(==CHK) && !ABRT after the last - byte of CDB is transferred indicates CHECK CONDITION and - doesn't fall in this category. - </para> - </listitem> - - <listitem> - <para> - !BSY && ERR(==CHK) && ABRT after the last byte - of CDB is transferred *probably* indicates CHECK CONDITION and - doesn't fall in this category. - </para> - </listitem> - - </itemizedlist> - - <para> - Of errors detected as above, the following are not ATA/ATAPI - device errors but ATA bus errors and should be handled - according to <xref linkend="excatATAbusErr"/>. - </para> - - <variablelist> - - <varlistentry> - <term>CRC error during data transfer</term> - <listitem> - <para> - This is indicated by ICRC bit in the ERROR register and - means that corruption occurred during data transfer. Up to - ATA/ATAPI-7, the standard specifies that this bit is only - applicable to UDMA transfers but ATA/ATAPI-8 draft revision - 1f says that the bit may be applicable to multiword DMA and - PIO. - </para> - </listitem> - </varlistentry> - - <varlistentry> - <term>ABRT error during data transfer or on completion</term> - <listitem> - <para> - Up to ATA/ATAPI-7, the standard specifies that ABRT could be - set on ICRC errors and on cases where a device is not able - to complete a command. Combined with the fact that MWDMA - and PIO transfer errors aren't allowed to use ICRC bit up to - ATA/ATAPI-7, it seems to imply that ABRT bit alone could - indicate transfer errors. - </para> - <para> - However, ATA/ATAPI-8 draft revision 1f removes the part - that ICRC errors can turn on ABRT. So, this is kind of - gray area. Some heuristics are needed here. - </para> - </listitem> - </varlistentry> - - </variablelist> - - <para> - ATA/ATAPI device errors can be further categorized as follows. - </para> - - <variablelist> - - <varlistentry> - <term>Media errors</term> - <listitem> - <para> - This is indicated by UNC bit in the ERROR register. ATA - devices reports UNC error only after certain number of - retries cannot recover the data, so there's nothing much - else to do other than notifying upper layer. - </para> - <para> - READ and WRITE commands report CHS or LBA of the first - failed sector but ATA/ATAPI standard specifies that the - amount of transferred data on error completion is - indeterminate, so we cannot assume that sectors preceding - the failed sector have been transferred and thus cannot - complete those sectors successfully as SCSI does. - </para> - </listitem> - </varlistentry> - - <varlistentry> - <term>Media changed / media change requested error</term> - <listitem> - <para> - <<TODO: fill here>> - </para> - </listitem> - </varlistentry> - - <varlistentry><term>Address error</term> - <listitem> - <para> - This is indicated by IDNF bit in the ERROR register. - Report to upper layer. - </para> - </listitem> - </varlistentry> - - <varlistentry><term>Other errors</term> - <listitem> - <para> - This can be invalid command or parameter indicated by ABRT - ERROR bit or some other error condition. Note that ABRT - bit can indicate a lot of things including ICRC and Address - errors. Heuristics needed. - </para> - </listitem> - </varlistentry> - - </variablelist> - - <para> - Depending on commands, not all STATUS/ERROR bits are - applicable. These non-applicable bits are marked with - "na" in the output descriptions but up to ATA/ATAPI-7 - no definition of "na" can be found. However, - ATA/ATAPI-8 draft revision 1f describes "N/A" as - follows. - </para> - - <blockquote> - <variablelist> - <varlistentry><term>3.2.3.3a N/A</term> - <listitem> - <para> - A keyword the indicates a field has no defined value in - this standard and should not be checked by the host or - device. N/A fields should be cleared to zero. - </para> - </listitem> - </varlistentry> - </variablelist> - </blockquote> - - <para> - So, it seems reasonable to assume that "na" bits are - cleared to zero by devices and thus need no explicit masking. - </para> - - </sect2> - - <sect2 id="excatATAPIcc"> - <title>ATAPI device CHECK CONDITION</title> - - <para> - ATAPI device CHECK CONDITION error is indicated by set CHK bit - (ERR bit) in the STATUS register after the last byte of CDB is - transferred for a PACKET command. For this kind of errors, - sense data should be acquired to gather information regarding - the errors. REQUEST SENSE packet command should be used to - acquire sense data. - </para> - - <para> - Once sense data is acquired, this type of errors can be - handled similarly to other SCSI errors. Note that sense data - may indicate ATA bus error (e.g. Sense Key 04h HARDWARE ERROR - && ASC/ASCQ 47h/00h SCSI PARITY ERROR). In such - cases, the error should be considered as an ATA bus error and - handled according to <xref linkend="excatATAbusErr"/>. - </para> - - </sect2> - - <sect2 id="excatNCQerr"> - <title>ATA device error (NCQ)</title> - - <para> - NCQ command error is indicated by cleared BSY and set ERR bit - during NCQ command phase (one or more NCQ commands - outstanding). Although STATUS and ERROR registers will - contain valid values describing the error, READ LOG EXT is - required to clear the error condition, determine which command - has failed and acquire more information. - </para> - - <para> - READ LOG EXT Log Page 10h reports which tag has failed and - taskfile register values describing the error. With this - information the failed command can be handled as a normal ATA - command error as in <xref linkend="excatDevErr"/> and all - other in-flight commands must be retried. Note that this - retry should not be counted - it's likely that commands - retried this way would have completed normally if it were not - for the failed command. - </para> - - <para> - Note that ATA bus errors can be reported as ATA device NCQ - errors. This should be handled as described in <xref - linkend="excatATAbusErr"/>. - </para> - - <para> - If READ LOG EXT Log Page 10h fails or reports NQ, we're - thoroughly screwed. This condition should be treated - according to <xref linkend="excatHSMviolation"/>. - </para> - - </sect2> - - <sect2 id="excatATAbusErr"> - <title>ATA bus error</title> - - <para> - ATA bus error means that data corruption occurred during - transmission over ATA bus (SATA or PATA). This type of errors - can be indicated by - </para> - - <itemizedlist> - - <listitem> - <para> - ICRC or ABRT error as described in <xref linkend="excatDevErr"/>. - </para> - </listitem> - - <listitem> - <para> - Controller-specific error completion with error information - indicating transmission error. - </para> - </listitem> - - <listitem> - <para> - On some controllers, command timeout. In this case, there may - be a mechanism to determine that the timeout is due to - transmission error. - </para> - </listitem> - - <listitem> - <para> - Unknown/random errors, timeouts and all sorts of weirdities. - </para> - </listitem> - - </itemizedlist> - - <para> - As described above, transmission errors can cause wide variety - of symptoms ranging from device ICRC error to random device - lockup, and, for many cases, there is no way to tell if an - error condition is due to transmission error or not; - therefore, it's necessary to employ some kind of heuristic - when dealing with errors and timeouts. For example, - encountering repetitive ABRT errors for known supported - command is likely to indicate ATA bus error. - </para> - - <para> - Once it's determined that ATA bus errors have possibly - occurred, lowering ATA bus transmission speed is one of - actions which may alleviate the problem. See <xref - linkend="exrecReconf"/> for more information. - </para> - - </sect2> - - <sect2 id="excatPCIbusErr"> - <title>PCI bus error</title> - - <para> - Data corruption or other failures during transmission over PCI - (or other system bus). For standard BMDMA, this is indicated - by Error bit in the BMDMA Status register. This type of - errors must be logged as it indicates something is very wrong - with the system. Resetting host controller is recommended. - </para> - - </sect2> - - <sect2 id="excatLateCompletion"> - <title>Late completion</title> - - <para> - This occurs when timeout occurs and the timeout handler finds - out that the timed out command has completed successfully or - with error. This is usually caused by lost interrupts. This - type of errors must be logged. Resetting host controller is - recommended. - </para> - - </sect2> - - <sect2 id="excatUnknown"> - <title>Unknown error (timeout)</title> - - <para> - This is when timeout occurs and the command is still - processing or the host and device are in unknown state. When - this occurs, HSM could be in any valid or invalid state. To - bring the device to known state and make it forget about the - timed out command, resetting is necessary. The timed out - command may be retried. - </para> - - <para> - Timeouts can also be caused by transmission errors. Refer to - <xref linkend="excatATAbusErr"/> for more details. - </para> - - </sect2> - - <sect2 id="excatHoplugPM"> - <title>Hotplug and power management exceptions</title> - - <para> - <<TODO: fill here>> - </para> - - </sect2> - - </sect1> - - <sect1 id="exrec"> - <title>EH recovery actions</title> - - <para> - This section discusses several important recovery actions. - </para> - - <sect2 id="exrecClr"> - <title>Clearing error condition</title> - - <para> - Many controllers require its error registers to be cleared by - error handler. Different controllers may have different - requirements. - </para> - - <para> - For SATA, it's strongly recommended to clear at least SError - register during error handling. - </para> - </sect2> - - <sect2 id="exrecRst"> - <title>Reset</title> - - <para> - During EH, resetting is necessary in the following cases. - </para> - - <itemizedlist> - - <listitem> - <para> - HSM is in unknown or invalid state - </para> - </listitem> - - <listitem> - <para> - HBA is in unknown or invalid state - </para> - </listitem> - - <listitem> - <para> - EH needs to make HBA/device forget about in-flight commands - </para> - </listitem> - - <listitem> - <para> - HBA/device behaves weirdly - </para> - </listitem> - - </itemizedlist> - - <para> - Resetting during EH might be a good idea regardless of error - condition to improve EH robustness. Whether to reset both or - either one of HBA and device depends on situation but the - following scheme is recommended. - </para> - - <itemizedlist> - - <listitem> - <para> - When it's known that HBA is in ready state but ATA/ATAPI - device is in unknown state, reset only device. - </para> - </listitem> - - <listitem> - <para> - If HBA is in unknown state, reset both HBA and device. - </para> - </listitem> - - </itemizedlist> - - <para> - HBA resetting is implementation specific. For a controller - complying to taskfile/BMDMA PCI IDE, stopping active DMA - transaction may be sufficient iff BMDMA state is the only HBA - context. But even mostly taskfile/BMDMA PCI IDE complying - controllers may have implementation specific requirements and - mechanism to reset themselves. This must be addressed by - specific drivers. - </para> - - <para> - OTOH, ATA/ATAPI standard describes in detail ways to reset - ATA/ATAPI devices. - </para> - - <variablelist> - - <varlistentry><term>PATA hardware reset</term> - <listitem> - <para> - This is hardware initiated device reset signalled with - asserted PATA RESET- signal. There is no standard way to - initiate hardware reset from software although some - hardware provides registers that allow driver to directly - tweak the RESET- signal. - </para> - </listitem> - </varlistentry> - - <varlistentry><term>Software reset</term> - <listitem> - <para> - This is achieved by turning CONTROL SRST bit on for at - least 5us. Both PATA and SATA support it but, in case of - SATA, this may require controller-specific support as the - second Register FIS to clear SRST should be transmitted - while BSY bit is still set. Note that on PATA, this resets - both master and slave devices on a channel. - </para> - </listitem> - </varlistentry> - - <varlistentry><term>EXECUTE DEVICE DIAGNOSTIC command</term> - <listitem> - <para> - Although ATA/ATAPI standard doesn't describe exactly, EDD - implies some level of resetting, possibly similar level - with software reset. Host-side EDD protocol can be handled - with normal command processing and most SATA controllers - should be able to handle EDD's just like other commands. - As in software reset, EDD affects both devices on a PATA - bus. - </para> - <para> - Although EDD does reset devices, this doesn't suit error - handling as EDD cannot be issued while BSY is set and it's - unclear how it will act when device is in unknown/weird - state. - </para> - </listitem> - </varlistentry> - - <varlistentry><term>ATAPI DEVICE RESET command</term> - <listitem> - <para> - This is very similar to software reset except that reset - can be restricted to the selected device without affecting - the other device sharing the cable. - </para> - </listitem> - </varlistentry> - - <varlistentry><term>SATA phy reset</term> - <listitem> - <para> - This is the preferred way of resetting a SATA device. In - effect, it's identical to PATA hardware reset. Note that - this can be done with the standard SCR Control register. - As such, it's usually easier to implement than software - reset. - </para> - </listitem> - </varlistentry> - - </variablelist> - - <para> - One more thing to consider when resetting devices is that - resetting clears certain configuration parameters and they - need to be set to their previous or newly adjusted values - after reset. - </para> - - <para> - Parameters affected are. - </para> - - <itemizedlist> - - <listitem> - <para> - CHS set up with INITIALIZE DEVICE PARAMETERS (seldom used) - </para> - </listitem> - - <listitem> - <para> - Parameters set with SET FEATURES including transfer mode setting - </para> - </listitem> - - <listitem> - <para> - Block count set with SET MULTIPLE MODE - </para> - </listitem> - - <listitem> - <para> - Other parameters (SET MAX, MEDIA LOCK...) - </para> - </listitem> - - </itemizedlist> - - <para> - ATA/ATAPI standard specifies that some parameters must be - maintained across hardware or software reset, but doesn't - strictly specify all of them. Always reconfiguring needed - parameters after reset is required for robustness. Note that - this also applies when resuming from deep sleep (power-off). - </para> - - <para> - Also, ATA/ATAPI standard requires that IDENTIFY DEVICE / - IDENTIFY PACKET DEVICE is issued after any configuration - parameter is updated or a hardware reset and the result used - for further operation. OS driver is required to implement - revalidation mechanism to support this. - </para> - - </sect2> - - <sect2 id="exrecReconf"> - <title>Reconfigure transport</title> - - <para> - For both PATA and SATA, a lot of corners are cut for cheap - connectors, cables or controllers and it's quite common to see - high transmission error rate. This can be mitigated by - lowering transmission speed. - </para> - - <para> - The following is a possible scheme Jeff Garzik suggested. - </para> - - <blockquote> - <para> - If more than $N (3?) transmission errors happen in 15 minutes, - </para> - <itemizedlist> - <listitem> - <para> - if SATA, decrease SATA PHY speed. if speed cannot be decreased, - </para> - </listitem> - <listitem> - <para> - decrease UDMA xfer speed. if at UDMA0, switch to PIO4, - </para> - </listitem> - <listitem> - <para> - decrease PIO xfer speed. if at PIO3, complain, but continue - </para> - </listitem> - </itemizedlist> - </blockquote> - - </sect2> - - </sect1> - - </chapter> - - <chapter id="PiixInt"> - <title>ata_piix Internals</title> -!Idrivers/ata/ata_piix.c - </chapter> - - <chapter id="SILInt"> - <title>sata_sil Internals</title> -!Idrivers/ata/sata_sil.c - </chapter> - - <chapter id="libataThanks"> - <title>Thanks</title> - <para> - The bulk of the ATA knowledge comes thanks to long conversations with - Andre Hedrick (www.linux-ide.org), and long hours pondering the ATA - and SCSI specifications. - </para> - <para> - Thanks to Alan Cox for pointing out similarities - between SATA and SCSI, and in general for motivation to hack on - libata. - </para> - <para> - libata's device detection - method, ata_pio_devchk, and in general all the early probing was - based on extensive study of Hale Landis's probe/reset code in his - ATADRVR driver (www.ata-atapi.com). - </para> - </chapter> - -</book> diff --git a/Documentation/DocBook/librs.tmpl b/Documentation/DocBook/librs.tmpl deleted file mode 100644 index 94f21361e0ed..000000000000 --- a/Documentation/DocBook/librs.tmpl +++ /dev/null @@ -1,289 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" - "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> - -<book id="Reed-Solomon-Library-Guide"> - <bookinfo> - <title>Reed-Solomon Library Programming Interface</title> - - <authorgroup> - <author> - <firstname>Thomas</firstname> - <surname>Gleixner</surname> - <affiliation> - <address> - <email>tglx@linutronix.de</email> - </address> - </affiliation> - </author> - </authorgroup> - - <copyright> - <year>2004</year> - <holder>Thomas Gleixner</holder> - </copyright> - - <legalnotice> - <para> - This documentation is free software; you can redistribute - it and/or modify it under the terms of the GNU General Public - License version 2 as published by the Free Software Foundation. - </para> - - <para> - This program is distributed in the hope that it will be - useful, but WITHOUT ANY WARRANTY; without even the implied - warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. - See the GNU General Public License for more details. - </para> - - <para> - You should have received a copy of the GNU General Public - License along with this program; if not, write to the Free - Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, - MA 02111-1307 USA - </para> - - <para> - For more details see the file COPYING in the source - distribution of Linux. - </para> - </legalnotice> - </bookinfo> - -<toc></toc> - - <chapter id="intro"> - <title>Introduction</title> - <para> - The generic Reed-Solomon Library provides encoding, decoding - and error correction functions. - </para> - <para> - Reed-Solomon codes are used in communication and storage - applications to ensure data integrity. - </para> - <para> - This documentation is provided for developers who want to utilize - the functions provided by the library. - </para> - </chapter> - - <chapter id="bugs"> - <title>Known Bugs And Assumptions</title> - <para> - None. - </para> - </chapter> - - <chapter id="usage"> - <title>Usage</title> - <para> - This chapter provides examples of how to use the library. - </para> - <sect1> - <title>Initializing</title> - <para> - The init function init_rs returns a pointer to an - rs decoder structure, which holds the necessary - information for encoding, decoding and error correction - with the given polynomial. It either uses an existing - matching decoder or creates a new one. On creation all - the lookup tables for fast en/decoding are created. - The function may take a while, so make sure not to - call it in critical code paths. - </para> - <programlisting> -/* the Reed Solomon control structure */ -static struct rs_control *rs_decoder; - -/* Symbolsize is 10 (bits) - * Primitive polynomial is x^10+x^3+1 - * first consecutive root is 0 - * primitive element to generate roots = 1 - * generator polynomial degree (number of roots) = 6 - */ -rs_decoder = init_rs (10, 0x409, 0, 1, 6); - </programlisting> - </sect1> - <sect1> - <title>Encoding</title> - <para> - The encoder calculates the Reed-Solomon code over - the given data length and stores the result in - the parity buffer. Note that the parity buffer must - be initialized before calling the encoder. - </para> - <para> - The expanded data can be inverted on the fly by - providing a non-zero inversion mask. The expanded data is - XOR'ed with the mask. This is used e.g. for FLASH - ECC, where the all 0xFF is inverted to an all 0x00. - The Reed-Solomon code for all 0x00 is all 0x00. The - code is inverted before storing to FLASH so it is 0xFF - too. This prevents that reading from an erased FLASH - results in ECC errors. - </para> - <para> - The databytes are expanded to the given symbol size - on the fly. There is no support for encoding continuous - bitstreams with a symbol size != 8 at the moment. If - it is necessary it should be not a big deal to implement - such functionality. - </para> - <programlisting> -/* Parity buffer. Size = number of roots */ -uint16_t par[6]; -/* Initialize the parity buffer */ -memset(par, 0, sizeof(par)); -/* Encode 512 byte in data8. Store parity in buffer par */ -encode_rs8 (rs_decoder, data8, 512, par, 0); - </programlisting> - </sect1> - <sect1> - <title>Decoding</title> - <para> - The decoder calculates the syndrome over - the given data length and the received parity symbols - and corrects errors in the data. - </para> - <para> - If a syndrome is available from a hardware decoder - then the syndrome calculation is skipped. - </para> - <para> - The correction of the data buffer can be suppressed - by providing a correction pattern buffer and an error - location buffer to the decoder. The decoder stores the - calculated error location and the correction bitmask - in the given buffers. This is useful for hardware - decoders which use a weird bit ordering scheme. - </para> - <para> - The databytes are expanded to the given symbol size - on the fly. There is no support for decoding continuous - bitstreams with a symbolsize != 8 at the moment. If - it is necessary it should be not a big deal to implement - such functionality. - </para> - - <sect2> - <title> - Decoding with syndrome calculation, direct data correction - </title> - <programlisting> -/* Parity buffer. Size = number of roots */ -uint16_t par[6]; -uint8_t data[512]; -int numerr; -/* Receive data */ -..... -/* Receive parity */ -..... -/* Decode 512 byte in data8.*/ -numerr = decode_rs8 (rs_decoder, data8, par, 512, NULL, 0, NULL, 0, NULL); - </programlisting> - </sect2> - - <sect2> - <title> - Decoding with syndrome given by hardware decoder, direct data correction - </title> - <programlisting> -/* Parity buffer. Size = number of roots */ -uint16_t par[6], syn[6]; -uint8_t data[512]; -int numerr; -/* Receive data */ -..... -/* Receive parity */ -..... -/* Get syndrome from hardware decoder */ -..... -/* Decode 512 byte in data8.*/ -numerr = decode_rs8 (rs_decoder, data8, par, 512, syn, 0, NULL, 0, NULL); - </programlisting> - </sect2> - - <sect2> - <title> - Decoding with syndrome given by hardware decoder, no direct data correction. - </title> - <para> - Note: It's not necessary to give data and received parity to the decoder. - </para> - <programlisting> -/* Parity buffer. Size = number of roots */ -uint16_t par[6], syn[6], corr[8]; -uint8_t data[512]; -int numerr, errpos[8]; -/* Receive data */ -..... -/* Receive parity */ -..... -/* Get syndrome from hardware decoder */ -..... -/* Decode 512 byte in data8.*/ -numerr = decode_rs8 (rs_decoder, NULL, NULL, 512, syn, 0, errpos, 0, corr); -for (i = 0; i < numerr; i++) { - do_error_correction_in_your_buffer(errpos[i], corr[i]); -} - </programlisting> - </sect2> - </sect1> - <sect1> - <title>Cleanup</title> - <para> - The function free_rs frees the allocated resources, - if the caller is the last user of the decoder. - </para> - <programlisting> -/* Release resources */ -free_rs(rs_decoder); - </programlisting> - </sect1> - - </chapter> - - <chapter id="structs"> - <title>Structures</title> - <para> - This chapter contains the autogenerated documentation of the structures which are - used in the Reed-Solomon Library and are relevant for a developer. - </para> -!Iinclude/linux/rslib.h - </chapter> - - <chapter id="pubfunctions"> - <title>Public Functions Provided</title> - <para> - This chapter contains the autogenerated documentation of the Reed-Solomon functions - which are exported. - </para> -!Elib/reed_solomon/reed_solomon.c - </chapter> - - <chapter id="credits"> - <title>Credits</title> - <para> - The library code for encoding and decoding was written by Phil Karn. - </para> - <programlisting> - Copyright 2002, Phil Karn, KA9Q - May be used under the terms of the GNU General Public License (GPL) - </programlisting> - <para> - The wrapper functions and interfaces are written by Thomas Gleixner. - </para> - <para> - Many users have provided bugfixes, improvements and helping hands for testing. - Thanks a lot. - </para> - <para> - The following people have contributed to this document: - </para> - <para> - Thomas Gleixner<email>tglx@linutronix.de</email> - </para> - </chapter> -</book> diff --git a/Documentation/DocBook/lsm.tmpl b/Documentation/DocBook/lsm.tmpl deleted file mode 100644 index fe7664ce9667..000000000000 --- a/Documentation/DocBook/lsm.tmpl +++ /dev/null @@ -1,265 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" - "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> - -<article class="whitepaper" id="LinuxSecurityModule" lang="en"> - <articleinfo> - <title>Linux Security Modules: General Security Hooks for Linux</title> - <authorgroup> - <author> - <firstname>Stephen</firstname> - <surname>Smalley</surname> - <affiliation> - <orgname>NAI Labs</orgname> - <address><email>ssmalley@nai.com</email></address> - </affiliation> - </author> - <author> - <firstname>Timothy</firstname> - <surname>Fraser</surname> - <affiliation> - <orgname>NAI Labs</orgname> - <address><email>tfraser@nai.com</email></address> - </affiliation> - </author> - <author> - <firstname>Chris</firstname> - <surname>Vance</surname> - <affiliation> - <orgname>NAI Labs</orgname> - <address><email>cvance@nai.com</email></address> - </affiliation> - </author> - </authorgroup> - </articleinfo> - -<sect1 id="Introduction"><title>Introduction</title> - -<para> -In March 2001, the National Security Agency (NSA) gave a presentation -about Security-Enhanced Linux (SELinux) at the 2.5 Linux Kernel -Summit. SELinux is an implementation of flexible and fine-grained -nondiscretionary access controls in the Linux kernel, originally -implemented as its own particular kernel patch. Several other -security projects (e.g. RSBAC, Medusa) have also developed flexible -access control architectures for the Linux kernel, and various -projects have developed particular access control models for Linux -(e.g. LIDS, DTE, SubDomain). Each project has developed and -maintained its own kernel patch to support its security needs. -</para> - -<para> -In response to the NSA presentation, Linus Torvalds made a set of -remarks that described a security framework he would be willing to -consider for inclusion in the mainstream Linux kernel. He described a -general framework that would provide a set of security hooks to -control operations on kernel objects and a set of opaque security -fields in kernel data structures for maintaining security attributes. -This framework could then be used by loadable kernel modules to -implement any desired model of security. Linus also suggested the -possibility of migrating the Linux capabilities code into such a -module. -</para> - -<para> -The Linux Security Modules (LSM) project was started by WireX to -develop such a framework. LSM is a joint development effort by -several security projects, including Immunix, SELinux, SGI and Janus, -and several individuals, including Greg Kroah-Hartman and James -Morris, to develop a Linux kernel patch that implements this -framework. The patch is currently tracking the 2.4 series and is -targeted for integration into the 2.5 development series. This -technical report provides an overview of the framework and the example -capabilities security module provided by the LSM kernel patch. -</para> - -</sect1> - -<sect1 id="framework"><title>LSM Framework</title> - -<para> -The LSM kernel patch provides a general kernel framework to support -security modules. In particular, the LSM framework is primarily -focused on supporting access control modules, although future -development is likely to address other security needs such as -auditing. By itself, the framework does not provide any additional -security; it merely provides the infrastructure to support security -modules. The LSM kernel patch also moves most of the capabilities -logic into an optional security module, with the system defaulting -to the traditional superuser logic. This capabilities module -is discussed further in <xref linkend="cap"/>. -</para> - -<para> -The LSM kernel patch adds security fields to kernel data structures -and inserts calls to hook functions at critical points in the kernel -code to manage the security fields and to perform access control. It -also adds functions for registering and unregistering security -modules, and adds a general <function>security</function> system call -to support new system calls for security-aware applications. -</para> - -<para> -The LSM security fields are simply <type>void*</type> pointers. For -process and program execution security information, security fields -were added to <structname>struct task_struct</structname> and -<structname>struct linux_binprm</structname>. For filesystem security -information, a security field was added to -<structname>struct super_block</structname>. For pipe, file, and socket -security information, security fields were added to -<structname>struct inode</structname> and -<structname>struct file</structname>. For packet and network device security -information, security fields were added to -<structname>struct sk_buff</structname> and -<structname>struct net_device</structname>. For System V IPC security -information, security fields were added to -<structname>struct kern_ipc_perm</structname> and -<structname>struct msg_msg</structname>; additionally, the definitions -for <structname>struct msg_msg</structname>, <structname>struct -msg_queue</structname>, and <structname>struct -shmid_kernel</structname> were moved to header files -(<filename>include/linux/msg.h</filename> and -<filename>include/linux/shm.h</filename> as appropriate) to allow -the security modules to use these definitions. -</para> - -<para> -Each LSM hook is a function pointer in a global table, -security_ops. This table is a -<structname>security_operations</structname> structure as defined by -<filename>include/linux/security.h</filename>. Detailed documentation -for each hook is included in this header file. At present, this -structure consists of a collection of substructures that group related -hooks based on the kernel object (e.g. task, inode, file, sk_buff, -etc) as well as some top-level hook function pointers for system -operations. This structure is likely to be flattened in the future -for performance. The placement of the hook calls in the kernel code -is described by the "called:" lines in the per-hook documentation in -the header file. The hook calls can also be easily found in the -kernel code by looking for the string "security_ops->". - -</para> - -<para> -Linus mentioned per-process security hooks in his original remarks as a -possible alternative to global security hooks. However, if LSM were -to start from the perspective of per-process hooks, then the base -framework would have to deal with how to handle operations that -involve multiple processes (e.g. kill), since each process might have -its own hook for controlling the operation. This would require a -general mechanism for composing hooks in the base framework. -Additionally, LSM would still need global hooks for operations that -have no process context (e.g. network input operations). -Consequently, LSM provides global security hooks, but a security -module is free to implement per-process hooks (where that makes sense) -by storing a security_ops table in each process' security field and -then invoking these per-process hooks from the global hooks. -The problem of composition is thus deferred to the module. -</para> - -<para> -The global security_ops table is initialized to a set of hook -functions provided by a dummy security module that provides -traditional superuser logic. A <function>register_security</function> -function (in <filename>security/security.c</filename>) is provided to -allow a security module to set security_ops to refer to its own hook -functions, and an <function>unregister_security</function> function is -provided to revert security_ops to the dummy module hooks. This -mechanism is used to set the primary security module, which is -responsible for making the final decision for each hook. -</para> - -<para> -LSM also provides a simple mechanism for stacking additional security -modules with the primary security module. It defines -<function>register_security</function> and -<function>unregister_security</function> hooks in the -<structname>security_operations</structname> structure and provides -<function>mod_reg_security</function> and -<function>mod_unreg_security</function> functions that invoke these -hooks after performing some sanity checking. A security module can -call these functions in order to stack with other modules. However, -the actual details of how this stacking is handled are deferred to the -module, which can implement these hooks in any way it wishes -(including always returning an error if it does not wish to support -stacking). In this manner, LSM again defers the problem of -composition to the module. -</para> - -<para> -Although the LSM hooks are organized into substructures based on -kernel object, all of the hooks can be viewed as falling into two -major categories: hooks that are used to manage the security fields -and hooks that are used to perform access control. Examples of the -first category of hooks include the -<function>alloc_security</function> and -<function>free_security</function> hooks defined for each kernel data -structure that has a security field. These hooks are used to allocate -and free security structures for kernel objects. The first category -of hooks also includes hooks that set information in the security -field after allocation, such as the <function>post_lookup</function> -hook in <structname>struct inode_security_ops</structname>. This hook -is used to set security information for inodes after successful lookup -operations. An example of the second category of hooks is the -<function>permission</function> hook in -<structname>struct inode_security_ops</structname>. This hook checks -permission when accessing an inode. -</para> - -</sect1> - -<sect1 id="cap"><title>LSM Capabilities Module</title> - -<para> -The LSM kernel patch moves most of the existing POSIX.1e capabilities -logic into an optional security module stored in the file -<filename>security/capability.c</filename>. This change allows -users who do not want to use capabilities to omit this code entirely -from their kernel, instead using the dummy module for traditional -superuser logic or any other module that they desire. This change -also allows the developers of the capabilities logic to maintain and -enhance their code more freely, without needing to integrate patches -back into the base kernel. -</para> - -<para> -In addition to moving the capabilities logic, the LSM kernel patch -could move the capability-related fields from the kernel data -structures into the new security fields managed by the security -modules. However, at present, the LSM kernel patch leaves the -capability fields in the kernel data structures. In his original -remarks, Linus suggested that this might be preferable so that other -security modules can be easily stacked with the capabilities module -without needing to chain multiple security structures on the security field. -It also avoids imposing extra overhead on the capabilities module -to manage the security fields. However, the LSM framework could -certainly support such a move if it is determined to be desirable, -with only a few additional changes described below. -</para> - -<para> -At present, the capabilities logic for computing process capabilities -on <function>execve</function> and <function>set*uid</function>, -checking capabilities for a particular process, saving and checking -capabilities for netlink messages, and handling the -<function>capget</function> and <function>capset</function> system -calls have been moved into the capabilities module. There are still a -few locations in the base kernel where capability-related fields are -directly examined or modified, but the current version of the LSM -patch does allow a security module to completely replace the -assignment and testing of capabilities. These few locations would -need to be changed if the capability-related fields were moved into -the security field. The following is a list of known locations that -still perform such direct examination or modification of -capability-related fields: -<itemizedlist> -<listitem><para><filename>fs/open.c</filename>:<function>sys_access</function></para></listitem> -<listitem><para><filename>fs/lockd/host.c</filename>:<function>nlm_bind_host</function></para></listitem> -<listitem><para><filename>fs/nfsd/auth.c</filename>:<function>nfsd_setuser</function></para></listitem> -<listitem><para><filename>fs/proc/array.c</filename>:<function>task_cap</function></para></listitem> -</itemizedlist> -</para> - -</sect1> - -</article> diff --git a/Documentation/DocBook/mtdnand.tmpl b/Documentation/DocBook/mtdnand.tmpl deleted file mode 100644 index b442921bca54..000000000000 --- a/Documentation/DocBook/mtdnand.tmpl +++ /dev/null @@ -1,1291 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" - "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> - -<book id="MTD-NAND-Guide"> - <bookinfo> - <title>MTD NAND Driver Programming Interface</title> - - <authorgroup> - <author> - <firstname>Thomas</firstname> - <surname>Gleixner</surname> - <affiliation> - <address> - <email>tglx@linutronix.de</email> - </address> - </affiliation> - </author> - </authorgroup> - - <copyright> - <year>2004</year> - <holder>Thomas Gleixner</holder> - </copyright> - - <legalnotice> - <para> - This documentation is free software; you can redistribute - it and/or modify it under the terms of the GNU General Public - License version 2 as published by the Free Software Foundation. - </para> - - <para> - This program is distributed in the hope that it will be - useful, but WITHOUT ANY WARRANTY; without even the implied - warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. - See the GNU General Public License for more details. - </para> - - <para> - You should have received a copy of the GNU General Public - License along with this program; if not, write to the Free - Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, - MA 02111-1307 USA - </para> - - <para> - For more details see the file COPYING in the source - distribution of Linux. - </para> - </legalnotice> - </bookinfo> - -<toc></toc> - - <chapter id="intro"> - <title>Introduction</title> - <para> - The generic NAND driver supports almost all NAND and AG-AND based - chips and connects them to the Memory Technology Devices (MTD) - subsystem of the Linux Kernel. - </para> - <para> - This documentation is provided for developers who want to implement - board drivers or filesystem drivers suitable for NAND devices. - </para> - </chapter> - - <chapter id="bugs"> - <title>Known Bugs And Assumptions</title> - <para> - None. - </para> - </chapter> - - <chapter id="dochints"> - <title>Documentation hints</title> - <para> - The function and structure docs are autogenerated. Each function and - struct member has a short description which is marked with an [XXX] identifier. - The following chapters explain the meaning of those identifiers. - </para> - <sect1 id="Function_identifiers_XXX"> - <title>Function identifiers [XXX]</title> - <para> - The functions are marked with [XXX] identifiers in the short - comment. The identifiers explain the usage and scope of the - functions. Following identifiers are used: - </para> - <itemizedlist> - <listitem><para> - [MTD Interface]</para><para> - These functions provide the interface to the MTD kernel API. - They are not replaceable and provide functionality - which is complete hardware independent. - </para></listitem> - <listitem><para> - [NAND Interface]</para><para> - These functions are exported and provide the interface to the NAND kernel API. - </para></listitem> - <listitem><para> - [GENERIC]</para><para> - Generic functions are not replaceable and provide functionality - which is complete hardware independent. - </para></listitem> - <listitem><para> - [DEFAULT]</para><para> - Default functions provide hardware related functionality which is suitable - for most of the implementations. These functions can be replaced by the - board driver if necessary. Those functions are called via pointers in the - NAND chip description structure. The board driver can set the functions which - should be replaced by board dependent functions before calling nand_scan(). - If the function pointer is NULL on entry to nand_scan() then the pointer - is set to the default function which is suitable for the detected chip type. - </para></listitem> - </itemizedlist> - </sect1> - <sect1 id="Struct_member_identifiers_XXX"> - <title>Struct member identifiers [XXX]</title> - <para> - The struct members are marked with [XXX] identifiers in the - comment. The identifiers explain the usage and scope of the - members. Following identifiers are used: - </para> - <itemizedlist> - <listitem><para> - [INTERN]</para><para> - These members are for NAND driver internal use only and must not be - modified. Most of these values are calculated from the chip geometry - information which is evaluated during nand_scan(). - </para></listitem> - <listitem><para> - [REPLACEABLE]</para><para> - Replaceable members hold hardware related functions which can be - provided by the board driver. The board driver can set the functions which - should be replaced by board dependent functions before calling nand_scan(). - If the function pointer is NULL on entry to nand_scan() then the pointer - is set to the default function which is suitable for the detected chip type. - </para></listitem> - <listitem><para> - [BOARDSPECIFIC]</para><para> - Board specific members hold hardware related information which must - be provided by the board driver. The board driver must set the function - pointers and datafields before calling nand_scan(). - </para></listitem> - <listitem><para> - [OPTIONAL]</para><para> - Optional members can hold information relevant for the board driver. The - generic NAND driver code does not use this information. - </para></listitem> - </itemizedlist> - </sect1> - </chapter> - - <chapter id="basicboarddriver"> - <title>Basic board driver</title> - <para> - For most boards it will be sufficient to provide just the - basic functions and fill out some really board dependent - members in the nand chip description structure. - </para> - <sect1 id="Basic_defines"> - <title>Basic defines</title> - <para> - At least you have to provide a nand_chip structure - and a storage for the ioremap'ed chip address. - You can allocate the nand_chip structure using - kmalloc or you can allocate it statically. - The NAND chip structure embeds an mtd structure - which will be registered to the MTD subsystem. - You can extract a pointer to the mtd structure - from a nand_chip pointer using the nand_to_mtd() - helper. - </para> - <para> - Kmalloc based example - </para> - <programlisting> -static struct mtd_info *board_mtd; -static void __iomem *baseaddr; - </programlisting> - <para> - Static example - </para> - <programlisting> -static struct nand_chip board_chip; -static void __iomem *baseaddr; - </programlisting> - </sect1> - <sect1 id="Partition_defines"> - <title>Partition defines</title> - <para> - If you want to divide your device into partitions, then - define a partitioning scheme suitable to your board. - </para> - <programlisting> -#define NUM_PARTITIONS 2 -static struct mtd_partition partition_info[] = { - { .name = "Flash partition 1", - .offset = 0, - .size = 8 * 1024 * 1024 }, - { .name = "Flash partition 2", - .offset = MTDPART_OFS_NEXT, - .size = MTDPART_SIZ_FULL }, -}; - </programlisting> - </sect1> - <sect1 id="Hardware_control_functions"> - <title>Hardware control function</title> - <para> - The hardware control function provides access to the - control pins of the NAND chip(s). - The access can be done by GPIO pins or by address lines. - If you use address lines, make sure that the timing - requirements are met. - </para> - <para> - <emphasis>GPIO based example</emphasis> - </para> - <programlisting> -static void board_hwcontrol(struct mtd_info *mtd, int cmd) -{ - switch(cmd){ - case NAND_CTL_SETCLE: /* Set CLE pin high */ break; - case NAND_CTL_CLRCLE: /* Set CLE pin low */ break; - case NAND_CTL_SETALE: /* Set ALE pin high */ break; - case NAND_CTL_CLRALE: /* Set ALE pin low */ break; - case NAND_CTL_SETNCE: /* Set nCE pin low */ break; - case NAND_CTL_CLRNCE: /* Set nCE pin high */ break; - } -} - </programlisting> - <para> - <emphasis>Address lines based example.</emphasis> It's assumed that the - nCE pin is driven by a chip select decoder. - </para> - <programlisting> -static void board_hwcontrol(struct mtd_info *mtd, int cmd) -{ - struct nand_chip *this = mtd_to_nand(mtd); - switch(cmd){ - case NAND_CTL_SETCLE: this->IO_ADDR_W |= CLE_ADRR_BIT; break; - case NAND_CTL_CLRCLE: this->IO_ADDR_W &= ~CLE_ADRR_BIT; break; - case NAND_CTL_SETALE: this->IO_ADDR_W |= ALE_ADRR_BIT; break; - case NAND_CTL_CLRALE: this->IO_ADDR_W &= ~ALE_ADRR_BIT; break; - } -} - </programlisting> - </sect1> - <sect1 id="Device_ready_function"> - <title>Device ready function</title> - <para> - If the hardware interface has the ready busy pin of the NAND chip connected to a - GPIO or other accessible I/O pin, this function is used to read back the state of the - pin. The function has no arguments and should return 0, if the device is busy (R/B pin - is low) and 1, if the device is ready (R/B pin is high). - If the hardware interface does not give access to the ready busy pin, then - the function must not be defined and the function pointer this->dev_ready is set to NULL. - </para> - </sect1> - <sect1 id="Init_function"> - <title>Init function</title> - <para> - The init function allocates memory and sets up all the board - specific parameters and function pointers. When everything - is set up nand_scan() is called. This function tries to - detect and identify then chip. If a chip is found all the - internal data fields are initialized accordingly. - The structure(s) have to be zeroed out first and then filled with the necessary - information about the device. - </para> - <programlisting> -static int __init board_init (void) -{ - struct nand_chip *this; - int err = 0; - - /* Allocate memory for MTD device structure and private data */ - this = kzalloc(sizeof(struct nand_chip), GFP_KERNEL); - if (!this) { - printk ("Unable to allocate NAND MTD device structure.\n"); - err = -ENOMEM; - goto out; - } - - board_mtd = nand_to_mtd(this); - - /* map physical address */ - baseaddr = ioremap(CHIP_PHYSICAL_ADDRESS, 1024); - if (!baseaddr) { - printk("Ioremap to access NAND chip failed\n"); - err = -EIO; - goto out_mtd; - } - - /* Set address of NAND IO lines */ - this->IO_ADDR_R = baseaddr; - this->IO_ADDR_W = baseaddr; - /* Reference hardware control function */ - this->hwcontrol = board_hwcontrol; - /* Set command delay time, see datasheet for correct value */ - this->chip_delay = CHIP_DEPENDEND_COMMAND_DELAY; - /* Assign the device ready function, if available */ - this->dev_ready = board_dev_ready; - this->eccmode = NAND_ECC_SOFT; - - /* Scan to find existence of the device */ - if (nand_scan (board_mtd, 1)) { - err = -ENXIO; - goto out_ior; - } - - add_mtd_partitions(board_mtd, partition_info, NUM_PARTITIONS); - goto out; - -out_ior: - iounmap(baseaddr); -out_mtd: - kfree (this); -out: - return err; -} -module_init(board_init); - </programlisting> - </sect1> - <sect1 id="Exit_function"> - <title>Exit function</title> - <para> - The exit function is only necessary if the driver is - compiled as a module. It releases all resources which - are held by the chip driver and unregisters the partitions - in the MTD layer. - </para> - <programlisting> -#ifdef MODULE -static void __exit board_cleanup (void) -{ - /* Release resources, unregister device */ - nand_release (board_mtd); - - /* unmap physical address */ - iounmap(baseaddr); - - /* Free the MTD device structure */ - kfree (mtd_to_nand(board_mtd)); -} -module_exit(board_cleanup); -#endif - </programlisting> - </sect1> - </chapter> - - <chapter id="boarddriversadvanced"> - <title>Advanced board driver functions</title> - <para> - This chapter describes the advanced functionality of the NAND - driver. For a list of functions which can be overridden by the board - driver see the documentation of the nand_chip structure. - </para> - <sect1 id="Multiple_chip_control"> - <title>Multiple chip control</title> - <para> - The nand driver can control chip arrays. Therefore the - board driver must provide an own select_chip function. This - function must (de)select the requested chip. - The function pointer in the nand_chip structure must - be set before calling nand_scan(). The maxchip parameter - of nand_scan() defines the maximum number of chips to - scan for. Make sure that the select_chip function can - handle the requested number of chips. - </para> - <para> - The nand driver concatenates the chips to one virtual - chip and provides this virtual chip to the MTD layer. - </para> - <para> - <emphasis>Note: The driver can only handle linear chip arrays - of equally sized chips. There is no support for - parallel arrays which extend the buswidth.</emphasis> - </para> - <para> - <emphasis>GPIO based example</emphasis> - </para> - <programlisting> -static void board_select_chip (struct mtd_info *mtd, int chip) -{ - /* Deselect all chips, set all nCE pins high */ - GPIO(BOARD_NAND_NCE) |= 0xff; - if (chip >= 0) - GPIO(BOARD_NAND_NCE) &= ~ (1 << chip); -} - </programlisting> - <para> - <emphasis>Address lines based example.</emphasis> - Its assumed that the nCE pins are connected to an - address decoder. - </para> - <programlisting> -static void board_select_chip (struct mtd_info *mtd, int chip) -{ - struct nand_chip *this = mtd_to_nand(mtd); - - /* Deselect all chips */ - this->IO_ADDR_R &= ~BOARD_NAND_ADDR_MASK; - this->IO_ADDR_W &= ~BOARD_NAND_ADDR_MASK; - switch (chip) { - case 0: - this->IO_ADDR_R |= BOARD_NAND_ADDR_CHIP0; - this->IO_ADDR_W |= BOARD_NAND_ADDR_CHIP0; - break; - .... - case n: - this->IO_ADDR_R |= BOARD_NAND_ADDR_CHIPn; - this->IO_ADDR_W |= BOARD_NAND_ADDR_CHIPn; - break; - } -} - </programlisting> - </sect1> - <sect1 id="Hardware_ECC_support"> - <title>Hardware ECC support</title> - <sect2 id="Functions_and_constants"> - <title>Functions and constants</title> - <para> - The nand driver supports three different types of - hardware ECC. - <itemizedlist> - <listitem><para>NAND_ECC_HW3_256</para><para> - Hardware ECC generator providing 3 bytes ECC per - 256 byte. - </para> </listitem> - <listitem><para>NAND_ECC_HW3_512</para><para> - Hardware ECC generator providing 3 bytes ECC per - 512 byte. - </para> </listitem> - <listitem><para>NAND_ECC_HW6_512</para><para> - Hardware ECC generator providing 6 bytes ECC per - 512 byte. - </para> </listitem> - <listitem><para>NAND_ECC_HW8_512</para><para> - Hardware ECC generator providing 6 bytes ECC per - 512 byte. - </para> </listitem> - </itemizedlist> - If your hardware generator has a different functionality - add it at the appropriate place in nand_base.c - </para> - <para> - The board driver must provide following functions: - <itemizedlist> - <listitem><para>enable_hwecc</para><para> - This function is called before reading / writing to - the chip. Reset or initialize the hardware generator - in this function. The function is called with an - argument which let you distinguish between read - and write operations. - </para> </listitem> - <listitem><para>calculate_ecc</para><para> - This function is called after read / write from / to - the chip. Transfer the ECC from the hardware to - the buffer. If the option NAND_HWECC_SYNDROME is set - then the function is only called on write. See below. - </para> </listitem> - <listitem><para>correct_data</para><para> - In case of an ECC error this function is called for - error detection and correction. Return 1 respectively 2 - in case the error can be corrected. If the error is - not correctable return -1. If your hardware generator - matches the default algorithm of the nand_ecc software - generator then use the correction function provided - by nand_ecc instead of implementing duplicated code. - </para> </listitem> - </itemizedlist> - </para> - </sect2> - <sect2 id="Hardware_ECC_with_syndrome_calculation"> - <title>Hardware ECC with syndrome calculation</title> - <para> - Many hardware ECC implementations provide Reed-Solomon - codes and calculate an error syndrome on read. The syndrome - must be converted to a standard Reed-Solomon syndrome - before calling the error correction code in the generic - Reed-Solomon library. - </para> - <para> - The ECC bytes must be placed immediately after the data - bytes in order to make the syndrome generator work. This - is contrary to the usual layout used by software ECC. The - separation of data and out of band area is not longer - possible. The nand driver code handles this layout and - the remaining free bytes in the oob area are managed by - the autoplacement code. Provide a matching oob-layout - in this case. See rts_from4.c and diskonchip.c for - implementation reference. In those cases we must also - use bad block tables on FLASH, because the ECC layout is - interfering with the bad block marker positions. - See bad block table support for details. - </para> - </sect2> - </sect1> - <sect1 id="Bad_Block_table_support"> - <title>Bad block table support</title> - <para> - Most NAND chips mark the bad blocks at a defined - position in the spare area. Those blocks must - not be erased under any circumstances as the bad - block information would be lost. - It is possible to check the bad block mark each - time when the blocks are accessed by reading the - spare area of the first page in the block. This - is time consuming so a bad block table is used. - </para> - <para> - The nand driver supports various types of bad block - tables. - <itemizedlist> - <listitem><para>Per device</para><para> - The bad block table contains all bad block information - of the device which can consist of multiple chips. - </para> </listitem> - <listitem><para>Per chip</para><para> - A bad block table is used per chip and contains the - bad block information for this particular chip. - </para> </listitem> - <listitem><para>Fixed offset</para><para> - The bad block table is located at a fixed offset - in the chip (device). This applies to various - DiskOnChip devices. - </para> </listitem> - <listitem><para>Automatic placed</para><para> - The bad block table is automatically placed and - detected either at the end or at the beginning - of a chip (device) - </para> </listitem> - <listitem><para>Mirrored tables</para><para> - The bad block table is mirrored on the chip (device) to - allow updates of the bad block table without data loss. - </para> </listitem> - </itemizedlist> - </para> - <para> - nand_scan() calls the function nand_default_bbt(). - nand_default_bbt() selects appropriate default - bad block table descriptors depending on the chip information - which was retrieved by nand_scan(). - </para> - <para> - The standard policy is scanning the device for bad - blocks and build a ram based bad block table which - allows faster access than always checking the - bad block information on the flash chip itself. - </para> - <sect2 id="Flash_based_tables"> - <title>Flash based tables</title> - <para> - It may be desired or necessary to keep a bad block table in FLASH. - For AG-AND chips this is mandatory, as they have no factory marked - bad blocks. They have factory marked good blocks. The marker pattern - is erased when the block is erased to be reused. So in case of - powerloss before writing the pattern back to the chip this block - would be lost and added to the bad blocks. Therefore we scan the - chip(s) when we detect them the first time for good blocks and - store this information in a bad block table before erasing any - of the blocks. - </para> - <para> - The blocks in which the tables are stored are protected against - accidental access by marking them bad in the memory bad block - table. The bad block table management functions are allowed - to circumvent this protection. - </para> - <para> - The simplest way to activate the FLASH based bad block table support - is to set the option NAND_BBT_USE_FLASH in the bbt_option field of - the nand chip structure before calling nand_scan(). For AG-AND - chips is this done by default. - This activates the default FLASH based bad block table functionality - of the NAND driver. The default bad block table options are - <itemizedlist> - <listitem><para>Store bad block table per chip</para></listitem> - <listitem><para>Use 2 bits per block</para></listitem> - <listitem><para>Automatic placement at the end of the chip</para></listitem> - <listitem><para>Use mirrored tables with version numbers</para></listitem> - <listitem><para>Reserve 4 blocks at the end of the chip</para></listitem> - </itemizedlist> - </para> - </sect2> - <sect2 id="User_defined_tables"> - <title>User defined tables</title> - <para> - User defined tables are created by filling out a - nand_bbt_descr structure and storing the pointer in the - nand_chip structure member bbt_td before calling nand_scan(). - If a mirror table is necessary a second structure must be - created and a pointer to this structure must be stored - in bbt_md inside the nand_chip structure. If the bbt_md - member is set to NULL then only the main table is used - and no scan for the mirrored table is performed. - </para> - <para> - The most important field in the nand_bbt_descr structure - is the options field. The options define most of the - table properties. Use the predefined constants from - nand.h to define the options. - <itemizedlist> - <listitem><para>Number of bits per block</para> - <para>The supported number of bits is 1, 2, 4, 8.</para></listitem> - <listitem><para>Table per chip</para> - <para>Setting the constant NAND_BBT_PERCHIP selects that - a bad block table is managed for each chip in a chip array. - If this option is not set then a per device bad block table - is used.</para></listitem> - <listitem><para>Table location is absolute</para> - <para>Use the option constant NAND_BBT_ABSPAGE and - define the absolute page number where the bad block - table starts in the field pages. If you have selected bad block - tables per chip and you have a multi chip array then the start page - must be given for each chip in the chip array. Note: there is no scan - for a table ident pattern performed, so the fields - pattern, veroffs, offs, len can be left uninitialized</para></listitem> - <listitem><para>Table location is automatically detected</para> - <para>The table can either be located in the first or the last good - blocks of the chip (device). Set NAND_BBT_LASTBLOCK to place - the bad block table at the end of the chip (device). The - bad block tables are marked and identified by a pattern which - is stored in the spare area of the first page in the block which - holds the bad block table. Store a pointer to the pattern - in the pattern field. Further the length of the pattern has to be - stored in len and the offset in the spare area must be given - in the offs member of the nand_bbt_descr structure. For mirrored - bad block tables different patterns are mandatory.</para></listitem> - <listitem><para>Table creation</para> - <para>Set the option NAND_BBT_CREATE to enable the table creation - if no table can be found during the scan. Usually this is done only - once if a new chip is found. </para></listitem> - <listitem><para>Table write support</para> - <para>Set the option NAND_BBT_WRITE to enable the table write support. - This allows the update of the bad block table(s) in case a block has - to be marked bad due to wear. The MTD interface function block_markbad - is calling the update function of the bad block table. If the write - support is enabled then the table is updated on FLASH.</para> - <para> - Note: Write support should only be enabled for mirrored tables with - version control. - </para></listitem> - <listitem><para>Table version control</para> - <para>Set the option NAND_BBT_VERSION to enable the table version control. - It's highly recommended to enable this for mirrored tables with write - support. It makes sure that the risk of losing the bad block - table information is reduced to the loss of the information about the - one worn out block which should be marked bad. The version is stored in - 4 consecutive bytes in the spare area of the device. The position of - the version number is defined by the member veroffs in the bad block table - descriptor.</para></listitem> - <listitem><para>Save block contents on write</para> - <para> - In case that the block which holds the bad block table does contain - other useful information, set the option NAND_BBT_SAVECONTENT. When - the bad block table is written then the whole block is read the bad - block table is updated and the block is erased and everything is - written back. If this option is not set only the bad block table - is written and everything else in the block is ignored and erased. - </para></listitem> - <listitem><para>Number of reserved blocks</para> - <para> - For automatic placement some blocks must be reserved for - bad block table storage. The number of reserved blocks is defined - in the maxblocks member of the bad block table description structure. - Reserving 4 blocks for mirrored tables should be a reasonable number. - This also limits the number of blocks which are scanned for the bad - block table ident pattern. - </para></listitem> - </itemizedlist> - </para> - </sect2> - </sect1> - <sect1 id="Spare_area_placement"> - <title>Spare area (auto)placement</title> - <para> - The nand driver implements different possibilities for - placement of filesystem data in the spare area, - <itemizedlist> - <listitem><para>Placement defined by fs driver</para></listitem> - <listitem><para>Automatic placement</para></listitem> - </itemizedlist> - The default placement function is automatic placement. The - nand driver has built in default placement schemes for the - various chiptypes. If due to hardware ECC functionality the - default placement does not fit then the board driver can - provide a own placement scheme. - </para> - <para> - File system drivers can provide a own placement scheme which - is used instead of the default placement scheme. - </para> - <para> - Placement schemes are defined by a nand_oobinfo structure - <programlisting> -struct nand_oobinfo { - int useecc; - int eccbytes; - int eccpos[24]; - int oobfree[8][2]; -}; - </programlisting> - <itemizedlist> - <listitem><para>useecc</para><para> - The useecc member controls the ecc and placement function. The header - file include/mtd/mtd-abi.h contains constants to select ecc and - placement. MTD_NANDECC_OFF switches off the ecc complete. This is - not recommended and available for testing and diagnosis only. - MTD_NANDECC_PLACE selects caller defined placement, MTD_NANDECC_AUTOPLACE - selects automatic placement. - </para></listitem> - <listitem><para>eccbytes</para><para> - The eccbytes member defines the number of ecc bytes per page. - </para></listitem> - <listitem><para>eccpos</para><para> - The eccpos array holds the byte offsets in the spare area where - the ecc codes are placed. - </para></listitem> - <listitem><para>oobfree</para><para> - The oobfree array defines the areas in the spare area which can be - used for automatic placement. The information is given in the format - {offset, size}. offset defines the start of the usable area, size the - length in bytes. More than one area can be defined. The list is terminated - by an {0, 0} entry. - </para></listitem> - </itemizedlist> - </para> - <sect2 id="Placement_defined_by_fs_driver"> - <title>Placement defined by fs driver</title> - <para> - The calling function provides a pointer to a nand_oobinfo - structure which defines the ecc placement. For writes the - caller must provide a spare area buffer along with the - data buffer. The spare area buffer size is (number of pages) * - (size of spare area). For reads the buffer size is - (number of pages) * ((size of spare area) + (number of ecc - steps per page) * sizeof (int)). The driver stores the - result of the ecc check for each tuple in the spare buffer. - The storage sequence is - </para> - <para> - <spare data page 0><ecc result 0>...<ecc result n> - </para> - <para> - ... - </para> - <para> - <spare data page n><ecc result 0>...<ecc result n> - </para> - <para> - This is a legacy mode used by YAFFS1. - </para> - <para> - If the spare area buffer is NULL then only the ECC placement is - done according to the given scheme in the nand_oobinfo structure. - </para> - </sect2> - <sect2 id="Automatic_placement"> - <title>Automatic placement</title> - <para> - Automatic placement uses the built in defaults to place the - ecc bytes in the spare area. If filesystem data have to be stored / - read into the spare area then the calling function must provide a - buffer. The buffer size per page is determined by the oobfree array in - the nand_oobinfo structure. - </para> - <para> - If the spare area buffer is NULL then only the ECC placement is - done according to the default builtin scheme. - </para> - </sect2> - </sect1> - <sect1 id="Spare_area_autoplacement_default"> - <title>Spare area autoplacement default schemes</title> - <sect2 id="pagesize_256"> - <title>256 byte pagesize</title> -<informaltable><tgroup cols="3"><tbody> -<row> -<entry>Offset</entry> -<entry>Content</entry> -<entry>Comment</entry> -</row> -<row> -<entry>0x00</entry> -<entry>ECC byte 0</entry> -<entry>Error correction code byte 0</entry> -</row> -<row> -<entry>0x01</entry> -<entry>ECC byte 1</entry> -<entry>Error correction code byte 1</entry> -</row> -<row> -<entry>0x02</entry> -<entry>ECC byte 2</entry> -<entry>Error correction code byte 2</entry> -</row> -<row> -<entry>0x03</entry> -<entry>Autoplace 0</entry> -<entry></entry> -</row> -<row> -<entry>0x04</entry> -<entry>Autoplace 1</entry> -<entry></entry> -</row> -<row> -<entry>0x05</entry> -<entry>Bad block marker</entry> -<entry>If any bit in this byte is zero, then this block is bad. -This applies only to the first page in a block. In the remaining -pages this byte is reserved</entry> -</row> -<row> -<entry>0x06</entry> -<entry>Autoplace 2</entry> -<entry></entry> -</row> -<row> -<entry>0x07</entry> -<entry>Autoplace 3</entry> -<entry></entry> -</row> -</tbody></tgroup></informaltable> - </sect2> - <sect2 id="pagesize_512"> - <title>512 byte pagesize</title> -<informaltable><tgroup cols="3"><tbody> -<row> -<entry>Offset</entry> -<entry>Content</entry> -<entry>Comment</entry> -</row> -<row> -<entry>0x00</entry> -<entry>ECC byte 0</entry> -<entry>Error correction code byte 0 of the lower 256 Byte data in -this page</entry> -</row> -<row> -<entry>0x01</entry> -<entry>ECC byte 1</entry> -<entry>Error correction code byte 1 of the lower 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x02</entry> -<entry>ECC byte 2</entry> -<entry>Error correction code byte 2 of the lower 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x03</entry> -<entry>ECC byte 3</entry> -<entry>Error correction code byte 0 of the upper 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x04</entry> -<entry>reserved</entry> -<entry>reserved</entry> -</row> -<row> -<entry>0x05</entry> -<entry>Bad block marker</entry> -<entry>If any bit in this byte is zero, then this block is bad. -This applies only to the first page in a block. In the remaining -pages this byte is reserved</entry> -</row> -<row> -<entry>0x06</entry> -<entry>ECC byte 4</entry> -<entry>Error correction code byte 1 of the upper 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x07</entry> -<entry>ECC byte 5</entry> -<entry>Error correction code byte 2 of the upper 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x08 - 0x0F</entry> -<entry>Autoplace 0 - 7</entry> -<entry></entry> -</row> -</tbody></tgroup></informaltable> - </sect2> - <sect2 id="pagesize_2048"> - <title>2048 byte pagesize</title> -<informaltable><tgroup cols="3"><tbody> -<row> -<entry>Offset</entry> -<entry>Content</entry> -<entry>Comment</entry> -</row> -<row> -<entry>0x00</entry> -<entry>Bad block marker</entry> -<entry>If any bit in this byte is zero, then this block is bad. -This applies only to the first page in a block. In the remaining -pages this byte is reserved</entry> -</row> -<row> -<entry>0x01</entry> -<entry>Reserved</entry> -<entry>Reserved</entry> -</row> -<row> -<entry>0x02-0x27</entry> -<entry>Autoplace 0 - 37</entry> -<entry></entry> -</row> -<row> -<entry>0x28</entry> -<entry>ECC byte 0</entry> -<entry>Error correction code byte 0 of the first 256 Byte data in -this page</entry> -</row> -<row> -<entry>0x29</entry> -<entry>ECC byte 1</entry> -<entry>Error correction code byte 1 of the first 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x2A</entry> -<entry>ECC byte 2</entry> -<entry>Error correction code byte 2 of the first 256 Bytes data in -this page</entry> -</row> -<row> -<entry>0x2B</entry> -<entry>ECC byte 3</entry> -<entry>Error correction code byte 0 of the second 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x2C</entry> -<entry>ECC byte 4</entry> -<entry>Error correction code byte 1 of the second 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x2D</entry> -<entry>ECC byte 5</entry> -<entry>Error correction code byte 2 of the second 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x2E</entry> -<entry>ECC byte 6</entry> -<entry>Error correction code byte 0 of the third 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x2F</entry> -<entry>ECC byte 7</entry> -<entry>Error correction code byte 1 of the third 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x30</entry> -<entry>ECC byte 8</entry> -<entry>Error correction code byte 2 of the third 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x31</entry> -<entry>ECC byte 9</entry> -<entry>Error correction code byte 0 of the fourth 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x32</entry> -<entry>ECC byte 10</entry> -<entry>Error correction code byte 1 of the fourth 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x33</entry> -<entry>ECC byte 11</entry> -<entry>Error correction code byte 2 of the fourth 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x34</entry> -<entry>ECC byte 12</entry> -<entry>Error correction code byte 0 of the fifth 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x35</entry> -<entry>ECC byte 13</entry> -<entry>Error correction code byte 1 of the fifth 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x36</entry> -<entry>ECC byte 14</entry> -<entry>Error correction code byte 2 of the fifth 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x37</entry> -<entry>ECC byte 15</entry> -<entry>Error correction code byte 0 of the sixt 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x38</entry> -<entry>ECC byte 16</entry> -<entry>Error correction code byte 1 of the sixt 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x39</entry> -<entry>ECC byte 17</entry> -<entry>Error correction code byte 2 of the sixt 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x3A</entry> -<entry>ECC byte 18</entry> -<entry>Error correction code byte 0 of the seventh 256 Bytes of -data in this page</entry> -</row> -<row> -<entry>0x3B</entry> -<entry>ECC byte 19</entry> -<entry>Error correction code byte 1 of the seventh 256 Bytes of -data in this page</entry> -</row> -<row> -<entry>0x3C</entry> -<entry>ECC byte 20</entry> -<entry>Error correction code byte 2 of the seventh 256 Bytes of -data in this page</entry> -</row> -<row> -<entry>0x3D</entry> -<entry>ECC byte 21</entry> -<entry>Error correction code byte 0 of the eighth 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x3E</entry> -<entry>ECC byte 22</entry> -<entry>Error correction code byte 1 of the eighth 256 Bytes of data -in this page</entry> -</row> -<row> -<entry>0x3F</entry> -<entry>ECC byte 23</entry> -<entry>Error correction code byte 2 of the eighth 256 Bytes of data -in this page</entry> -</row> -</tbody></tgroup></informaltable> - </sect2> - </sect1> - </chapter> - - <chapter id="filesystems"> - <title>Filesystem support</title> - <para> - The NAND driver provides all necessary functions for a - filesystem via the MTD interface. - </para> - <para> - Filesystems must be aware of the NAND peculiarities and - restrictions. One major restrictions of NAND Flash is, that you cannot - write as often as you want to a page. The consecutive writes to a page, - before erasing it again, are restricted to 1-3 writes, depending on the - manufacturers specifications. This applies similar to the spare area. - </para> - <para> - Therefore NAND aware filesystems must either write in page size chunks - or hold a writebuffer to collect smaller writes until they sum up to - pagesize. Available NAND aware filesystems: JFFS2, YAFFS. - </para> - <para> - The spare area usage to store filesystem data is controlled by - the spare area placement functionality which is described in one - of the earlier chapters. - </para> - </chapter> - <chapter id="tools"> - <title>Tools</title> - <para> - The MTD project provides a couple of helpful tools to handle NAND Flash. - <itemizedlist> - <listitem><para>flasherase, flasheraseall: Erase and format FLASH partitions</para></listitem> - <listitem><para>nandwrite: write filesystem images to NAND FLASH</para></listitem> - <listitem><para>nanddump: dump the contents of a NAND FLASH partitions</para></listitem> - </itemizedlist> - </para> - <para> - These tools are aware of the NAND restrictions. Please use those tools - instead of complaining about errors which are caused by non NAND aware - access methods. - </para> - </chapter> - - <chapter id="defines"> - <title>Constants</title> - <para> - This chapter describes the constants which might be relevant for a driver developer. - </para> - <sect1 id="Chip_option_constants"> - <title>Chip option constants</title> - <sect2 id="Constants_for_chip_id_table"> - <title>Constants for chip id table</title> - <para> - These constants are defined in nand.h. They are ored together to describe - the chip functionality. - <programlisting> -/* Buswitdh is 16 bit */ -#define NAND_BUSWIDTH_16 0x00000002 -/* Device supports partial programming without padding */ -#define NAND_NO_PADDING 0x00000004 -/* Chip has cache program function */ -#define NAND_CACHEPRG 0x00000008 -/* Chip has copy back function */ -#define NAND_COPYBACK 0x00000010 -/* AND Chip which has 4 banks and a confusing page / block - * assignment. See Renesas datasheet for further information */ -#define NAND_IS_AND 0x00000020 -/* Chip has a array of 4 pages which can be read without - * additional ready /busy waits */ -#define NAND_4PAGE_ARRAY 0x00000040 - </programlisting> - </para> - </sect2> - <sect2 id="Constants_for_runtime_options"> - <title>Constants for runtime options</title> - <para> - These constants are defined in nand.h. They are ored together to describe - the functionality. - <programlisting> -/* The hw ecc generator provides a syndrome instead a ecc value on read - * This can only work if we have the ecc bytes directly behind the - * data bytes. Applies for DOC and AG-AND Renesas HW Reed Solomon generators */ -#define NAND_HWECC_SYNDROME 0x00020000 - </programlisting> - </para> - </sect2> - </sect1> - - <sect1 id="EEC_selection_constants"> - <title>ECC selection constants</title> - <para> - Use these constants to select the ECC algorithm. - <programlisting> -/* No ECC. Usage is not recommended ! */ -#define NAND_ECC_NONE 0 -/* Software ECC 3 byte ECC per 256 Byte data */ -#define NAND_ECC_SOFT 1 -/* Hardware ECC 3 byte ECC per 256 Byte data */ -#define NAND_ECC_HW3_256 2 -/* Hardware ECC 3 byte ECC per 512 Byte data */ -#define NAND_ECC_HW3_512 3 -/* Hardware ECC 6 byte ECC per 512 Byte data */ -#define NAND_ECC_HW6_512 4 -/* Hardware ECC 6 byte ECC per 512 Byte data */ -#define NAND_ECC_HW8_512 6 - </programlisting> - </para> - </sect1> - - <sect1 id="Hardware_control_related_constants"> - <title>Hardware control related constants</title> - <para> - These constants describe the requested hardware access function when - the boardspecific hardware control function is called - <programlisting> -/* Select the chip by setting nCE to low */ -#define NAND_CTL_SETNCE 1 -/* Deselect the chip by setting nCE to high */ -#define NAND_CTL_CLRNCE 2 -/* Select the command latch by setting CLE to high */ -#define NAND_CTL_SETCLE 3 -/* Deselect the command latch by setting CLE to low */ -#define NAND_CTL_CLRCLE 4 -/* Select the address latch by setting ALE to high */ -#define NAND_CTL_SETALE 5 -/* Deselect the address latch by setting ALE to low */ -#define NAND_CTL_CLRALE 6 -/* Set write protection by setting WP to high. Not used! */ -#define NAND_CTL_SETWP 7 -/* Clear write protection by setting WP to low. Not used! */ -#define NAND_CTL_CLRWP 8 - </programlisting> - </para> - </sect1> - - <sect1 id="Bad_block_table_constants"> - <title>Bad block table related constants</title> - <para> - These constants describe the options used for bad block - table descriptors. - <programlisting> -/* Options for the bad block table descriptors */ - -/* The number of bits used per block in the bbt on the device */ -#define NAND_BBT_NRBITS_MSK 0x0000000F -#define NAND_BBT_1BIT 0x00000001 -#define NAND_BBT_2BIT 0x00000002 -#define NAND_BBT_4BIT 0x00000004 -#define NAND_BBT_8BIT 0x00000008 -/* The bad block table is in the last good block of the device */ -#define NAND_BBT_LASTBLOCK 0x00000010 -/* The bbt is at the given page, else we must scan for the bbt */ -#define NAND_BBT_ABSPAGE 0x00000020 -/* bbt is stored per chip on multichip devices */ -#define NAND_BBT_PERCHIP 0x00000080 -/* bbt has a version counter at offset veroffs */ -#define NAND_BBT_VERSION 0x00000100 -/* Create a bbt if none axists */ -#define NAND_BBT_CREATE 0x00000200 -/* Write bbt if necessary */ -#define NAND_BBT_WRITE 0x00001000 -/* Read and write back block contents when writing bbt */ -#define NAND_BBT_SAVECONTENT 0x00002000 - </programlisting> - </para> - </sect1> - - </chapter> - - <chapter id="structs"> - <title>Structures</title> - <para> - This chapter contains the autogenerated documentation of the structures which are - used in the NAND driver and might be relevant for a driver developer. Each - struct member has a short description which is marked with an [XXX] identifier. - See the chapter "Documentation hints" for an explanation. - </para> -!Iinclude/linux/mtd/nand.h - </chapter> - - <chapter id="pubfunctions"> - <title>Public Functions Provided</title> - <para> - This chapter contains the autogenerated documentation of the NAND kernel API functions - which are exported. Each function has a short description which is marked with an [XXX] identifier. - See the chapter "Documentation hints" for an explanation. - </para> -!Edrivers/mtd/nand/nand_base.c -!Edrivers/mtd/nand/nand_bbt.c -!Edrivers/mtd/nand/nand_ecc.c - </chapter> - - <chapter id="intfunctions"> - <title>Internal Functions Provided</title> - <para> - This chapter contains the autogenerated documentation of the NAND driver internal functions. - Each function has a short description which is marked with an [XXX] identifier. - See the chapter "Documentation hints" for an explanation. - The functions marked with [DEFAULT] might be relevant for a board driver developer. - </para> -!Idrivers/mtd/nand/nand_base.c -!Idrivers/mtd/nand/nand_bbt.c -<!-- No internal functions for kernel-doc: -X!Idrivers/mtd/nand/nand_ecc.c ---> - </chapter> - - <chapter id="credits"> - <title>Credits</title> - <para> - The following people have contributed to the NAND driver: - <orderedlist> - <listitem><para>Steven J. Hill<email>sjhill@realitydiluted.com</email></para></listitem> - <listitem><para>David Woodhouse<email>dwmw2@infradead.org</email></para></listitem> - <listitem><para>Thomas Gleixner<email>tglx@linutronix.de</email></para></listitem> - </orderedlist> - A lot of users have provided bugfixes, improvements and helping hands for testing. - Thanks a lot. - </para> - <para> - The following people have contributed to this document: - <orderedlist> - <listitem><para>Thomas Gleixner<email>tglx@linutronix.de</email></para></listitem> - </orderedlist> - </para> - </chapter> -</book> diff --git a/Documentation/DocBook/networking.tmpl b/Documentation/DocBook/networking.tmpl deleted file mode 100644 index 29df25016c7c..000000000000 --- a/Documentation/DocBook/networking.tmpl +++ /dev/null @@ -1,111 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" - "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> - -<book id="LinuxNetworking"> - <bookinfo> - <title>Linux Networking and Network Devices APIs</title> - - <legalnotice> - <para> - This documentation is free software; you can redistribute - it and/or modify it under the terms of the GNU General Public - License as published by the Free Software Foundation; either - version 2 of the License, or (at your option) any later - version. - </para> - - <para> - This program is distributed in the hope that it will be - useful, but WITHOUT ANY WARRANTY; without even the implied - warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. - See the GNU General Public License for more details. - </para> - - <para> - You should have received a copy of the GNU General Public - License along with this program; if not, write to the Free - Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, - MA 02111-1307 USA - </para> - - <para> - For more details see the file COPYING in the source - distribution of Linux. - </para> - </legalnotice> - </bookinfo> - -<toc></toc> - - <chapter id="netcore"> - <title>Linux Networking</title> - <sect1><title>Networking Base Types</title> -!Iinclude/linux/net.h - </sect1> - <sect1><title>Socket Buffer Functions</title> -!Iinclude/linux/skbuff.h -!Iinclude/net/sock.h -!Enet/socket.c -!Enet/core/skbuff.c -!Enet/core/sock.c -!Enet/core/datagram.c -!Enet/core/stream.c - </sect1> - <sect1><title>Socket Filter</title> -!Enet/core/filter.c - </sect1> - <sect1><title>Generic Network Statistics</title> -!Iinclude/uapi/linux/gen_stats.h -!Enet/core/gen_stats.c -!Enet/core/gen_estimator.c - </sect1> - <sect1><title>SUN RPC subsystem</title> -<!-- The !D functionality is not perfect, garbage has to be protected by comments -!Dnet/sunrpc/sunrpc_syms.c ---> -!Enet/sunrpc/xdr.c -!Enet/sunrpc/svc_xprt.c -!Enet/sunrpc/xprt.c -!Enet/sunrpc/sched.c -!Enet/sunrpc/socklib.c -!Enet/sunrpc/stats.c -!Enet/sunrpc/rpc_pipe.c -!Enet/sunrpc/rpcb_clnt.c -!Enet/sunrpc/clnt.c - </sect1> - <sect1><title>WiMAX</title> -!Enet/wimax/op-msg.c -!Enet/wimax/op-reset.c -!Enet/wimax/op-rfkill.c -!Enet/wimax/stack.c -!Iinclude/net/wimax.h -!Iinclude/uapi/linux/wimax.h - </sect1> - </chapter> - - <chapter id="netdev"> - <title>Network device support</title> - <sect1><title>Driver Support</title> -!Enet/core/dev.c -!Enet/ethernet/eth.c -!Enet/sched/sch_generic.c -!Iinclude/linux/etherdevice.h -!Iinclude/linux/netdevice.h - </sect1> - <sect1><title>PHY Support</title> -!Edrivers/net/phy/phy.c -!Idrivers/net/phy/phy.c -!Edrivers/net/phy/phy_device.c -!Idrivers/net/phy/phy_device.c -!Edrivers/net/phy/mdio_bus.c -!Idrivers/net/phy/mdio_bus.c - </sect1> -<!-- FIXME: Removed for now since no structured comments in source - <sect1><title>Wireless</title> -X!Enet/core/wireless.c - </sect1> ---> - </chapter> - -</book> diff --git a/Documentation/DocBook/rapidio.tmpl b/Documentation/DocBook/rapidio.tmpl deleted file mode 100644 index ac3cca3399a1..000000000000 --- a/Documentation/DocBook/rapidio.tmpl +++ /dev/null @@ -1,155 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" - "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" [ - <!ENTITY rapidio SYSTEM "rapidio.xml"> - ]> - -<book id="RapidIO-Guide"> - <bookinfo> - <title>RapidIO Subsystem Guide</title> - - <authorgroup> - <author> - <firstname>Matt</firstname> - <surname>Porter</surname> - <affiliation> - <address> - <email>mporter@kernel.crashing.org</email> - <email>mporter@mvista.com</email> - </address> - </affiliation> - </author> - </authorgroup> - - <copyright> - <year>2005</year> - <holder>MontaVista Software, Inc.</holder> - </copyright> - - <legalnotice> - <para> - This documentation is free software; you can redistribute - it and/or modify it under the terms of the GNU General Public - License version 2 as published by the Free Software Foundation. - </para> - - <para> - This program is distributed in the hope that it will be - useful, but WITHOUT ANY WARRANTY; without even the implied - warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. - See the GNU General Public License for more details. - </para> - - <para> - You should have received a copy of the GNU General Public - License along with this program; if not, write to the Free - Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, - MA 02111-1307 USA - </para> - - <para> - For more details see the file COPYING in the source - distribution of Linux. - </para> - </legalnotice> - </bookinfo> - -<toc></toc> - - <chapter id="intro"> - <title>Introduction</title> - <para> - RapidIO is a high speed switched fabric interconnect with - features aimed at the embedded market. RapidIO provides - support for memory-mapped I/O as well as message-based - transactions over the switched fabric network. RapidIO has - a standardized discovery mechanism not unlike the PCI bus - standard that allows simple detection of devices in a - network. - </para> - <para> - This documentation is provided for developers intending - to support RapidIO on new architectures, write new drivers, - or to understand the subsystem internals. - </para> - </chapter> - - <chapter id="bugs"> - <title>Known Bugs and Limitations</title> - - <sect1 id="known_bugs"> - <title>Bugs</title> - <para>None. ;)</para> - </sect1> - <sect1 id="Limitations"> - <title>Limitations</title> - <para> - <orderedlist> - <listitem><para>Access/management of RapidIO memory regions is not supported</para></listitem> - <listitem><para>Multiple host enumeration is not supported</para></listitem> - </orderedlist> - </para> - </sect1> - </chapter> - - <chapter id="drivers"> - <title>RapidIO driver interface</title> - <para> - Drivers are provided a set of calls in order - to interface with the subsystem to gather info - on devices, request/map memory region resources, - and manage mailboxes/doorbells. - </para> - <sect1 id="Functions"> - <title>Functions</title> -!Iinclude/linux/rio_drv.h -!Edrivers/rapidio/rio-driver.c -!Edrivers/rapidio/rio.c - </sect1> - </chapter> - - <chapter id="internals"> - <title>Internals</title> - - <para> - This chapter contains the autogenerated documentation of the RapidIO - subsystem. - </para> - - <sect1 id="Structures"><title>Structures</title> -!Iinclude/linux/rio.h - </sect1> - <sect1 id="Enumeration_and_Discovery"><title>Enumeration and Discovery</title> -!Idrivers/rapidio/rio-scan.c - </sect1> - <sect1 id="Driver_functionality"><title>Driver functionality</title> -!Idrivers/rapidio/rio.c -!Idrivers/rapidio/rio-access.c - </sect1> - <sect1 id="Device_model_support"><title>Device model support</title> -!Idrivers/rapidio/rio-driver.c - </sect1> - <sect1 id="PPC32_support"><title>PPC32 support</title> -!Iarch/powerpc/sysdev/fsl_rio.c - </sect1> - </chapter> - - <chapter id="credits"> - <title>Credits</title> - <para> - The following people have contributed to the RapidIO - subsystem directly or indirectly: - <orderedlist> - <listitem><para>Matt Porter<email>mporter@kernel.crashing.org</email></para></listitem> - <listitem><para>Randy Vinson<email>rvinson@mvista.com</email></para></listitem> - <listitem><para>Dan Malek<email>dan@embeddedalley.com</email></para></listitem> - </orderedlist> - </para> - <para> - The following people have contributed to this document: - <orderedlist> - <listitem><para>Matt Porter<email>mporter@kernel.crashing.org</email></para></listitem> - </orderedlist> - </para> - </chapter> -</book> diff --git a/Documentation/DocBook/s390-drivers.tmpl b/Documentation/DocBook/s390-drivers.tmpl deleted file mode 100644 index 95bfc12e5439..000000000000 --- a/Documentation/DocBook/s390-drivers.tmpl +++ /dev/null @@ -1,161 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" - "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> - -<book id="s390drivers"> - <bookinfo> - <title>Writing s390 channel device drivers</title> - - <authorgroup> - <author> - <firstname>Cornelia</firstname> - <surname>Huck</surname> - <affiliation> - <address> - <email>cornelia.huck@de.ibm.com</email> - </address> - </affiliation> - </author> - </authorgroup> - - <copyright> - <year>2007</year> - <holder>IBM Corp.</holder> - </copyright> - - <legalnotice> - <para> - This documentation is free software; you can redistribute - it and/or modify it under the terms of the GNU General Public - License as published by the Free Software Foundation; either - version 2 of the License, or (at your option) any later - version. - </para> - - <para> - This program is distributed in the hope that it will be - useful, but WITHOUT ANY WARRANTY; without even the implied - warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. - See the GNU General Public License for more details. - </para> - - <para> - You should have received a copy of the GNU General Public - License along with this program; if not, write to the Free - Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, - MA 02111-1307 USA - </para> - - <para> - For more details see the file COPYING in the source - distribution of Linux. - </para> - </legalnotice> - </bookinfo> - -<toc></toc> - - <chapter id="intro"> - <title>Introduction</title> - <para> - This document describes the interfaces available for device drivers that - drive s390 based channel attached I/O devices. This includes interfaces for - interaction with the hardware and interfaces for interacting with the - common driver core. Those interfaces are provided by the s390 common I/O - layer. - </para> - <para> - The document assumes a familarity with the technical terms associated - with the s390 channel I/O architecture. For a description of this - architecture, please refer to the "z/Architecture: Principles of - Operation", IBM publication no. SA22-7832. - </para> - <para> - While most I/O devices on a s390 system are typically driven through the - channel I/O mechanism described here, there are various other methods - (like the diag interface). These are out of the scope of this document. - </para> - <para> - Some additional information can also be found in the kernel source - under Documentation/s390/driver-model.txt. - </para> - </chapter> - <chapter id="ccw"> - <title>The ccw bus</title> - <para> - The ccw bus typically contains the majority of devices available to - a s390 system. Named after the channel command word (ccw), the basic - command structure used to address its devices, the ccw bus contains - so-called channel attached devices. They are addressed via I/O - subchannels, visible on the css bus. A device driver for - channel-attached devices, however, will never interact with the - subchannel directly, but only via the I/O device on the ccw bus, - the ccw device. - </para> - <sect1 id="channelIO"> - <title>I/O functions for channel-attached devices</title> - <para> - Some hardware structures have been translated into C structures for use - by the common I/O layer and device drivers. For more information on - the hardware structures represented here, please consult the Principles - of Operation. - </para> -!Iarch/s390/include/asm/cio.h - </sect1> - <sect1 id="ccwdev"> - <title>ccw devices</title> - <para> - Devices that want to initiate channel I/O need to attach to the ccw bus. - Interaction with the driver core is done via the common I/O layer, which - provides the abstractions of ccw devices and ccw device drivers. - </para> - <para> - The functions that initiate or terminate channel I/O all act upon a - ccw device structure. Device drivers must not bypass those functions - or strange side effects may happen. - </para> -!Iarch/s390/include/asm/ccwdev.h -!Edrivers/s390/cio/device.c -!Edrivers/s390/cio/device_ops.c - </sect1> - <sect1 id="cmf"> - <title>The channel-measurement facility</title> - <para> - The channel-measurement facility provides a means to collect - measurement data which is made available by the channel subsystem - for each channel attached device. - </para> -!Iarch/s390/include/asm/cmb.h -!Edrivers/s390/cio/cmf.c - </sect1> - </chapter> - - <chapter id="ccwgroup"> - <title>The ccwgroup bus</title> - <para> - The ccwgroup bus only contains artificial devices, created by the user. - Many networking devices (e.g. qeth) are in fact composed of several - ccw devices (like read, write and data channel for qeth). The - ccwgroup bus provides a mechanism to create a meta-device which - contains those ccw devices as slave devices and can be associated - with the netdevice. - </para> - <sect1 id="ccwgroupdevices"> - <title>ccw group devices</title> -!Iarch/s390/include/asm/ccwgroup.h -!Edrivers/s390/cio/ccwgroup.c - </sect1> - </chapter> - - <chapter id="genericinterfaces"> - <title>Generic interfaces</title> - <para> - Some interfaces are available to other drivers that do not necessarily - have anything to do with the busses described above, but still are - indirectly using basic infrastructure in the common I/O layer. - One example is the support for adapter interrupts. - </para> -!Edrivers/s390/cio/airq.c - </chapter> - -</book> diff --git a/Documentation/DocBook/scsi.tmpl b/Documentation/DocBook/scsi.tmpl deleted file mode 100644 index 4b9b9b286cea..000000000000 --- a/Documentation/DocBook/scsi.tmpl +++ /dev/null @@ -1,409 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" - "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> - -<book id="scsimid"> - <bookinfo> - <title>SCSI Interfaces Guide</title> - - <authorgroup> - <author> - <firstname>James</firstname> - <surname>Bottomley</surname> - <affiliation> - <address> - <email>James.Bottomley@hansenpartnership.com</email> - </address> - </affiliation> - </author> - - <author> - <firstname>Rob</firstname> - <surname>Landley</surname> - <affiliation> - <address> - <email>rob@landley.net</email> - </address> - </affiliation> - </author> - - </authorgroup> - - <copyright> - <year>2007</year> - <holder>Linux Foundation</holder> - </copyright> - - <legalnotice> - <para> - This documentation is free software; you can redistribute - it and/or modify it under the terms of the GNU General Public - License version 2. - </para> - - <para> - This program is distributed in the hope that it will be - useful, but WITHOUT ANY WARRANTY; without even the implied - warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. - For more details see the file COPYING in the source - distribution of Linux. - </para> - </legalnotice> - </bookinfo> - - <toc></toc> - - <chapter id="intro"> - <title>Introduction</title> - <sect1 id="protocol_vs_bus"> - <title>Protocol vs bus</title> - <para> - Once upon a time, the Small Computer Systems Interface defined both - a parallel I/O bus and a data protocol to connect a wide variety of - peripherals (disk drives, tape drives, modems, printers, scanners, - optical drives, test equipment, and medical devices) to a host - computer. - </para> - <para> - Although the old parallel (fast/wide/ultra) SCSI bus has largely - fallen out of use, the SCSI command set is more widely used than ever - to communicate with devices over a number of different busses. - </para> - <para> - The <ulink url='http://www.t10.org/scsi-3.htm'>SCSI protocol</ulink> - is a big-endian peer-to-peer packet based protocol. SCSI commands - are 6, 10, 12, or 16 bytes long, often followed by an associated data - payload. - </para> - <para> - SCSI commands can be transported over just about any kind of bus, and - are the default protocol for storage devices attached to USB, SATA, - SAS, Fibre Channel, FireWire, and ATAPI devices. SCSI packets are - also commonly exchanged over Infiniband, - <ulink url='http://i2o.shadowconnect.com/faq.php'>I20</ulink>, TCP/IP - (<ulink url='https://en.wikipedia.org/wiki/ISCSI'>iSCSI</ulink>), even - <ulink url='http://cyberelk.net/tim/parport/parscsi.html'>Parallel - ports</ulink>. - </para> - </sect1> - <sect1 id="subsystem_design"> - <title>Design of the Linux SCSI subsystem</title> - <para> - The SCSI subsystem uses a three layer design, with upper, mid, and low - layers. Every operation involving the SCSI subsystem (such as reading - a sector from a disk) uses one driver at each of the 3 levels: one - upper layer driver, one lower layer driver, and the SCSI midlayer. - </para> - <para> - The SCSI upper layer provides the interface between userspace and the - kernel, in the form of block and char device nodes for I/O and - ioctl(). The SCSI lower layer contains drivers for specific hardware - devices. - </para> - <para> - In between is the SCSI mid-layer, analogous to a network routing - layer such as the IPv4 stack. The SCSI mid-layer routes a packet - based data protocol between the upper layer's /dev nodes and the - corresponding devices in the lower layer. It manages command queues, - provides error handling and power management functions, and responds - to ioctl() requests. - </para> - </sect1> - </chapter> - - <chapter id="upper_layer"> - <title>SCSI upper layer</title> - <para> - The upper layer supports the user-kernel interface by providing - device nodes. - </para> - <sect1 id="sd"> - <title>sd (SCSI Disk)</title> - <para>sd (sd_mod.o)</para> -<!-- !Idrivers/scsi/sd.c --> - </sect1> - <sect1 id="sr"> - <title>sr (SCSI CD-ROM)</title> - <para>sr (sr_mod.o)</para> - </sect1> - <sect1 id="st"> - <title>st (SCSI Tape)</title> - <para>st (st.o)</para> - </sect1> - <sect1 id="sg"> - <title>sg (SCSI Generic)</title> - <para>sg (sg.o)</para> - </sect1> - <sect1 id="ch"> - <title>ch (SCSI Media Changer)</title> - <para>ch (ch.c)</para> - </sect1> - </chapter> - - <chapter id="mid_layer"> - <title>SCSI mid layer</title> - - <sect1 id="midlayer_implementation"> - <title>SCSI midlayer implementation</title> - <sect2 id="scsi_device.h"> - <title>include/scsi/scsi_device.h</title> - <para> - </para> -!Iinclude/scsi/scsi_device.h - </sect2> - - <sect2 id="scsi.c"> - <title>drivers/scsi/scsi.c</title> - <para>Main file for the SCSI midlayer.</para> -!Edrivers/scsi/scsi.c - </sect2> - <sect2 id="scsicam.c"> - <title>drivers/scsi/scsicam.c</title> - <para> - <ulink url='http://www.t10.org/ftp/t10/drafts/cam/cam-r12b.pdf'>SCSI - Common Access Method</ulink> support functions, for use with - HDIO_GETGEO, etc. - </para> -!Edrivers/scsi/scsicam.c - </sect2> - <sect2 id="scsi_error.c"> - <title>drivers/scsi/scsi_error.c</title> - <para>Common SCSI error/timeout handling routines.</para> -!Edrivers/scsi/scsi_error.c - </sect2> - <sect2 id="scsi_devinfo.c"> - <title>drivers/scsi/scsi_devinfo.c</title> - <para> - Manage scsi_dev_info_list, which tracks blacklisted and whitelisted - devices. - </para> -!Idrivers/scsi/scsi_devinfo.c - </sect2> - <sect2 id="scsi_ioctl.c"> - <title>drivers/scsi/scsi_ioctl.c</title> - <para> - Handle ioctl() calls for SCSI devices. - </para> -!Edrivers/scsi/scsi_ioctl.c - </sect2> - <sect2 id="scsi_lib.c"> - <title>drivers/scsi/scsi_lib.c</title> - <para> - SCSI queuing library. - </para> -!Edrivers/scsi/scsi_lib.c - </sect2> - <sect2 id="scsi_lib_dma.c"> - <title>drivers/scsi/scsi_lib_dma.c</title> - <para> - SCSI library functions depending on DMA - (map and unmap scatter-gather lists). - </para> -!Edrivers/scsi/scsi_lib_dma.c - </sect2> - <sect2 id="scsi_module.c"> - <title>drivers/scsi/scsi_module.c</title> - <para> - The file drivers/scsi/scsi_module.c contains legacy support for - old-style host templates. It should never be used by any new driver. - </para> - </sect2> - <sect2 id="scsi_proc.c"> - <title>drivers/scsi/scsi_proc.c</title> - <para> - The functions in this file provide an interface between - the PROC file system and the SCSI device drivers - It is mainly used for debugging, statistics and to pass - information directly to the lowlevel driver. - - I.E. plumbing to manage /proc/scsi/* - </para> -!Idrivers/scsi/scsi_proc.c - </sect2> - <sect2 id="scsi_netlink.c"> - <title>drivers/scsi/scsi_netlink.c</title> - <para> - Infrastructure to provide async events from transports to userspace - via netlink, using a single NETLINK_SCSITRANSPORT protocol for all - transports. - - See <ulink url='http://marc.info/?l=linux-scsi&m=115507374832500&w=2'>the - original patch submission</ulink> for more details. - </para> -!Idrivers/scsi/scsi_netlink.c - </sect2> - <sect2 id="scsi_scan.c"> - <title>drivers/scsi/scsi_scan.c</title> - <para> - Scan a host to determine which (if any) devices are attached. - - The general scanning/probing algorithm is as follows, exceptions are - made to it depending on device specific flags, compilation options, - and global variable (boot or module load time) settings. - - A specific LUN is scanned via an INQUIRY command; if the LUN has a - device attached, a scsi_device is allocated and setup for it. - - For every id of every channel on the given host, start by scanning - LUN 0. Skip hosts that don't respond at all to a scan of LUN 0. - Otherwise, if LUN 0 has a device attached, allocate and setup a - scsi_device for it. If target is SCSI-3 or up, issue a REPORT LUN, - and scan all of the LUNs returned by the REPORT LUN; else, - sequentially scan LUNs up until some maximum is reached, or a LUN is - seen that cannot have a device attached to it. - </para> -!Idrivers/scsi/scsi_scan.c - </sect2> - <sect2 id="scsi_sysctl.c"> - <title>drivers/scsi/scsi_sysctl.c</title> - <para> - Set up the sysctl entry: "/dev/scsi/logging_level" - (DEV_SCSI_LOGGING_LEVEL) which sets/returns scsi_logging_level. - </para> - </sect2> - <sect2 id="scsi_sysfs.c"> - <title>drivers/scsi/scsi_sysfs.c</title> - <para> - SCSI sysfs interface routines. - </para> -!Edrivers/scsi/scsi_sysfs.c - </sect2> - <sect2 id="hosts.c"> - <title>drivers/scsi/hosts.c</title> - <para> - mid to lowlevel SCSI driver interface - </para> -!Edrivers/scsi/hosts.c - </sect2> - <sect2 id="constants.c"> - <title>drivers/scsi/constants.c</title> - <para> - mid to lowlevel SCSI driver interface - </para> -!Edrivers/scsi/constants.c - </sect2> - </sect1> - - <sect1 id="Transport_classes"> - <title>Transport classes</title> - <para> - Transport classes are service libraries for drivers in the SCSI - lower layer, which expose transport attributes in sysfs. - </para> - <sect2 id="Fibre_Channel_transport"> - <title>Fibre Channel transport</title> - <para> - The file drivers/scsi/scsi_transport_fc.c defines transport attributes - for Fibre Channel. - </para> -!Edrivers/scsi/scsi_transport_fc.c - </sect2> - <sect2 id="iSCSI_transport"> - <title>iSCSI transport class</title> - <para> - The file drivers/scsi/scsi_transport_iscsi.c defines transport - attributes for the iSCSI class, which sends SCSI packets over TCP/IP - connections. - </para> -!Edrivers/scsi/scsi_transport_iscsi.c - </sect2> - <sect2 id="SAS_transport"> - <title>Serial Attached SCSI (SAS) transport class</title> - <para> - The file drivers/scsi/scsi_transport_sas.c defines transport - attributes for Serial Attached SCSI, a variant of SATA aimed at - large high-end systems. - </para> - <para> - The SAS transport class contains common code to deal with SAS HBAs, - an aproximated representation of SAS topologies in the driver model, - and various sysfs attributes to expose these topologies and management - interfaces to userspace. - </para> - <para> - In addition to the basic SCSI core objects this transport class - introduces two additional intermediate objects: The SAS PHY - as represented by struct sas_phy defines an "outgoing" PHY on - a SAS HBA or Expander, and the SAS remote PHY represented by - struct sas_rphy defines an "incoming" PHY on a SAS Expander or - end device. Note that this is purely a software concept, the - underlying hardware for a PHY and a remote PHY is the exactly - the same. - </para> - <para> - There is no concept of a SAS port in this code, users can see - what PHYs form a wide port based on the port_identifier attribute, - which is the same for all PHYs in a port. - </para> -!Edrivers/scsi/scsi_transport_sas.c - </sect2> - <sect2 id="SATA_transport"> - <title>SATA transport class</title> - <para> - The SATA transport is handled by libata, which has its own book of - documentation in this directory. - </para> - </sect2> - <sect2 id="SPI_transport"> - <title>Parallel SCSI (SPI) transport class</title> - <para> - The file drivers/scsi/scsi_transport_spi.c defines transport - attributes for traditional (fast/wide/ultra) SCSI busses. - </para> -!Edrivers/scsi/scsi_transport_spi.c - </sect2> - <sect2 id="SRP_transport"> - <title>SCSI RDMA (SRP) transport class</title> - <para> - The file drivers/scsi/scsi_transport_srp.c defines transport - attributes for SCSI over Remote Direct Memory Access. - </para> -!Edrivers/scsi/scsi_transport_srp.c - </sect2> - </sect1> - - </chapter> - - <chapter id="lower_layer"> - <title>SCSI lower layer</title> - <sect1 id="hba_drivers"> - <title>Host Bus Adapter transport types</title> - <para> - Many modern device controllers use the SCSI command set as a protocol to - communicate with their devices through many different types of physical - connections. - </para> - <para> - In SCSI language a bus capable of carrying SCSI commands is - called a "transport", and a controller connecting to such a bus is - called a "host bus adapter" (HBA). - </para> - <sect2 id="scsi_debug.c"> - <title>Debug transport</title> - <para> - The file drivers/scsi/scsi_debug.c simulates a host adapter with a - variable number of disks (or disk like devices) attached, sharing a - common amount of RAM. Does a lot of checking to make sure that we are - not getting blocks mixed up, and panics the kernel if anything out of - the ordinary is seen. - </para> - <para> - To be more realistic, the simulated devices have the transport - attributes of SAS disks. - </para> - <para> - For documentation see - <ulink url='http://sg.danny.cz/sg/sdebug26.html'>http://sg.danny.cz/sg/sdebug26.html</ulink> - </para> -<!-- !Edrivers/scsi/scsi_debug.c --> - </sect2> - <sect2 id="todo"> - <title>todo</title> - <para>Parallel (fast/wide/ultra) SCSI, USB, SATA, - SAS, Fibre Channel, FireWire, ATAPI devices, Infiniband, - I20, iSCSI, Parallel ports, netlink... - </para> - </sect2> - </sect1> - </chapter> -</book> diff --git a/Documentation/DocBook/sh.tmpl b/Documentation/DocBook/sh.tmpl deleted file mode 100644 index 4a38f604fa66..000000000000 --- a/Documentation/DocBook/sh.tmpl +++ /dev/null @@ -1,105 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" - "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> - -<book id="sh-drivers"> - <bookinfo> - <title>SuperH Interfaces Guide</title> - - <authorgroup> - <author> - <firstname>Paul</firstname> - <surname>Mundt</surname> - <affiliation> - <address> - <email>lethal@linux-sh.org</email> - </address> - </affiliation> - </author> - </authorgroup> - - <copyright> - <year>2008-2010</year> - <holder>Paul Mundt</holder> - </copyright> - <copyright> - <year>2008-2010</year> - <holder>Renesas Technology Corp.</holder> - </copyright> - <copyright> - <year>2010</year> - <holder>Renesas Electronics Corp.</holder> - </copyright> - - <legalnotice> - <para> - This documentation is free software; you can redistribute - it and/or modify it under the terms of the GNU General Public - License version 2 as published by the Free Software Foundation. - </para> - - <para> - This program is distributed in the hope that it will be - useful, but WITHOUT ANY WARRANTY; without even the implied - warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. - See the GNU General Public License for more details. - </para> - - <para> - You should have received a copy of the GNU General Public - License along with this program; if not, write to the Free - Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, - MA 02111-1307 USA - </para> - - <para> - For more details see the file COPYING in the source - distribution of Linux. - </para> - </legalnotice> - </bookinfo> - -<toc></toc> - - <chapter id="mm"> - <title>Memory Management</title> - <sect1 id="sh4"> - <title>SH-4</title> - <sect2 id="sq"> - <title>Store Queue API</title> -!Earch/sh/kernel/cpu/sh4/sq.c - </sect2> - </sect1> - <sect1 id="sh5"> - <title>SH-5</title> - <sect2 id="tlb"> - <title>TLB Interfaces</title> -!Iarch/sh/mm/tlb-sh5.c -!Iarch/sh/include/asm/tlb_64.h - </sect2> - </sect1> - </chapter> - <chapter id="mach"> - <title>Machine Specific Interfaces</title> - <sect1 id="dreamcast"> - <title>mach-dreamcast</title> -!Iarch/sh/boards/mach-dreamcast/rtc.c - </sect1> - <sect1 id="x3proto"> - <title>mach-x3proto</title> -!Earch/sh/boards/mach-x3proto/ilsel.c - </sect1> - </chapter> - <chapter id="busses"> - <title>Busses</title> - <sect1 id="superhyway"> - <title>SuperHyway</title> -!Edrivers/sh/superhyway/superhyway.c - </sect1> - - <sect1 id="maple"> - <title>Maple</title> -!Edrivers/sh/maple/maple.c - </sect1> - </chapter> -</book> diff --git a/Documentation/DocBook/stylesheet.xsl b/Documentation/DocBook/stylesheet.xsl deleted file mode 100644 index 3bf4ecf3d760..000000000000 --- a/Documentation/DocBook/stylesheet.xsl +++ /dev/null @@ -1,11 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<stylesheet xmlns="http://www.w3.org/1999/XSL/Transform" version="1.0"> -<param name="chunk.quietly">1</param> -<param name="funcsynopsis.style">ansi</param> -<param name="funcsynopsis.tabular.threshold">80</param> -<param name="callout.graphics">0</param> -<!-- <param name="paper.type">A4</param> --> -<param name="generate.consistent.ids">1</param> -<param name="generate.section.toc.level">2</param> -<param name="use.id.as.filename">1</param> -</stylesheet> diff --git a/Documentation/DocBook/w1.tmpl b/Documentation/DocBook/w1.tmpl deleted file mode 100644 index b0228d4c81bb..000000000000 --- a/Documentation/DocBook/w1.tmpl +++ /dev/null @@ -1,101 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" - "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> - -<book id="w1id"> - <bookinfo> - <title>W1: Dallas' 1-wire bus</title> - - <authorgroup> - <author> - <firstname>David</firstname> - <surname>Fries</surname> - <affiliation> - <address> - <email>David@Fries.net</email> - </address> - </affiliation> - </author> - - </authorgroup> - - <copyright> - <year>2013</year> - <!-- - <holder></holder> - --> - </copyright> - - <legalnotice> - <para> - This documentation is free software; you can redistribute - it and/or modify it under the terms of the GNU General Public - License version 2. - </para> - - <para> - This program is distributed in the hope that it will be - useful, but WITHOUT ANY WARRANTY; without even the implied - warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. - For more details see the file COPYING in the source - distribution of Linux. - </para> - </legalnotice> - </bookinfo> - - <toc></toc> - - <chapter id="w1_internal"> - <title>W1 API internal to the kernel</title> - - <sect1 id="w1_internal_api"> - <title>W1 API internal to the kernel</title> - <sect2 id="w1.h"> - <title>drivers/w1/w1.h</title> - <para>W1 core functions.</para> -!Idrivers/w1/w1.h - </sect2> - - <sect2 id="w1.c"> - <title>drivers/w1/w1.c</title> - <para>W1 core functions.</para> -!Idrivers/w1/w1.c - </sect2> - - <sect2 id="w1_family.h"> - <title>drivers/w1/w1_family.h</title> - <para>Allows registering device family operations.</para> -!Idrivers/w1/w1_family.h - </sect2> - - <sect2 id="w1_family.c"> - <title>drivers/w1/w1_family.c</title> - <para>Allows registering device family operations.</para> -!Edrivers/w1/w1_family.c - </sect2> - - <sect2 id="w1_int.c"> - <title>drivers/w1/w1_int.c</title> - <para>W1 internal initialization for master devices.</para> -!Edrivers/w1/w1_int.c - </sect2> - - <sect2 id="w1_netlink.h"> - <title>drivers/w1/w1_netlink.h</title> - <para>W1 external netlink API structures and commands.</para> -!Idrivers/w1/w1_netlink.h - </sect2> - - <sect2 id="w1_io.c"> - <title>drivers/w1/w1_io.c</title> - <para>W1 input/output.</para> -!Edrivers/w1/w1_io.c -!Idrivers/w1/w1_io.c - </sect2> - - </sect1> - - - </chapter> - -</book> diff --git a/Documentation/DocBook/z8530book.tmpl b/Documentation/DocBook/z8530book.tmpl deleted file mode 100644 index 6f3883be877e..000000000000 --- a/Documentation/DocBook/z8530book.tmpl +++ /dev/null @@ -1,371 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" - "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> - -<book id="Z85230Guide"> - <bookinfo> - <title>Z8530 Programming Guide</title> - - <authorgroup> - <author> - <firstname>Alan</firstname> - <surname>Cox</surname> - <affiliation> - <address> - <email>alan@lxorguk.ukuu.org.uk</email> - </address> - </affiliation> - </author> - </authorgroup> - - <copyright> - <year>2000</year> - <holder>Alan Cox</holder> - </copyright> - - <legalnotice> - <para> - This documentation is free software; you can redistribute - it and/or modify it under the terms of the GNU General Public - License as published by the Free Software Foundation; either - version 2 of the License, or (at your option) any later - version. - </para> - - <para> - This program is distributed in the hope that it will be - useful, but WITHOUT ANY WARRANTY; without even the implied - warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. - See the GNU General Public License for more details. - </para> - - <para> - You should have received a copy of the GNU General Public - License along with this program; if not, write to the Free - Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, - MA 02111-1307 USA - </para> - - <para> - For more details see the file COPYING in the source - distribution of Linux. - </para> - </legalnotice> - </bookinfo> - -<toc></toc> - - <chapter id="intro"> - <title>Introduction</title> - <para> - The Z85x30 family synchronous/asynchronous controller chips are - used on a large number of cheap network interface cards. The - kernel provides a core interface layer that is designed to make - it easy to provide WAN services using this chip. - </para> - <para> - The current driver only support synchronous operation. Merging the - asynchronous driver support into this code to allow any Z85x30 - device to be used as both a tty interface and as a synchronous - controller is a project for Linux post the 2.4 release - </para> - </chapter> - - <chapter id="Driver_Modes"> - <title>Driver Modes</title> - <para> - The Z85230 driver layer can drive Z8530, Z85C30 and Z85230 devices - in three different modes. Each mode can be applied to an individual - channel on the chip (each chip has two channels). - </para> - <para> - The PIO synchronous mode supports the most common Z8530 wiring. Here - the chip is interface to the I/O and interrupt facilities of the - host machine but not to the DMA subsystem. When running PIO the - Z8530 has extremely tight timing requirements. Doing high speeds, - even with a Z85230 will be tricky. Typically you should expect to - achieve at best 9600 baud with a Z8C530 and 64Kbits with a Z85230. - </para> - <para> - The DMA mode supports the chip when it is configured to use dual DMA - channels on an ISA bus. The better cards tend to support this mode - of operation for a single channel. With DMA running the Z85230 tops - out when it starts to hit ISA DMA constraints at about 512Kbits. It - is worth noting here that many PC machines hang or crash when the - chip is driven fast enough to hold the ISA bus solid. - </para> - <para> - Transmit DMA mode uses a single DMA channel. The DMA channel is used - for transmission as the transmit FIFO is smaller than the receive - FIFO. it gives better performance than pure PIO mode but is nowhere - near as ideal as pure DMA mode. - </para> - </chapter> - - <chapter id="Using_the_Z85230_driver"> - <title>Using the Z85230 driver</title> - <para> - The Z85230 driver provides the back end interface to your board. To - configure a Z8530 interface you need to detect the board and to - identify its ports and interrupt resources. It is also your problem - to verify the resources are available. - </para> - <para> - Having identified the chip you need to fill in a struct z8530_dev, - which describes each chip. This object must exist until you finally - shutdown the board. Firstly zero the active field. This ensures - nothing goes off without you intending it. The irq field should - be set to the interrupt number of the chip. (Each chip has a single - interrupt source rather than each channel). You are responsible - for allocating the interrupt line. The interrupt handler should be - set to <function>z8530_interrupt</function>. The device id should - be set to the z8530_dev structure pointer. Whether the interrupt can - be shared or not is board dependent, and up to you to initialise. - </para> - <para> - The structure holds two channel structures. - Initialise chanA.ctrlio and chanA.dataio with the address of the - control and data ports. You can or this with Z8530_PORT_SLEEP to - indicate your interface needs the 5uS delay for chip settling done - in software. The PORT_SLEEP option is architecture specific. Other - flags may become available on future platforms, eg for MMIO. - Initialise the chanA.irqs to &z8530_nop to start the chip up - as disabled and discarding interrupt events. This ensures that - stray interrupts will be mopped up and not hang the bus. Set - chanA.dev to point to the device structure itself. The - private and name field you may use as you wish. The private field - is unused by the Z85230 layer. The name is used for error reporting - and it may thus make sense to make it match the network name. - </para> - <para> - Repeat the same operation with the B channel if your chip has - both channels wired to something useful. This isn't always the - case. If it is not wired then the I/O values do not matter, but - you must initialise chanB.dev. - </para> - <para> - If your board has DMA facilities then initialise the txdma and - rxdma fields for the relevant channels. You must also allocate the - ISA DMA channels and do any necessary board level initialisation - to configure them. The low level driver will do the Z8530 and - DMA controller programming but not board specific magic. - </para> - <para> - Having initialised the device you can then call - <function>z8530_init</function>. This will probe the chip and - reset it into a known state. An identification sequence is then - run to identify the chip type. If the checks fail to pass the - function returns a non zero error code. Typically this indicates - that the port given is not valid. After this call the - type field of the z8530_dev structure is initialised to either - Z8530, Z85C30 or Z85230 according to the chip found. - </para> - <para> - Once you have called z8530_init you can also make use of the utility - function <function>z8530_describe</function>. This provides a - consistent reporting format for the Z8530 devices, and allows all - the drivers to provide consistent reporting. - </para> - </chapter> - - <chapter id="Attaching_Network_Interfaces"> - <title>Attaching Network Interfaces</title> - <para> - If you wish to use the network interface facilities of the driver, - then you need to attach a network device to each channel that is - present and in use. In addition to use the generic HDLC - you need to follow some additional plumbing rules. They may seem - complex but a look at the example hostess_sv11 driver should - reassure you. - </para> - <para> - The network device used for each channel should be pointed to by - the netdevice field of each channel. The hdlc-> priv field of the - network device points to your private data - you will need to be - able to find your private data from this. - </para> - <para> - The way most drivers approach this particular problem is to - create a structure holding the Z8530 device definition and - put that into the private field of the network device. The - network device fields of the channels then point back to the - network devices. - </para> - <para> - If you wish to use the generic HDLC then you need to register - the HDLC device. - </para> - <para> - Before you register your network device you will also need to - provide suitable handlers for most of the network device callbacks. - See the network device documentation for more details on this. - </para> - </chapter> - - <chapter id="Configuring_And_Activating_The_Port"> - <title>Configuring And Activating The Port</title> - <para> - The Z85230 driver provides helper functions and tables to load the - port registers on the Z8530 chips. When programming the register - settings for a channel be aware that the documentation recommends - initialisation orders. Strange things happen when these are not - followed. - </para> - <para> - <function>z8530_channel_load</function> takes an array of - pairs of initialisation values in an array of u8 type. The first - value is the Z8530 register number. Add 16 to indicate the alternate - register bank on the later chips. The array is terminated by a 255. - </para> - <para> - The driver provides a pair of public tables. The - z8530_hdlc_kilostream table is for the UK 'Kilostream' service and - also happens to cover most other end host configurations. The - z8530_hdlc_kilostream_85230 table is the same configuration using - the enhancements of the 85230 chip. The configuration loaded is - standard NRZ encoded synchronous data with HDLC bitstuffing. All - of the timing is taken from the other end of the link. - </para> - <para> - When writing your own tables be aware that the driver internally - tracks register values. It may need to reload values. You should - therefore be sure to set registers 1-7, 9-11, 14 and 15 in all - configurations. Where the register settings depend on DMA selection - the driver will update the bits itself when you open or close. - Loading a new table with the interface open is not recommended. - </para> - <para> - There are three standard configurations supported by the core - code. In PIO mode the interface is programmed up to use - interrupt driven PIO. This places high demands on the host processor - to avoid latency. The driver is written to take account of latency - issues but it cannot avoid latencies caused by other drivers, - notably IDE in PIO mode. Because the drivers allocate buffers you - must also prevent MTU changes while the port is open. - </para> - <para> - Once the port is open it will call the rx_function of each channel - whenever a completed packet arrived. This is invoked from - interrupt context and passes you the channel and a network - buffer (struct sk_buff) holding the data. The data includes - the CRC bytes so most users will want to trim the last two - bytes before processing the data. This function is very timing - critical. When you wish to simply discard data the support - code provides the function <function>z8530_null_rx</function> - to discard the data. - </para> - <para> - To active PIO mode sending and receiving the <function> - z8530_sync_open</function> is called. This expects to be passed - the network device and the channel. Typically this is called from - your network device open callback. On a failure a non zero error - status is returned. The <function>z8530_sync_close</function> - function shuts down a PIO channel. This must be done before the - channel is opened again and before the driver shuts down - and unloads. - </para> - <para> - The ideal mode of operation is dual channel DMA mode. Here the - kernel driver will configure the board for DMA in both directions. - The driver also handles ISA DMA issues such as controller - programming and the memory range limit for you. This mode is - activated by calling the <function>z8530_sync_dma_open</function> - function. On failure a non zero error value is returned. - Once this mode is activated it can be shut down by calling the - <function>z8530_sync_dma_close</function>. You must call the close - function matching the open mode you used. - </para> - <para> - The final supported mode uses a single DMA channel to drive the - transmit side. As the Z85C30 has a larger FIFO on the receive - channel this tends to increase the maximum speed a little. - This is activated by calling the <function>z8530_sync_txdma_open - </function>. This returns a non zero error code on failure. The - <function>z8530_sync_txdma_close</function> function closes down - the Z8530 interface from this mode. - </para> - </chapter> - - <chapter id="Network_Layer_Functions"> - <title>Network Layer Functions</title> - <para> - The Z8530 layer provides functions to queue packets for - transmission. The driver internally buffers the frame currently - being transmitted and one further frame (in order to keep back - to back transmission running). Any further buffering is up to - the caller. - </para> - <para> - The function <function>z8530_queue_xmit</function> takes a network - buffer in sk_buff format and queues it for transmission. The - caller must provide the entire packet with the exception of the - bitstuffing and CRC. This is normally done by the caller via - the generic HDLC interface layer. It returns 0 if the buffer has been - queued and non zero values for queue full. If the function accepts - the buffer it becomes property of the Z8530 layer and the caller - should not free it. - </para> - <para> - The function <function>z8530_get_stats</function> returns a pointer - to an internally maintained per interface statistics block. This - provides most of the interface code needed to implement the network - layer get_stats callback. - </para> - </chapter> - - <chapter id="Porting_The_Z8530_Driver"> - <title>Porting The Z8530 Driver</title> - <para> - The Z8530 driver is written to be portable. In DMA mode it makes - assumptions about the use of ISA DMA. These are probably warranted - in most cases as the Z85230 in particular was designed to glue to PC - type machines. The PIO mode makes no real assumptions. - </para> - <para> - Should you need to retarget the Z8530 driver to another architecture - the only code that should need changing are the port I/O functions. - At the moment these assume PC I/O port accesses. This may not be - appropriate for all platforms. Replacing - <function>z8530_read_port</function> and <function>z8530_write_port - </function> is intended to be all that is required to port this - driver layer. - </para> - </chapter> - - <chapter id="bugs"> - <title>Known Bugs And Assumptions</title> - <para> - <variablelist> - <varlistentry><term>Interrupt Locking</term> - <listitem> - <para> - The locking in the driver is done via the global cli/sti lock. This - makes for relatively poor SMP performance. Switching this to use a - per device spin lock would probably materially improve performance. - </para> - </listitem></varlistentry> - - <varlistentry><term>Occasional Failures</term> - <listitem> - <para> - We have reports of occasional failures when run for very long - periods of time and the driver starts to receive junk frames. At - the moment the cause of this is not clear. - </para> - </listitem></varlistentry> - </variablelist> - - </para> - </chapter> - - <chapter id="pubfunctions"> - <title>Public Functions Provided</title> -!Edrivers/net/wan/z85230.c - </chapter> - - <chapter id="intfunctions"> - <title>Internal Functions</title> -!Idrivers/net/wan/z85230.c - </chapter> - -</book> diff --git a/Documentation/IPMI.txt b/Documentation/IPMI.txt index 6962cab997ef..aa77a25a0940 100644 --- a/Documentation/IPMI.txt +++ b/Documentation/IPMI.txt @@ -1,9 +1,8 @@ +===================== +The Linux IPMI Driver +===================== - The Linux IPMI Driver - --------------------- - Corey Minyard - <minyard@mvista.com> - <minyard@acm.org> +:Author: Corey Minyard <minyard@mvista.com> / <minyard@acm.org> The Intelligent Platform Management Interface, or IPMI, is a standard for controlling intelligent devices that monitor a system. @@ -141,7 +140,7 @@ Addressing ---------- The IPMI addressing works much like IP addresses, you have an overlay -to handle the different address types. The overlay is: +to handle the different address types. The overlay is:: struct ipmi_addr { @@ -153,7 +152,7 @@ to handle the different address types. The overlay is: The addr_type determines what the address really is. The driver currently understands two different types of addresses. -"System Interface" addresses are defined as: +"System Interface" addresses are defined as:: struct ipmi_system_interface_addr { @@ -166,7 +165,7 @@ straight to the BMC on the current card. The channel must be IPMI_BMC_CHANNEL. Messages that are destined to go out on the IPMB bus use the -IPMI_IPMB_ADDR_TYPE address type. The format is +IPMI_IPMB_ADDR_TYPE address type. The format is:: struct ipmi_ipmb_addr { @@ -184,16 +183,16 @@ spec. Messages -------- -Messages are defined as: +Messages are defined as:: -struct ipmi_msg -{ + struct ipmi_msg + { unsigned char netfn; unsigned char lun; unsigned char cmd; unsigned char *data; int data_len; -}; + }; The driver takes care of adding/stripping the header information. The data portion is just the data to be send (do NOT put addressing info @@ -208,7 +207,7 @@ block of data, even when receiving messages. Otherwise the driver will have no place to put the message. Messages coming up from the message handler in kernelland will come in -as: +as:: struct ipmi_recv_msg { @@ -246,6 +245,7 @@ and the user should not have to care what type of SMI is below them. Watching For Interfaces +^^^^^^^^^^^^^^^^^^^^^^^ When your code comes up, the IPMI driver may or may not have detected if IPMI devices exist. So you might have to defer your setup until @@ -256,6 +256,7 @@ and tell you when they come and go. Creating the User +^^^^^^^^^^^^^^^^^ To use the message handler, you must first create a user using ipmi_create_user. The interface number specifies which SMI you want @@ -272,6 +273,7 @@ closing the device automatically destroys the user. Messaging +^^^^^^^^^ To send a message from kernel-land, the ipmi_request_settime() call does pretty much all message handling. Most of the parameter are @@ -321,6 +323,7 @@ though, since it is tricky to manage your own buffers. Events and Incoming Commands +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The driver takes care of polling for IPMI events and receiving commands (commands are messages that are not responses, they are @@ -367,7 +370,7 @@ in the system. It discovers interfaces through a host of different methods, depending on the system. You can specify up to four interfaces on the module load line and -control some module parameters: +control some module parameters:: modprobe ipmi_si.o type=<type1>,<type2>.... ports=<port1>,<port2>... addrs=<addr1>,<addr2>... @@ -437,7 +440,7 @@ default is one. Setting to 0 is useful with the hotmod, but is obviously only useful for modules. When compiled into the kernel, the parameters can be specified on the -kernel command line as: +kernel command line as:: ipmi_si.type=<type1>,<type2>... ipmi_si.ports=<port1>,<port2>... ipmi_si.addrs=<addr1>,<addr2>... @@ -474,16 +477,22 @@ The driver supports a hot add and remove of interfaces. This way, interfaces can be added or removed after the kernel is up and running. This is done using /sys/modules/ipmi_si/parameters/hotmod, which is a write-only parameter. You write a string to this interface. The string -has the format: +has the format:: + <op1>[:op2[:op3...]] -The "op"s are: + +The "op"s are:: + add|remove,kcs|bt|smic,mem|i/o,<address>[,<opt1>[,<opt2>[,...]]] -You can specify more than one interface on the line. The "opt"s are: + +You can specify more than one interface on the line. The "opt"s are:: + rsp=<regspacing> rsi=<regsize> rsh=<regshift> irq=<irq> ipmb=<ipmb slave addr> + and these have the same meanings as discussed above. Note that you can also use this on the kernel command line for a more compact format for specifying an interface. Note that when removing an interface, @@ -496,7 +505,7 @@ The SMBus Driver (SSIF) The SMBus driver allows up to 4 SMBus devices to be configured in the system. By default, the driver will only register with something it finds in DMI or ACPI tables. You can change this -at module load time (for a module) with: +at module load time (for a module) with:: modprobe ipmi_ssif.o addr=<i2caddr1>[,<i2caddr2>[,...]] @@ -535,7 +544,7 @@ the smb_addr parameter unless you have DMI or ACPI data to tell the driver what to use. When compiled into the kernel, the addresses can be specified on the -kernel command line as: +kernel command line as:: ipmb_ssif.addr=<i2caddr1>[,<i2caddr2>[...]] ipmi_ssif.adapter=<adapter1>[,<adapter2>[...]] @@ -565,9 +574,9 @@ Some users need more detailed information about a device, like where the address came from or the raw base device for the IPMI interface. You can use the IPMI smi_watcher to catch the IPMI interfaces as they come or go, and to grab the information, you can use the function -ipmi_get_smi_info(), which returns the following structure: +ipmi_get_smi_info(), which returns the following structure:: -struct ipmi_smi_info { + struct ipmi_smi_info { enum ipmi_addr_src addr_src; struct device *dev; union { @@ -575,7 +584,7 @@ struct ipmi_smi_info { void *acpi_handle; } acpi_info; } addr_info; -}; + }; Currently special info for only for SI_ACPI address sources is returned. Others may be added as necessary. @@ -590,7 +599,7 @@ Watchdog A watchdog timer is provided that implements the Linux-standard watchdog timer interface. It has three module parameters that can be -used to control it: +used to control it:: modprobe ipmi_watchdog timeout=<t> pretimeout=<t> action=<action type> preaction=<preaction type> preop=<preop type> start_now=x @@ -635,7 +644,7 @@ watchdog device is closed. The default value of nowayout is true if the CONFIG_WATCHDOG_NOWAYOUT option is enabled, or false if not. When compiled into the kernel, the kernel command line is available -for configuring the watchdog: +for configuring the watchdog:: ipmi_watchdog.timeout=<t> ipmi_watchdog.pretimeout=<t> ipmi_watchdog.action=<action type> @@ -675,6 +684,7 @@ also get a bunch of OEM events holding the panic string. The field settings of the events are: + * Generator ID: 0x21 (kernel) * EvM Rev: 0x03 (this event is formatting in IPMI 1.0 format) * Sensor Type: 0x20 (OS critical stop sensor) @@ -683,18 +693,20 @@ The field settings of the events are: * Event Data 1: 0xa1 (Runtime stop in OEM bytes 2 and 3) * Event data 2: second byte of panic string * Event data 3: third byte of panic string + See the IPMI spec for the details of the event layout. This event is always sent to the local management controller. It will handle routing the message to the right place Other OEM events have the following format: -Record ID (bytes 0-1): Set by the SEL. -Record type (byte 2): 0xf0 (OEM non-timestamped) -byte 3: The slave address of the card saving the panic -byte 4: A sequence number (starting at zero) -The rest of the bytes (11 bytes) are the panic string. If the panic string -is longer than 11 bytes, multiple messages will be sent with increasing -sequence numbers. + +* Record ID (bytes 0-1): Set by the SEL. +* Record type (byte 2): 0xf0 (OEM non-timestamped) +* byte 3: The slave address of the card saving the panic +* byte 4: A sequence number (starting at zero) + The rest of the bytes (11 bytes) are the panic string. If the panic string + is longer than 11 bytes, multiple messages will be sent with increasing + sequence numbers. Because you cannot send OEM events using the standard interface, this function will attempt to find an SEL and add the events there. It diff --git a/Documentation/IRQ-affinity.txt b/Documentation/IRQ-affinity.txt index 01a675175a36..29da5000836a 100644 --- a/Documentation/IRQ-affinity.txt +++ b/Documentation/IRQ-affinity.txt @@ -1,8 +1,11 @@ +================ +SMP IRQ affinity +================ + ChangeLog: - Started by Ingo Molnar <mingo@redhat.com> - Update by Max Krasnyansky <maxk@qualcomm.com> + - Started by Ingo Molnar <mingo@redhat.com> + - Update by Max Krasnyansky <maxk@qualcomm.com> -SMP IRQ affinity /proc/irq/IRQ#/smp_affinity and /proc/irq/IRQ#/smp_affinity_list specify which target CPUs are permitted for a given IRQ source. It's a bitmask @@ -16,50 +19,52 @@ will be set to the default mask. It can then be changed as described above. Default mask is 0xffffffff. Here is an example of restricting IRQ44 (eth1) to CPU0-3 then restricting -it to CPU4-7 (this is an 8-CPU SMP box): +it to CPU4-7 (this is an 8-CPU SMP box):: -[root@moon 44]# cd /proc/irq/44 -[root@moon 44]# cat smp_affinity -ffffffff + [root@moon 44]# cd /proc/irq/44 + [root@moon 44]# cat smp_affinity + ffffffff -[root@moon 44]# echo 0f > smp_affinity -[root@moon 44]# cat smp_affinity -0000000f -[root@moon 44]# ping -f h -PING hell (195.4.7.3): 56 data bytes -... ---- hell ping statistics --- -6029 packets transmitted, 6027 packets received, 0% packet loss -round-trip min/avg/max = 0.1/0.1/0.4 ms -[root@moon 44]# cat /proc/interrupts | grep 'CPU\|44:' - CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 - 44: 1068 1785 1785 1783 0 0 0 0 IO-APIC-level eth1 + [root@moon 44]# echo 0f > smp_affinity + [root@moon 44]# cat smp_affinity + 0000000f + [root@moon 44]# ping -f h + PING hell (195.4.7.3): 56 data bytes + ... + --- hell ping statistics --- + 6029 packets transmitted, 6027 packets received, 0% packet loss + round-trip min/avg/max = 0.1/0.1/0.4 ms + [root@moon 44]# cat /proc/interrupts | grep 'CPU\|44:' + CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 + 44: 1068 1785 1785 1783 0 0 0 0 IO-APIC-level eth1 As can be seen from the line above IRQ44 was delivered only to the first four processors (0-3). Now lets restrict that IRQ to CPU(4-7). -[root@moon 44]# echo f0 > smp_affinity -[root@moon 44]# cat smp_affinity -000000f0 -[root@moon 44]# ping -f h -PING hell (195.4.7.3): 56 data bytes -.. ---- hell ping statistics --- -2779 packets transmitted, 2777 packets received, 0% packet loss -round-trip min/avg/max = 0.1/0.5/585.4 ms -[root@moon 44]# cat /proc/interrupts | 'CPU\|44:' - CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 - 44: 1068 1785 1785 1783 1784 1069 1070 1069 IO-APIC-level eth1 +:: + + [root@moon 44]# echo f0 > smp_affinity + [root@moon 44]# cat smp_affinity + 000000f0 + [root@moon 44]# ping -f h + PING hell (195.4.7.3): 56 data bytes + .. + --- hell ping statistics --- + 2779 packets transmitted, 2777 packets received, 0% packet loss + round-trip min/avg/max = 0.1/0.5/585.4 ms + [root@moon 44]# cat /proc/interrupts | 'CPU\|44:' + CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 + 44: 1068 1785 1785 1783 1784 1069 1070 1069 IO-APIC-level eth1 This time around IRQ44 was delivered only to the last four processors. i.e counters for the CPU0-3 did not change. -Here is an example of limiting that same irq (44) to cpus 1024 to 1031: +Here is an example of limiting that same irq (44) to cpus 1024 to 1031:: -[root@moon 44]# echo 1024-1031 > smp_affinity_list -[root@moon 44]# cat smp_affinity_list -1024-1031 + [root@moon 44]# echo 1024-1031 > smp_affinity_list + [root@moon 44]# cat smp_affinity_list + 1024-1031 Note that to do this with a bitmask would require 32 bitmasks of zero to follow the pertinent one. diff --git a/Documentation/IRQ-domain.txt b/Documentation/IRQ-domain.txt index 82001a25a14b..4a1cd7645d85 100644 --- a/Documentation/IRQ-domain.txt +++ b/Documentation/IRQ-domain.txt @@ -1,4 +1,6 @@ -irq_domain interrupt number mapping library +=============================================== +The irq_domain interrupt number mapping library +=============================================== The current design of the Linux kernel uses a single large number space where each separate IRQ source is assigned a different number. @@ -36,7 +38,9 @@ irq_domain also implements translation from an abstract irq_fwspec structure to hwirq numbers (Device Tree and ACPI GSI so far), and can be easily extended to support other IRQ topology data sources. -=== irq_domain usage === +irq_domain usage +================ + An interrupt controller driver creates and registers an irq_domain by calling one of the irq_domain_add_*() functions (each mapping method has a different allocator function, more on that later). The function @@ -62,15 +66,21 @@ If the driver has the Linux IRQ number or the irq_data pointer, and needs to know the associated hwirq number (such as in the irq_chip callbacks) then it can be directly obtained from irq_data->hwirq. -=== Types of irq_domain mappings === +Types of irq_domain mappings +============================ + There are several mechanisms available for reverse mapping from hwirq to Linux irq, and each mechanism uses a different allocation function. Which reverse map type should be used depends on the use case. Each of the reverse map types are described below: -==== Linear ==== -irq_domain_add_linear() -irq_domain_create_linear() +Linear +------ + +:: + + irq_domain_add_linear() + irq_domain_create_linear() The linear reverse map maintains a fixed size table indexed by the hwirq number. When a hwirq is mapped, an irq_desc is allocated for @@ -89,9 +99,13 @@ accepts a more general abstraction 'struct fwnode_handle'. The majority of drivers should use the linear map. -==== Tree ==== -irq_domain_add_tree() -irq_domain_create_tree() +Tree +---- + +:: + + irq_domain_add_tree() + irq_domain_create_tree() The irq_domain maintains a radix tree map from hwirq numbers to Linux IRQs. When an hwirq is mapped, an irq_desc is allocated and the @@ -109,8 +123,12 @@ accepts a more general abstraction 'struct fwnode_handle'. Very few drivers should need this mapping. -==== No Map ===- -irq_domain_add_nomap() +No Map +------ + +:: + + irq_domain_add_nomap() The No Map mapping is to be used when the hwirq number is programmable in the hardware. In this case it is best to program the @@ -121,10 +139,14 @@ Linux IRQ number into the hardware. Most drivers cannot use this mapping. -==== Legacy ==== -irq_domain_add_simple() -irq_domain_add_legacy() -irq_domain_add_legacy_isa() +Legacy +------ + +:: + + irq_domain_add_simple() + irq_domain_add_legacy() + irq_domain_add_legacy_isa() The Legacy mapping is a special case for drivers that already have a range of irq_descs allocated for the hwirqs. It is used when the @@ -163,14 +185,17 @@ that the driver using the simple domain call irq_create_mapping() before any irq_find_mapping() since the latter will actually work for the static IRQ assignment case. -==== Hierarchy IRQ domain ==== +Hierarchy IRQ domain +-------------------- + On some architectures, there may be multiple interrupt controllers involved in delivering an interrupt from the device to the target CPU. -Let's look at a typical interrupt delivering path on x86 platforms: +Let's look at a typical interrupt delivering path on x86 platforms:: -Device --> IOAPIC -> Interrupt remapping Controller -> Local APIC -> CPU + Device --> IOAPIC -> Interrupt remapping Controller -> Local APIC -> CPU There are three interrupt controllers involved: + 1) IOAPIC controller 2) Interrupt remapping controller 3) Local APIC controller @@ -180,7 +205,8 @@ hardware architecture, an irq_domain data structure is built for each interrupt controller and those irq_domains are organized into hierarchy. When building irq_domain hierarchy, the irq_domain near to the device is child and the irq_domain near to CPU is parent. So a hierarchy structure -as below will be built for the example above. +as below will be built for the example above:: + CPU Vector irq_domain (root irq_domain to manage CPU vectors) ^ | @@ -190,6 +216,7 @@ as below will be built for the example above. IOAPIC irq_domain (manage IOAPIC delivery entries/pins) There are four major interfaces to use hierarchy irq_domain: + 1) irq_domain_alloc_irqs(): allocate IRQ descriptors and interrupt controller related resources to deliver these interrupts. 2) irq_domain_free_irqs(): free IRQ descriptors and interrupt controller @@ -199,7 +226,8 @@ There are four major interfaces to use hierarchy irq_domain: 4) irq_domain_deactivate_irq(): deactivate interrupt controller hardware to stop delivering the interrupt. -Following changes are needed to support hierarchy irq_domain. +Following changes are needed to support hierarchy irq_domain: + 1) a new field 'parent' is added to struct irq_domain; it's used to maintain irq_domain hierarchy information. 2) a new field 'parent_data' is added to struct irq_data; it's used to @@ -223,6 +251,7 @@ software architecture. For an interrupt controller driver to support hierarchy irq_domain, it needs to: + 1) Implement irq_domain_ops.alloc and irq_domain_ops.free 2) Optionally implement irq_domain_ops.activate and irq_domain_ops.deactivate. @@ -231,5 +260,42 @@ needs to: 4) No need to implement irq_domain_ops.map and irq_domain_ops.unmap, they are unused with hierarchy irq_domain. -Hierarchy irq_domain may also be used to support other architectures, -such as ARM, ARM64 etc. +Hierarchy irq_domain is in no way x86 specific, and is heavily used to +support other architectures, such as ARM, ARM64 etc. + +=== Debugging === + +If you switch on CONFIG_IRQ_DOMAIN_DEBUG (which depends on +CONFIG_IRQ_DOMAIN and CONFIG_DEBUG_FS), you will find a new file in +your debugfs mount point, called irq_domain_mapping. This file +contains a live snapshot of all the IRQ domains in the system: + + name mapped linear-max direct-max devtree-node + pl061 8 8 0 /smb/gpio@e0080000 + pl061 8 8 0 /smb/gpio@e1050000 + pMSI 0 0 0 /interrupt-controller@e1101000/v2m@e0080000 + MSI 37 0 0 /interrupt-controller@e1101000/v2m@e0080000 + GICv2m 37 0 0 /interrupt-controller@e1101000/v2m@e0080000 + GICv2 448 448 0 /interrupt-controller@e1101000 + +it also iterates over the interrupts to display their mapping in the +domains, and makes the domain stacking visible: + + +irq hwirq chip name chip data active type domain + 1 0x00019 GICv2 0xffff00000916bfd8 * LINEAR GICv2 + 2 0x0001d GICv2 0xffff00000916bfd8 LINEAR GICv2 + 3 0x0001e GICv2 0xffff00000916bfd8 * LINEAR GICv2 + 4 0x0001b GICv2 0xffff00000916bfd8 * LINEAR GICv2 + 5 0x0001a GICv2 0xffff00000916bfd8 LINEAR GICv2 +[...] + 96 0x81808 MSI 0x (null) RADIX MSI + 96+ 0x00063 GICv2m 0xffff8003ee116980 RADIX GICv2m + 96+ 0x00063 GICv2 0xffff00000916bfd8 LINEAR GICv2 + 97 0x08800 MSI 0x (null) * RADIX MSI + 97+ 0x00064 GICv2m 0xffff8003ee116980 * RADIX GICv2m + 97+ 0x00064 GICv2 0xffff00000916bfd8 * LINEAR GICv2 + +Here, interrupts 1-5 are only using a single domain, while 96 and 97 +are build out of a stack of three domain, each level performing a +particular function. diff --git a/Documentation/IRQ.txt b/Documentation/IRQ.txt index 1011e7175021..4273806a606b 100644 --- a/Documentation/IRQ.txt +++ b/Documentation/IRQ.txt @@ -1,4 +1,6 @@ +=============== What is an IRQ? +=============== An IRQ is an interrupt request from a device. Currently they can come in over a pin, or over a packet. diff --git a/Documentation/Intel-IOMMU.txt b/Documentation/Intel-IOMMU.txt index 49585b6e1ea2..9dae6b47e398 100644 --- a/Documentation/Intel-IOMMU.txt +++ b/Documentation/Intel-IOMMU.txt @@ -1,3 +1,4 @@ +=================== Linux IOMMU Support =================== @@ -9,11 +10,11 @@ This guide gives a quick cheat sheet for some basic understanding. Some Keywords -DMAR - DMA remapping -DRHD - DMA Remapping Hardware Unit Definition -RMRR - Reserved memory Region Reporting Structure -ZLR - Zero length reads from PCI devices -IOVA - IO Virtual address. +- DMAR - DMA remapping +- DRHD - DMA Remapping Hardware Unit Definition +- RMRR - Reserved memory Region Reporting Structure +- ZLR - Zero length reads from PCI devices +- IOVA - IO Virtual address. Basic stuff ----------- @@ -33,7 +34,7 @@ devices that need to access these regions. OS is expected to setup unity mappings for these regions for these devices to access these regions. How is IOVA generated? ---------------------- +---------------------- Well behaved drivers call pci_map_*() calls before sending command to device that needs to perform DMA. Once DMA is completed and mapping is no longer @@ -82,14 +83,14 @@ in ACPI. ACPI: DMAR (v001 A M I OEMDMAR 0x00000001 MSFT 0x00000097) @ 0x000000007f5b5ef0 When DMAR is being processed and initialized by ACPI, prints DMAR locations -and any RMRR's processed. +and any RMRR's processed:: -ACPI DMAR:Host address width 36 -ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000 -ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed91000 -ACPI DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000 -ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff -ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff + ACPI DMAR:Host address width 36 + ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000 + ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed91000 + ACPI DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000 + ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff + ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff When DMAR is enabled for use, you will notice.. @@ -98,10 +99,12 @@ PCI-DMA: Using DMAR IOMMU Fault reporting --------------- -DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000 -DMAR:[fault reason 05] PTE Write access is not set -DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000 -DMAR:[fault reason 05] PTE Write access is not set +:: + + DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000 + DMAR:[fault reason 05] PTE Write access is not set + DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000 + DMAR:[fault reason 05] PTE Write access is not set TBD ---- diff --git a/Documentation/Makefile b/Documentation/Makefile index c2a469112c37..a42320385df3 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -1 +1,126 @@ +# -*- makefile -*- +# Makefile for Sphinx documentation +# + subdir-y := + +# You can set these variables from the command line. +SPHINXBUILD = sphinx-build +SPHINXOPTS = +SPHINXDIRS = . +_SPHINXDIRS = $(patsubst $(srctree)/Documentation/%/conf.py,%,$(wildcard $(srctree)/Documentation/*/conf.py)) +SPHINX_CONF = conf.py +PAPER = +BUILDDIR = $(obj)/output +PDFLATEX = xelatex +LATEXOPTS = -interaction=batchmode + +# User-friendly check for sphinx-build +HAVE_SPHINX := $(shell if which $(SPHINXBUILD) >/dev/null 2>&1; then echo 1; else echo 0; fi) + +ifeq ($(HAVE_SPHINX),0) + +.DEFAULT: + $(warning The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed and in PATH, or set the SPHINXBUILD make variable to point to the full path of the '$(SPHINXBUILD)' executable.) + @echo " SKIP Sphinx $@ target." + +else # HAVE_SPHINX + +# User-friendly check for pdflatex +HAVE_PDFLATEX := $(shell if which $(PDFLATEX) >/dev/null 2>&1; then echo 1; else echo 0; fi) + +# Internal variables. +PAPEROPT_a4 = -D latex_paper_size=a4 +PAPEROPT_letter = -D latex_paper_size=letter +KERNELDOC = $(srctree)/scripts/kernel-doc +KERNELDOC_CONF = -D kerneldoc_srctree=$(srctree) -D kerneldoc_bin=$(KERNELDOC) +ALLSPHINXOPTS = $(KERNELDOC_CONF) $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) +# the i18n builder cannot share the environment and doctrees with the others +I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . + +# commands; the 'cmd' from scripts/Kbuild.include is not *loopable* +loop_cmd = $(echo-cmd) $(cmd_$(1)) || exit; + +# $2 sphinx builder e.g. "html" +# $3 name of the build subfolder / e.g. "media", used as: +# * dest folder relative to $(BUILDDIR) and +# * cache folder relative to $(BUILDDIR)/.doctrees +# $4 dest subfolder e.g. "man" for man pages at media/man +# $5 reST source folder relative to $(srctree)/$(src), +# e.g. "media" for the linux-tv book-set at ./Documentation/media + +quiet_cmd_sphinx = SPHINX $@ --> file://$(abspath $(BUILDDIR)/$3/$4) + cmd_sphinx = $(MAKE) BUILDDIR=$(abspath $(BUILDDIR)) $(build)=Documentation/media $2 && \ + PYTHONDONTWRITEBYTECODE=1 \ + BUILDDIR=$(abspath $(BUILDDIR)) SPHINX_CONF=$(abspath $(srctree)/$(src)/$5/$(SPHINX_CONF)) \ + $(SPHINXBUILD) \ + -b $2 \ + -c $(abspath $(srctree)/$(src)) \ + -d $(abspath $(BUILDDIR)/.doctrees/$3) \ + -D version=$(KERNELVERSION) -D release=$(KERNELRELEASE) \ + $(ALLSPHINXOPTS) \ + $(abspath $(srctree)/$(src)/$5) \ + $(abspath $(BUILDDIR)/$3/$4) + +htmldocs: + @+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,html,$(var),,$(var))) + +linkcheckdocs: + @$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,linkcheck,$(var),,$(var))) + +latexdocs: + @+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,latex,$(var),latex,$(var))) + +ifeq ($(HAVE_PDFLATEX),0) + +pdfdocs: + $(warning The '$(PDFLATEX)' command was not found. Make sure you have it installed and in PATH to produce PDF output.) + @echo " SKIP Sphinx $@ target." + +else # HAVE_PDFLATEX + +pdfdocs: latexdocs + $(foreach var,$(SPHINXDIRS), $(MAKE) PDFLATEX=$(PDFLATEX) LATEXOPTS="$(LATEXOPTS)" -C $(BUILDDIR)/$(var)/latex || exit;) + +endif # HAVE_PDFLATEX + +epubdocs: + @+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,epub,$(var),epub,$(var))) + +xmldocs: + @+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,xml,$(var),xml,$(var))) + +endif # HAVE_SPHINX + +# The following targets are independent of HAVE_SPHINX, and the rules should +# work or silently pass without Sphinx. + +# no-ops for the Sphinx toolchain +sgmldocs: + @: +psdocs: + @: +mandocs: + @: +installmandocs: + @: + +cleandocs: + $(Q)rm -rf $(BUILDDIR) + $(Q)$(MAKE) BUILDDIR=$(abspath $(BUILDDIR)) $(build)=Documentation/media clean + +dochelp: + @echo ' Linux kernel internal documentation in different formats from ReST:' + @echo ' htmldocs - HTML' + @echo ' latexdocs - LaTeX' + @echo ' pdfdocs - PDF' + @echo ' epubdocs - EPUB' + @echo ' xmldocs - XML' + @echo ' linkcheckdocs - check for broken external links (will connect to external hosts)' + @echo ' cleandocs - clean all generated files' + @echo + @echo ' make SPHINXDIRS="s1 s2" [target] Generate only docs of folder s1, s2' + @echo ' valid values for SPHINXDIRS are: $(_SPHINXDIRS)' + @echo + @echo ' make SPHINX_CONF={conf-file} [target] use *additional* sphinx-build' + @echo ' configuration. This is e.g. useful to build with nit-picking config.' diff --git a/Documentation/Makefile.sphinx b/Documentation/Makefile.sphinx deleted file mode 100644 index bcf529f6cf9b..000000000000 --- a/Documentation/Makefile.sphinx +++ /dev/null @@ -1,130 +0,0 @@ -# -*- makefile -*- -# Makefile for Sphinx documentation -# - -# You can set these variables from the command line. -SPHINXBUILD = sphinx-build -SPHINXOPTS = -SPHINXDIRS = . -_SPHINXDIRS = $(patsubst $(srctree)/Documentation/%/conf.py,%,$(wildcard $(srctree)/Documentation/*/conf.py)) -SPHINX_CONF = conf.py -PAPER = -BUILDDIR = $(obj)/output -PDFLATEX = xelatex -LATEXOPTS = -interaction=batchmode - -# User-friendly check for sphinx-build -HAVE_SPHINX := $(shell if which $(SPHINXBUILD) >/dev/null 2>&1; then echo 1; else echo 0; fi) - -ifeq ($(HAVE_SPHINX),0) - -.DEFAULT: - $(warning The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed and in PATH, or set the SPHINXBUILD make variable to point to the full path of the '$(SPHINXBUILD)' executable.) - @echo " SKIP Sphinx $@ target." - -else ifneq ($(DOCBOOKS),) - -# Skip Sphinx build if the user explicitly requested DOCBOOKS. -.DEFAULT: - @echo " SKIP Sphinx $@ target (DOCBOOKS specified)." - -else # HAVE_SPHINX - -# User-friendly check for pdflatex -HAVE_PDFLATEX := $(shell if which $(PDFLATEX) >/dev/null 2>&1; then echo 1; else echo 0; fi) - -# Internal variables. -PAPEROPT_a4 = -D latex_paper_size=a4 -PAPEROPT_letter = -D latex_paper_size=letter -KERNELDOC = $(srctree)/scripts/kernel-doc -KERNELDOC_CONF = -D kerneldoc_srctree=$(srctree) -D kerneldoc_bin=$(KERNELDOC) -ALLSPHINXOPTS = $(KERNELDOC_CONF) $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) -# the i18n builder cannot share the environment and doctrees with the others -I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . - -# commands; the 'cmd' from scripts/Kbuild.include is not *loopable* -loop_cmd = $(echo-cmd) $(cmd_$(1)) || exit; - -# $2 sphinx builder e.g. "html" -# $3 name of the build subfolder / e.g. "media", used as: -# * dest folder relative to $(BUILDDIR) and -# * cache folder relative to $(BUILDDIR)/.doctrees -# $4 dest subfolder e.g. "man" for man pages at media/man -# $5 reST source folder relative to $(srctree)/$(src), -# e.g. "media" for the linux-tv book-set at ./Documentation/media - -quiet_cmd_sphinx = SPHINX $@ --> file://$(abspath $(BUILDDIR)/$3/$4) - cmd_sphinx = $(MAKE) BUILDDIR=$(abspath $(BUILDDIR)) $(build)=Documentation/media $2 && \ - PYTHONDONTWRITEBYTECODE=1 \ - BUILDDIR=$(abspath $(BUILDDIR)) SPHINX_CONF=$(abspath $(srctree)/$(src)/$5/$(SPHINX_CONF)) \ - $(SPHINXBUILD) \ - -b $2 \ - -c $(abspath $(srctree)/$(src)) \ - -d $(abspath $(BUILDDIR)/.doctrees/$3) \ - -D version=$(KERNELVERSION) -D release=$(KERNELRELEASE) \ - $(ALLSPHINXOPTS) \ - $(abspath $(srctree)/$(src)/$5) \ - $(abspath $(BUILDDIR)/$3/$4) - -htmldocs: - @+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,html,$(var),,$(var))) - -linkcheckdocs: - @$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,linkcheck,$(var),,$(var))) - -latexdocs: - @+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,latex,$(var),latex,$(var))) - -ifeq ($(HAVE_PDFLATEX),0) - -pdfdocs: - $(warning The '$(PDFLATEX)' command was not found. Make sure you have it installed and in PATH to produce PDF output.) - @echo " SKIP Sphinx $@ target." - -else # HAVE_PDFLATEX - -pdfdocs: latexdocs - $(foreach var,$(SPHINXDIRS), $(MAKE) PDFLATEX=$(PDFLATEX) LATEXOPTS="$(LATEXOPTS)" -C $(BUILDDIR)/$(var)/latex || exit;) - -endif # HAVE_PDFLATEX - -epubdocs: - @+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,epub,$(var),epub,$(var))) - -xmldocs: - @+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,xml,$(var),xml,$(var))) - -endif # HAVE_SPHINX - -# The following targets are independent of HAVE_SPHINX, and the rules should -# work or silently pass without Sphinx. - -# no-ops for the Sphinx toolchain -sgmldocs: - @: -psdocs: - @: -mandocs: - @: -installmandocs: - @: - -cleandocs: - $(Q)rm -rf $(BUILDDIR) - $(Q)$(MAKE) BUILDDIR=$(abspath $(BUILDDIR)) $(build)=Documentation/media clean - -dochelp: - @echo ' Linux kernel internal documentation in different formats (Sphinx):' - @echo ' htmldocs - HTML' - @echo ' latexdocs - LaTeX' - @echo ' pdfdocs - PDF' - @echo ' epubdocs - EPUB' - @echo ' xmldocs - XML' - @echo ' linkcheckdocs - check for broken external links (will connect to external hosts)' - @echo ' cleandocs - clean all generated files' - @echo - @echo ' make SPHINXDIRS="s1 s2" [target] Generate only docs of folder s1, s2' - @echo ' valid values for SPHINXDIRS are: $(_SPHINXDIRS)' - @echo - @echo ' make SPHINX_CONF={conf-file} [target] use *additional* sphinx-build' - @echo ' configuration. This is e.g. useful to build with nit-picking config.' diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt index 1e37138027a3..618e13d5e276 100644 --- a/Documentation/PCI/MSI-HOWTO.txt +++ b/Documentation/PCI/MSI-HOWTO.txt @@ -186,7 +186,7 @@ must disable interrupts while the lock is held. If the device sends a different interrupt, the driver will deadlock trying to recursively acquire the spinlock. Such deadlocks can be avoided by using spin_lock_irqsave() or spin_lock_irq() which disable local interrupts -and acquire the lock (see Documentation/DocBook/kernel-locking). +and acquire the lock (see Documentation/kernel-hacking/locking.rst). 4.5 How to tell whether MSI/MSI-X is enabled on a device diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX index 1672573b037a..f46980c060aa 100644 --- a/Documentation/RCU/00-INDEX +++ b/Documentation/RCU/00-INDEX @@ -28,8 +28,6 @@ stallwarn.txt - RCU CPU stall warnings (module parameter rcu_cpu_stall_suppress) torture.txt - RCU Torture Test Operation (CONFIG_RCU_TORTURE_TEST) -trace.txt - - CONFIG_RCU_TRACE debugfs files and formats UP.txt - RCU on Uniprocessor Systems whatisRCU.txt diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html index f60adf112663..95b30fa25d56 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.html +++ b/Documentation/RCU/Design/Requirements/Requirements.html @@ -559,9 +559,7 @@ The <tt>rcu_access_pointer()</tt> on line 6 is similar to For <tt>remove_gp_synchronous()</tt>, as long as all modifications to <tt>gp</tt> are carried out while holding <tt>gp_lock</tt>, the above optimizations are harmless. - However, - with <tt>CONFIG_SPARSE_RCU_POINTER=y</tt>, - <tt>sparse</tt> will complain if you + However, <tt>sparse</tt> will complain if you define <tt>gp</tt> with <tt>__rcu</tt> and then access it without using either <tt>rcu_access_pointer()</tt> or <tt>rcu_dereference()</tt>. @@ -1849,7 +1847,8 @@ mass storage, or user patience, whichever comes first. If the nesting is not visible to the compiler, as is the case with mutually recursive functions each in its own translation unit, stack overflow will result. -If the nesting takes the form of loops, either the control variable +If the nesting takes the form of loops, perhaps in the guise of tail +recursion, either the control variable will overflow or (in the Linux kernel) you will get an RCU CPU stall warning. Nevertheless, this class of RCU implementations is one of the most composable constructs in existence. @@ -1977,9 +1976,8 @@ guard against mishaps and misuse: and <tt>rcu_dereference()</tt>, perhaps (incorrectly) substituting a simple assignment. To catch this sort of error, a given RCU-protected pointer may be - tagged with <tt>__rcu</tt>, after which running sparse - with <tt>CONFIG_SPARSE_RCU_POINTER=y</tt> will complain - about simple-assignment accesses to that pointer. + tagged with <tt>__rcu</tt>, after which sparse + will complain about simple-assignment accesses to that pointer. Arnd Bergmann made me aware of this requirement, and also supplied the needed <a href="https://lwn.net/Articles/376011/">patch series</a>. @@ -2036,7 +2034,7 @@ guard against mishaps and misuse: some other synchronization mechanism, for example, reference counting. <li> In kernels built with <tt>CONFIG_RCU_TRACE=y</tt>, RCU-related - information is provided via both debugfs and event tracing. + information is provided via event tracing. <li> Open-coded use of <tt>rcu_assign_pointer()</tt> and <tt>rcu_dereference()</tt> to create typical linked data structures can be surprisingly error-prone. @@ -2519,11 +2517,7 @@ It is similarly socially unacceptable to interrupt an <tt>nohz_full</tt> CPU running in userspace. RCU must therefore track <tt>nohz_full</tt> userspace execution. -And in -<a href="https://lwn.net/Articles/558284/"><tt>CONFIG_NO_HZ_FULL_SYSIDLE=y</tt></a> -kernels, RCU must separately track idle CPUs on the one hand and -CPUs that are either idle or executing in userspace on the other. -In both cases, RCU must be able to sample state at two points in +RCU must therefore be able to sample state at two points in time, and be able to determine whether or not some other CPU spent any time idle and/or executing in userspace. @@ -2936,6 +2930,20 @@ to whether or not a CPU is online, which means that <tt>srcu_barrier()</tt> need not exclude CPU-hotplug operations. <p> +SRCU also differs from other RCU flavors in that SRCU's expedited and +non-expedited grace periods are implemented by the same mechanism. +This means that in the current SRCU implementation, expediting a +future grace period has the side effect of expediting all prior +grace periods that have not yet completed. +(But please note that this is a property of the current implementation, +not necessarily of future implementations.) +In addition, if SRCU has been idle for longer than the interval +specified by the <tt>srcutree.exp_holdoff</tt> kernel boot parameter +(25 microseconds by default), +and if a <tt>synchronize_srcu()</tt> invocation ends this idle period, +that invocation will be automatically expedited. + +<p> As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating a locking bottleneck present in prior kernel versions. Although this will allow users to put much heavier stress on diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt index 877947130ebe..6beda556faf3 100644 --- a/Documentation/RCU/checklist.txt +++ b/Documentation/RCU/checklist.txt @@ -413,11 +413,11 @@ over a rather long period of time, but improvements are always welcome! read-side critical sections. It is the responsibility of the RCU update-side primitives to deal with this. -17. Use CONFIG_PROVE_RCU, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and the - __rcu sparse checks (enabled by CONFIG_SPARSE_RCU_POINTER) to - validate your RCU code. These can help find problems as follows: +17. Use CONFIG_PROVE_LOCKING, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and the + __rcu sparse checks to validate your RCU code. These can help + find problems as follows: - CONFIG_PROVE_RCU: check that accesses to RCU-protected data + CONFIG_PROVE_LOCKING: check that accesses to RCU-protected data structures are carried out under the proper RCU read-side critical section, while holding the right combination of locks, or whatever other conditions diff --git a/Documentation/RCU/trace.txt b/Documentation/RCU/trace.txt deleted file mode 100644 index 6549012033f9..000000000000 --- a/Documentation/RCU/trace.txt +++ /dev/null @@ -1,535 +0,0 @@ -CONFIG_RCU_TRACE debugfs Files and Formats - - -The rcutree and rcutiny implementations of RCU provide debugfs trace -output that summarizes counters and state. This information is useful for -debugging RCU itself, and can sometimes also help to debug abuses of RCU. -The following sections describe the debugfs files and formats, first -for rcutree and next for rcutiny. - - -CONFIG_TREE_RCU and CONFIG_PREEMPT_RCU debugfs Files and Formats - -These implementations of RCU provide several debugfs directories under the -top-level directory "rcu": - -rcu/rcu_bh -rcu/rcu_preempt -rcu/rcu_sched - -Each directory contains files for the corresponding flavor of RCU. -Note that rcu/rcu_preempt is only present for CONFIG_PREEMPT_RCU. -For CONFIG_TREE_RCU, the RCU flavor maps onto the RCU-sched flavor, -so that activity for both appears in rcu/rcu_sched. - -In addition, the following file appears in the top-level directory: -rcu/rcutorture. This file displays rcutorture test progress. The output -of "cat rcu/rcutorture" looks as follows: - -rcutorture test sequence: 0 (test in progress) -rcutorture update version number: 615 - -The first line shows the number of rcutorture tests that have completed -since boot. If a test is currently running, the "(test in progress)" -string will appear as shown above. The second line shows the number of -update cycles that the current test has started, or zero if there is -no test in progress. - - -Within each flavor directory (rcu/rcu_bh, rcu/rcu_sched, and possibly -also rcu/rcu_preempt) the following files will be present: - -rcudata: - Displays fields in struct rcu_data. -rcuexp: - Displays statistics for expedited grace periods. -rcugp: - Displays grace-period counters. -rcuhier: - Displays the struct rcu_node hierarchy. -rcu_pending: - Displays counts of the reasons rcu_pending() decided that RCU had - work to do. -rcuboost: - Displays RCU boosting statistics. Only present if - CONFIG_RCU_BOOST=y. - -The output of "cat rcu/rcu_preempt/rcudata" looks as follows: - - 0!c=30455 g=30456 cnq=1/0:1 dt=126535/140000000000000/0 df=2002 of=4 ql=0/0 qs=N... b=10 ci=74572 nci=0 co=1131 ca=716 - 1!c=30719 g=30720 cnq=1/0:0 dt=132007/140000000000000/0 df=1874 of=10 ql=0/0 qs=N... b=10 ci=123209 nci=0 co=685 ca=982 - 2!c=30150 g=30151 cnq=1/1:1 dt=138537/140000000000000/0 df=1707 of=8 ql=0/0 qs=N... b=10 ci=80132 nci=0 co=1328 ca=1458 - 3 c=31249 g=31250 cnq=1/1:0 dt=107255/140000000000000/0 df=1749 of=6 ql=0/450 qs=NRW. b=10 ci=151700 nci=0 co=509 ca=622 - 4!c=29502 g=29503 cnq=1/0:1 dt=83647/140000000000000/0 df=965 of=5 ql=0/0 qs=N... b=10 ci=65643 nci=0 co=1373 ca=1521 - 5 c=31201 g=31202 cnq=1/0:1 dt=70422/0/0 df=535 of=7 ql=0/0 qs=.... b=10 ci=58500 nci=0 co=764 ca=698 - 6!c=30253 g=30254 cnq=1/0:1 dt=95363/140000000000000/0 df=780 of=5 ql=0/0 qs=N... b=10 ci=100607 nci=0 co=1414 ca=1353 - 7 c=31178 g=31178 cnq=1/0:0 dt=91536/0/0 df=547 of=4 ql=0/0 qs=.... b=10 ci=109819 nci=0 co=1115 ca=969 - -This file has one line per CPU, or eight for this 8-CPU system. -The fields are as follows: - -o The number at the beginning of each line is the CPU number. - CPUs numbers followed by an exclamation mark are offline, - but have been online at least once since boot. There will be - no output for CPUs that have never been online, which can be - a good thing in the surprisingly common case where NR_CPUS is - substantially larger than the number of actual CPUs. - -o "c" is the count of grace periods that this CPU believes have - completed. Offlined CPUs and CPUs in dynticks idle mode may lag - quite a ways behind, for example, CPU 4 under "rcu_sched" above, - which has been offline through 16 RCU grace periods. It is not - unusual to see offline CPUs lagging by thousands of grace periods. - Note that although the grace-period number is an unsigned long, - it is printed out as a signed long to allow more human-friendly - representation near boot time. - -o "g" is the count of grace periods that this CPU believes have - started. Again, offlined CPUs and CPUs in dynticks idle mode - may lag behind. If the "c" and "g" values are equal, this CPU - has already reported a quiescent state for the last RCU grace - period that it is aware of, otherwise, the CPU believes that it - owes RCU a quiescent state. - -o "pq" indicates that this CPU has passed through a quiescent state - for the current grace period. It is possible for "pq" to be - "1" and "c" different than "g", which indicates that although - the CPU has passed through a quiescent state, either (1) this - CPU has not yet reported that fact, (2) some other CPU has not - yet reported for this grace period, or (3) both. - -o "qp" indicates that RCU still expects a quiescent state from - this CPU. Offlined CPUs and CPUs in dyntick idle mode might - well have qp=1, which is OK: RCU is still ignoring them. - -o "dt" is the current value of the dyntick counter that is incremented - when entering or leaving idle, either due to a context switch or - due to an interrupt. This number is even if the CPU is in idle - from RCU's viewpoint and odd otherwise. The number after the - first "/" is the interrupt nesting depth when in idle state, - or a large number added to the interrupt-nesting depth when - running a non-idle task. Some architectures do not accurately - count interrupt nesting when running in non-idle kernel context, - which can result in interesting anomalies such as negative - interrupt-nesting levels. The number after the second "/" - is the NMI nesting depth. - -o "df" is the number of times that some other CPU has forced a - quiescent state on behalf of this CPU due to this CPU being in - idle state. - -o "of" is the number of times that some other CPU has forced a - quiescent state on behalf of this CPU due to this CPU being - offline. In a perfect world, this might never happen, but it - turns out that offlining and onlining a CPU can take several grace - periods, and so there is likely to be an extended period of time - when RCU believes that the CPU is online when it really is not. - Please note that erring in the other direction (RCU believing a - CPU is offline when it is really alive and kicking) is a fatal - error, so it makes sense to err conservatively. - -o "ql" is the number of RCU callbacks currently residing on - this CPU. The first number is the number of "lazy" callbacks - that are known to RCU to only be freeing memory, and the number - after the "/" is the total number of callbacks, lazy or not. - These counters count callbacks regardless of what phase of - grace-period processing that they are in (new, waiting for - grace period to start, waiting for grace period to end, ready - to invoke). - -o "qs" gives an indication of the state of the callback queue - with four characters: - - "N" Indicates that there are callbacks queued that are not - ready to be handled by the next grace period, and thus - will be handled by the grace period following the next - one. - - "R" Indicates that there are callbacks queued that are - ready to be handled by the next grace period. - - "W" Indicates that there are callbacks queued that are - waiting on the current grace period. - - "D" Indicates that there are callbacks queued that have - already been handled by a prior grace period, and are - thus waiting to be invoked. Note that callbacks in - the process of being invoked are not counted here. - Callbacks in the process of being invoked are those - that have been removed from the rcu_data structures - queues by rcu_do_batch(), but which have not yet been - invoked. - - If there are no callbacks in a given one of the above states, - the corresponding character is replaced by ".". - -o "b" is the batch limit for this CPU. If more than this number - of RCU callbacks is ready to invoke, then the remainder will - be deferred. - -o "ci" is the number of RCU callbacks that have been invoked for - this CPU. Note that ci+nci+ql is the number of callbacks that have - been registered in absence of CPU-hotplug activity. - -o "nci" is the number of RCU callbacks that have been offloaded from - this CPU. This will always be zero unless the kernel was built - with CONFIG_RCU_NOCB_CPU=y and the "rcu_nocbs=" kernel boot - parameter was specified. - -o "co" is the number of RCU callbacks that have been orphaned due to - this CPU going offline. These orphaned callbacks have been moved - to an arbitrarily chosen online CPU. - -o "ca" is the number of RCU callbacks that have been adopted by this - CPU due to other CPUs going offline. Note that ci+co-ca+ql is - the number of RCU callbacks registered on this CPU. - - -Kernels compiled with CONFIG_RCU_BOOST=y display the following from -/debug/rcu/rcu_preempt/rcudata: - - 0!c=12865 g=12866 cnq=1/0:1 dt=83113/140000000000000/0 df=288 of=11 ql=0/0 qs=N... kt=0/O ktl=944 b=10 ci=60709 nci=0 co=748 ca=871 - 1 c=14407 g=14408 cnq=1/0:0 dt=100679/140000000000000/0 df=378 of=7 ql=0/119 qs=NRW. kt=0/W ktl=9b6 b=10 ci=109740 nci=0 co=589 ca=485 - 2 c=14407 g=14408 cnq=1/0:0 dt=105486/0/0 df=90 of=9 ql=0/89 qs=NRW. kt=0/W ktl=c0c b=10 ci=83113 nci=0 co=533 ca=490 - 3 c=14407 g=14408 cnq=1/0:0 dt=107138/0/0 df=142 of=8 ql=0/188 qs=NRW. kt=0/W ktl=b96 b=10 ci=121114 nci=0 co=426 ca=290 - 4 c=14405 g=14406 cnq=1/0:1 dt=50238/0/0 df=706 of=7 ql=0/0 qs=.... kt=0/W ktl=812 b=10 ci=34929 nci=0 co=643 ca=114 - 5!c=14168 g=14169 cnq=1/0:0 dt=45465/140000000000000/0 df=161 of=11 ql=0/0 qs=N... kt=0/O ktl=b4d b=10 ci=47712 nci=0 co=677 ca=722 - 6 c=14404 g=14405 cnq=1/0:0 dt=59454/0/0 df=94 of=6 ql=0/0 qs=.... kt=0/W ktl=e57 b=10 ci=55597 nci=0 co=701 ca=811 - 7 c=14407 g=14408 cnq=1/0:1 dt=68850/0/0 df=31 of=8 ql=0/0 qs=.... kt=0/W ktl=14bd b=10 ci=77475 nci=0 co=508 ca=1042 - -This is similar to the output discussed above, but contains the following -additional fields: - -o "kt" is the per-CPU kernel-thread state. The digit preceding - the first slash is zero if there is no work pending and 1 - otherwise. The character between the first pair of slashes is - as follows: - - "S" The kernel thread is stopped, in other words, all - CPUs corresponding to this rcu_node structure are - offline. - - "R" The kernel thread is running. - - "W" The kernel thread is waiting because there is no work - for it to do. - - "O" The kernel thread is waiting because it has been - forced off of its designated CPU or because its - ->cpus_allowed mask permits it to run on other than - its designated CPU. - - "Y" The kernel thread is yielding to avoid hogging CPU. - - "?" Unknown value, indicates a bug. - - The number after the final slash is the CPU that the kthread - is actually running on. - - This field is displayed only for CONFIG_RCU_BOOST kernels. - -o "ktl" is the low-order 16 bits (in hexadecimal) of the count of - the number of times that this CPU's per-CPU kthread has gone - through its loop servicing invoke_rcu_cpu_kthread() requests. - - This field is displayed only for CONFIG_RCU_BOOST kernels. - - -The output of "cat rcu/rcu_preempt/rcuexp" looks as follows: - -s=21872 wd1=0 wd2=0 wd3=5 enq=0 sc=21872 - -These fields are as follows: - -o "s" is the sequence number, with an odd number indicating that - an expedited grace period is in progress. - -o "wd1", "wd2", and "wd3" are the number of times that an attempt - to start an expedited grace period found that someone else had - completed an expedited grace period that satisfies the attempted - request. "Our work is done." - -o "enq" is the number of quiescent states still outstanding. - -o "sc" is the number of times that the attempt to start a - new expedited grace period succeeded. - - -The output of "cat rcu/rcu_preempt/rcugp" looks as follows: - -completed=31249 gpnum=31250 age=1 max=18 - -These fields are taken from the rcu_state structure, and are as follows: - -o "completed" is the number of grace periods that have completed. - It is comparable to the "c" field from rcu/rcudata in that a - CPU whose "c" field matches the value of "completed" is aware - that the corresponding RCU grace period has completed. - -o "gpnum" is the number of grace periods that have started. It is - similarly comparable to the "g" field from rcu/rcudata in that - a CPU whose "g" field matches the value of "gpnum" is aware that - the corresponding RCU grace period has started. - - If these two fields are equal, then there is no grace period - in progress, in other words, RCU is idle. On the other hand, - if the two fields differ (as they are above), then an RCU grace - period is in progress. - -o "age" is the number of jiffies that the current grace period - has extended for, or zero if there is no grace period currently - in effect. - -o "max" is the age in jiffies of the longest-duration grace period - thus far. - -The output of "cat rcu/rcu_preempt/rcuhier" looks as follows: - -c=14407 g=14408 s=0 jfq=2 j=c863 nfqs=12040/nfqsng=0(12040) fqlh=1051 oqlen=0/0 -3/3 ..>. 0:7 ^0 -e/e ..>. 0:3 ^0 d/d ..>. 4:7 ^1 - -The fields are as follows: - -o "c" is exactly the same as "completed" under rcu/rcu_preempt/rcugp. - -o "g" is exactly the same as "gpnum" under rcu/rcu_preempt/rcugp. - -o "s" is the current state of the force_quiescent_state() - state machine. - -o "jfq" is the number of jiffies remaining for this grace period - before force_quiescent_state() is invoked to help push things - along. Note that CPUs in idle mode throughout the grace period - will not report on their own, but rather must be check by some - other CPU via force_quiescent_state(). - -o "j" is the low-order four hex digits of the jiffies counter. - Yes, Paul did run into a number of problems that turned out to - be due to the jiffies counter no longer counting. Why do you ask? - -o "nfqs" is the number of calls to force_quiescent_state() since - boot. - -o "nfqsng" is the number of useless calls to force_quiescent_state(), - where there wasn't actually a grace period active. This can - no longer happen due to grace-period processing being pushed - into a kthread. The number in parentheses is the difference - between "nfqs" and "nfqsng", or the number of times that - force_quiescent_state() actually did some real work. - -o "fqlh" is the number of calls to force_quiescent_state() that - exited immediately (without even being counted in nfqs above) - due to contention on ->fqslock. - -o Each element of the form "3/3 ..>. 0:7 ^0" represents one rcu_node - structure. Each line represents one level of the hierarchy, - from root to leaves. It is best to think of the rcu_data - structures as forming yet another level after the leaves. - Note that there might be either one, two, three, or even four - levels of rcu_node structures, depending on the relationship - between CONFIG_RCU_FANOUT, CONFIG_RCU_FANOUT_LEAF (possibly - adjusted using the rcu_fanout_leaf kernel boot parameter), and - CONFIG_NR_CPUS (possibly adjusted using the nr_cpu_ids count of - possible CPUs for the booting hardware). - - o The numbers separated by the "/" are the qsmask followed - by the qsmaskinit. The qsmask will have one bit - set for each entity in the next lower level that has - not yet checked in for the current grace period ("e" - indicating CPUs 5, 6, and 7 in the example above). - The qsmaskinit will have one bit for each entity that is - currently expected to check in during each grace period. - The value of qsmaskinit is assigned to that of qsmask - at the beginning of each grace period. - - o The characters separated by the ">" indicate the state - of the blocked-tasks lists. A "G" preceding the ">" - indicates that at least one task blocked in an RCU - read-side critical section blocks the current grace - period, while a "E" preceding the ">" indicates that - at least one task blocked in an RCU read-side critical - section blocks the current expedited grace period. - A "T" character following the ">" indicates that at - least one task is blocked within an RCU read-side - critical section, regardless of whether any current - grace period (expedited or normal) is inconvenienced. - A "." character appears if the corresponding condition - does not hold, so that "..>." indicates that no tasks - are blocked. In contrast, "GE>T" indicates maximal - inconvenience from blocked tasks. CONFIG_TREE_RCU - builds of the kernel will always show "..>.". - - o The numbers separated by the ":" are the range of CPUs - served by this struct rcu_node. This can be helpful - in working out how the hierarchy is wired together. - - For example, the example rcu_node structure shown above - has "0:7", indicating that it covers CPUs 0 through 7. - - o The number after the "^" indicates the bit in the - next higher level rcu_node structure that this rcu_node - structure corresponds to. For example, the "d/d ..>. 4:7 - ^1" has a "1" in this position, indicating that it - corresponds to the "1" bit in the "3" shown in the - "3/3 ..>. 0:7 ^0" entry on the next level up. - - -The output of "cat rcu/rcu_sched/rcu_pending" looks as follows: - - 0!np=26111 qsp=29 rpq=5386 cbr=1 cng=570 gpc=3674 gps=577 nn=15903 ndw=0 - 1!np=28913 qsp=35 rpq=6097 cbr=1 cng=448 gpc=3700 gps=554 nn=18113 ndw=0 - 2!np=32740 qsp=37 rpq=6202 cbr=0 cng=476 gpc=4627 gps=546 nn=20889 ndw=0 - 3 np=23679 qsp=22 rpq=5044 cbr=1 cng=415 gpc=3403 gps=347 nn=14469 ndw=0 - 4!np=30714 qsp=4 rpq=5574 cbr=0 cng=528 gpc=3931 gps=639 nn=20042 ndw=0 - 5 np=28910 qsp=2 rpq=5246 cbr=0 cng=428 gpc=4105 gps=709 nn=18422 ndw=0 - 6!np=38648 qsp=5 rpq=7076 cbr=0 cng=840 gpc=4072 gps=961 nn=25699 ndw=0 - 7 np=37275 qsp=2 rpq=6873 cbr=0 cng=868 gpc=3416 gps=971 nn=25147 ndw=0 - -The fields are as follows: - -o The leading number is the CPU number, with "!" indicating - an offline CPU. - -o "np" is the number of times that __rcu_pending() has been invoked - for the corresponding flavor of RCU. - -o "qsp" is the number of times that the RCU was waiting for a - quiescent state from this CPU. - -o "rpq" is the number of times that the CPU had passed through - a quiescent state, but not yet reported it to RCU. - -o "cbr" is the number of times that this CPU had RCU callbacks - that had passed through a grace period, and were thus ready - to be invoked. - -o "cng" is the number of times that this CPU needed another - grace period while RCU was idle. - -o "gpc" is the number of times that an old grace period had - completed, but this CPU was not yet aware of it. - -o "gps" is the number of times that a new grace period had started, - but this CPU was not yet aware of it. - -o "ndw" is the number of times that a wakeup of an rcuo - callback-offload kthread had to be deferred in order to avoid - deadlock. - -o "nn" is the number of times that this CPU needed nothing. - - -The output of "cat rcu/rcuboost" looks as follows: - -0:3 tasks=.... kt=W ntb=0 neb=0 nnb=0 j=c864 bt=c894 - balk: nt=0 egt=4695 bt=0 nb=0 ny=56 nos=0 -4:7 tasks=.... kt=W ntb=0 neb=0 nnb=0 j=c864 bt=c894 - balk: nt=0 egt=6541 bt=0 nb=0 ny=126 nos=0 - -This information is output only for rcu_preempt. Each two-line entry -corresponds to a leaf rcu_node structure. The fields are as follows: - -o "n:m" is the CPU-number range for the corresponding two-line - entry. In the sample output above, the first entry covers - CPUs zero through three and the second entry covers CPUs four - through seven. - -o "tasks=TNEB" gives the state of the various segments of the - rnp->blocked_tasks list: - - "T" This indicates that there are some tasks that blocked - while running on one of the corresponding CPUs while - in an RCU read-side critical section. - - "N" This indicates that some of the blocked tasks are preventing - the current normal (non-expedited) grace period from - completing. - - "E" This indicates that some of the blocked tasks are preventing - the current expedited grace period from completing. - - "B" This indicates that some of the blocked tasks are in - need of RCU priority boosting. - - Each character is replaced with "." if the corresponding - condition does not hold. - -o "kt" is the state of the RCU priority-boosting kernel - thread associated with the corresponding rcu_node structure. - The state can be one of the following: - - "S" The kernel thread is stopped, in other words, all - CPUs corresponding to this rcu_node structure are - offline. - - "R" The kernel thread is running. - - "W" The kernel thread is waiting because there is no work - for it to do. - - "Y" The kernel thread is yielding to avoid hogging CPU. - - "?" Unknown value, indicates a bug. - -o "ntb" is the number of tasks boosted. - -o "neb" is the number of tasks boosted in order to complete an - expedited grace period. - -o "nnb" is the number of tasks boosted in order to complete a - normal (non-expedited) grace period. When boosting a task - that was blocking both an expedited and a normal grace period, - it is counted against the expedited total above. - -o "j" is the low-order 16 bits of the jiffies counter in - hexadecimal. - -o "bt" is the low-order 16 bits of the value that the jiffies - counter will have when we next start boosting, assuming that - the current grace period does not end beforehand. This is - also in hexadecimal. - -o "balk: nt" counts the number of times we didn't boost (in - other words, we balked) even though it was time to boost because - there were no blocked tasks to boost. This situation occurs - when there is one blocked task on one rcu_node structure and - none on some other rcu_node structure. - -o "egt" counts the number of times we balked because although - there were blocked tasks, none of them were blocking the - current grace period, whether expedited or otherwise. - -o "bt" counts the number of times we balked because boosting - had already been initiated for the current grace period. - -o "nb" counts the number of times we balked because there - was at least one task blocking the current non-expedited grace - period that never had blocked. If it is already running, it - just won't help to boost its priority! - -o "ny" counts the number of times we balked because it was - not yet time to start boosting. - -o "nos" counts the number of times we balked for other - reasons, e.g., the grace period ended first. - - -CONFIG_TINY_RCU debugfs Files and Formats - -These implementations of RCU provides a single debugfs file under the -top-level directory RCU, namely rcu/rcudata, which displays fields in -rcu_bh_ctrlblk and rcu_sched_ctrlblk. - -The output of "cat rcu/rcudata" is as follows: - -rcu_sched: qlen: 0 -rcu_bh: qlen: 0 - -This is split into rcu_sched and rcu_bh sections. The field is as -follows: - -o "qlen" is the number of RCU callbacks currently waiting either - for an RCU grace period or waiting to be invoked. This is the - only field present for rcu_sched and rcu_bh, due to the - short-circuiting of grace period in those two cases. diff --git a/Documentation/SAK.txt b/Documentation/SAK.txt index 74be14679ed8..260e1d3687bd 100644 --- a/Documentation/SAK.txt +++ b/Documentation/SAK.txt @@ -1,5 +1,9 @@ -Linux 2.4.2 Secure Attention Key (SAK) handling -18 March 2001, Andrew Morton +========================================= +Linux Secure Attention Key (SAK) handling +========================================= + +:Date: 18 March 2001 +:Author: Andrew Morton An operating system's Secure Attention Key is a security tool which is provided as protection against trojan password capturing programs. It @@ -13,7 +17,7 @@ this sequence. It is only available if the kernel was compiled with sysrq support. The proper way of generating a SAK is to define the key sequence using -`loadkeys'. This will work whether or not sysrq support is compiled +``loadkeys``. This will work whether or not sysrq support is compiled into the kernel. SAK works correctly when the keyboard is in raw mode. This means that @@ -25,64 +29,63 @@ What key sequence should you use? Well, CTRL-ALT-DEL is used to reboot the machine. CTRL-ALT-BACKSPACE is magical to the X server. We'll choose CTRL-ALT-PAUSE. -In your rc.sysinit (or rc.local) file, add the command +In your rc.sysinit (or rc.local) file, add the command:: echo "control alt keycode 101 = SAK" | /bin/loadkeys And that's it! Only the superuser may reprogram the SAK key. -NOTES -===== +.. note:: -1: Linux SAK is said to be not a "true SAK" as is required by - systems which implement C2 level security. This author does not - know why. + 1. Linux SAK is said to be not a "true SAK" as is required by + systems which implement C2 level security. This author does not + know why. -2: On the PC keyboard, SAK kills all applications which have - /dev/console opened. + 2. On the PC keyboard, SAK kills all applications which have + /dev/console opened. - Unfortunately this includes a number of things which you don't - actually want killed. This is because these applications are - incorrectly holding /dev/console open. Be sure to complain to your - Linux distributor about this! + Unfortunately this includes a number of things which you don't + actually want killed. This is because these applications are + incorrectly holding /dev/console open. Be sure to complain to your + Linux distributor about this! - You can identify processes which will be killed by SAK with the - command + You can identify processes which will be killed by SAK with the + command:: # ls -l /proc/[0-9]*/fd/* | grep console l-wx------ 1 root root 64 Mar 18 00:46 /proc/579/fd/0 -> /dev/console - Then: + Then:: # ps aux|grep 579 root 579 0.0 0.1 1088 436 ? S 00:43 0:00 gpm -t ps/2 - So `gpm' will be killed by SAK. This is a bug in gpm. It should - be closing standard input. You can work around this by finding the - initscript which launches gpm and changing it thusly: + So ``gpm`` will be killed by SAK. This is a bug in gpm. It should + be closing standard input. You can work around this by finding the + initscript which launches gpm and changing it thusly: - Old: + Old:: daemon gpm - New: + New:: daemon gpm < /dev/null - Vixie cron also seems to have this problem, and needs the same treatment. + Vixie cron also seems to have this problem, and needs the same treatment. - Also, one prominent Linux distribution has the following three - lines in its rc.sysinit and rc scripts: + Also, one prominent Linux distribution has the following three + lines in its rc.sysinit and rc scripts:: exec 3<&0 exec 4>&1 exec 5>&2 - These commands cause *all* daemons which are launched by the - initscripts to have file descriptors 3, 4 and 5 attached to - /dev/console. So SAK kills them all. A workaround is to simply - delete these lines, but this may cause system management - applications to malfunction - test everything well. + These commands cause **all** daemons which are launched by the + initscripts to have file descriptors 3, 4 and 5 attached to + /dev/console. So SAK kills them all. A workaround is to simply + delete these lines, but this may cause system management + applications to malfunction - test everything well. diff --git a/Documentation/SM501.txt b/Documentation/SM501.txt index 561826f82093..882507453ba4 100644 --- a/Documentation/SM501.txt +++ b/Documentation/SM501.txt @@ -1,7 +1,10 @@ - SM501 Driver - ============ +.. include:: <isonum.txt> -Copyright 2006, 2007 Simtec Electronics +============ +SM501 Driver +============ + +:Copyright: |copy| 2006, 2007 Simtec Electronics The Silicon Motion SM501 multimedia companion chip is a multifunction device which may provide numerous interfaces including USB host controller USB gadget, diff --git a/Documentation/acpi/gpio-properties.txt b/Documentation/acpi/gpio-properties.txt index 2aff0349facd..88c65cb5bf0a 100644 --- a/Documentation/acpi/gpio-properties.txt +++ b/Documentation/acpi/gpio-properties.txt @@ -156,3 +156,68 @@ pointed to by its first argument. That should be done in the driver's .probe() routine. On removal, the driver should unregister its GPIO mapping table by calling acpi_dev_remove_driver_gpios() on the ACPI device object where that table was previously registered. + +Using the _CRS fallback +----------------------- + +If a device does not have _DSD or the driver does not create ACPI GPIO +mapping, the Linux GPIO framework refuses to return any GPIOs. This is +because the driver does not know what it actually gets. For example if we +have a device like below: + + Device (BTH) + { + Name (_HID, ...) + + Name (_CRS, ResourceTemplate () { + GpioIo (Exclusive, PullNone, 0, 0, IoRestrictionNone, + "\\_SB.GPO0", 0, ResourceConsumer) {15} + GpioIo (Exclusive, PullNone, 0, 0, IoRestrictionNone, + "\\_SB.GPO0", 0, ResourceConsumer) {27} + }) + } + +The driver might expect to get the right GPIO when it does: + + desc = gpiod_get(dev, "reset", GPIOD_OUT_LOW); + +but since there is no way to know the mapping between "reset" and +the GpioIo() in _CRS desc will hold ERR_PTR(-ENOENT). + +The driver author can solve this by passing the mapping explictly +(the recommended way and documented in the above chapter). + +The ACPI GPIO mapping tables should not contaminate drivers that are not +knowing about which exact device they are servicing on. It implies that +the ACPI GPIO mapping tables are hardly linked to ACPI ID and certain +objects, as listed in the above chapter, of the device in question. + +Getting GPIO descriptor +----------------------- + +There are two main approaches to get GPIO resource from ACPI: + desc = gpiod_get(dev, connection_id, flags); + desc = gpiod_get_index(dev, connection_id, index, flags); + +We may consider two different cases here, i.e. when connection ID is +provided and otherwise. + +Case 1: + desc = gpiod_get(dev, "non-null-connection-id", flags); + desc = gpiod_get_index(dev, "non-null-connection-id", index, flags); + +Case 2: + desc = gpiod_get(dev, NULL, flags); + desc = gpiod_get_index(dev, NULL, index, flags); + +Case 1 assumes that corresponding ACPI device description must have +defined device properties and will prevent to getting any GPIO resources +otherwise. + +Case 2 explicitly tells GPIO core to look for resources in _CRS. + +Be aware that gpiod_get_index() in cases 1 and 2, assuming that there +are two versions of ACPI device description provided and no mapping is +present in the driver, will return different resources. That's why a +certain driver has to handle them carefully as explained in previous +chapter. diff --git a/Documentation/security/LoadPin.txt b/Documentation/admin-guide/LSM/LoadPin.rst index e11877f5d3d4..32070762d24c 100644 --- a/Documentation/security/LoadPin.txt +++ b/Documentation/admin-guide/LSM/LoadPin.rst @@ -1,3 +1,7 @@ +======= +LoadPin +======= + LoadPin is a Linux Security Module that ensures all kernel-loaded files (modules, firmware, etc) all originate from the same filesystem, with the expectation that such a filesystem is backed by a read-only device @@ -5,13 +9,13 @@ such as dm-verity or CDROM. This allows systems that have a verified and/or unchangeable filesystem to enforce module and firmware loading restrictions without needing to sign the files individually. -The LSM is selectable at build-time with CONFIG_SECURITY_LOADPIN, and +The LSM is selectable at build-time with ``CONFIG_SECURITY_LOADPIN``, and can be controlled at boot-time with the kernel command line option -"loadpin.enabled". By default, it is enabled, but can be disabled at -boot ("loadpin.enabled=0"). +"``loadpin.enabled``". By default, it is enabled, but can be disabled at +boot ("``loadpin.enabled=0``"). LoadPin starts pinning when it sees the first file loaded. If the block device backing the filesystem is not read-only, a sysctl is -created to toggle pinning: /proc/sys/kernel/loadpin/enabled. (Having +created to toggle pinning: ``/proc/sys/kernel/loadpin/enabled``. (Having a mutable filesystem means pinning is mutable too, but having the sysctl allows for easy testing on systems with a mutable filesystem.) diff --git a/Documentation/security/SELinux.txt b/Documentation/admin-guide/LSM/SELinux.rst index 07eae00f3314..f722c9b4173a 100644 --- a/Documentation/security/SELinux.txt +++ b/Documentation/admin-guide/LSM/SELinux.rst @@ -1,27 +1,33 @@ +======= +SELinux +======= + If you want to use SELinux, chances are you will want to use the distro-provided policies, or install the latest reference policy release from + http://oss.tresys.com/projects/refpolicy However, if you want to install a dummy policy for -testing, you can do using 'mdp' provided under +testing, you can do using ``mdp`` provided under scripts/selinux. Note that this requires the selinux userspace to be installed - in particular you will need checkpolicy to compile a kernel, and setfiles and fixfiles to label the filesystem. 1. Compile the kernel with selinux enabled. - 2. Type 'make' to compile mdp. + 2. Type ``make`` to compile ``mdp``. 3. Make sure that you are not running with SELinux enabled and a real policy. If you are, reboot with selinux disabled before continuing. - 4. Run install_policy.sh: + 4. Run install_policy.sh:: + cd scripts/selinux sh install_policy.sh Step 4 will create a new dummy policy valid for your kernel, with a single selinux user, role, and type. -It will compile the policy, will set your SELINUXTYPE to -dummy in /etc/selinux/config, install the compiled policy -as 'dummy', and relabel your filesystem. +It will compile the policy, will set your ``SELINUXTYPE`` to +``dummy`` in ``/etc/selinux/config``, install the compiled policy +as ``dummy``, and relabel your filesystem. diff --git a/Documentation/security/Smack.txt b/Documentation/admin-guide/LSM/Smack.rst index 945cc633d883..6a5826a13aea 100644 --- a/Documentation/security/Smack.txt +++ b/Documentation/admin-guide/LSM/Smack.rst @@ -1,3 +1,6 @@ +===== +Smack +===== "Good for you, you've decided to clean the elevator!" @@ -14,6 +17,7 @@ available to determine which is best suited to the problem at hand. Smack consists of three major components: + - The kernel - Basic utilities, which are helpful but not required - Configuration data @@ -39,16 +43,24 @@ The current git repository for Smack user space is: This should make and install on most modern distributions. There are five commands included in smackutil: -chsmack - display or set Smack extended attribute values -smackctl - load the Smack access rules -smackaccess - report if a process with one label has access - to an object with another +chsmack: + display or set Smack extended attribute values + +smackctl: + load the Smack access rules + +smackaccess: + report if a process with one label has access + to an object with another These two commands are obsolete with the introduction of the smackfs/load2 and smackfs/cipso2 interfaces. -smackload - properly formats data for writing to smackfs/load -smackcipso - properly formats data for writing to smackfs/cipso +smackload: + properly formats data for writing to smackfs/load + +smackcipso: + properly formats data for writing to smackfs/cipso In keeping with the intent of Smack, configuration data is minimal and not strictly required. The most important @@ -56,15 +68,15 @@ configuration step is mounting the smackfs pseudo filesystem. If smackutil is installed the startup script will take care of this, but it can be manually as well. -Add this line to /etc/fstab: +Add this line to ``/etc/fstab``:: smackfs /sys/fs/smackfs smackfs defaults 0 0 -The /sys/fs/smackfs directory is created by the kernel. +The ``/sys/fs/smackfs`` directory is created by the kernel. Smack uses extended attributes (xattrs) to store labels on filesystem objects. The attributes are stored in the extended attribute security -name space. A process must have CAP_MAC_ADMIN to change any of these +name space. A process must have ``CAP_MAC_ADMIN`` to change any of these attributes. The extended attributes that Smack uses are: @@ -73,14 +85,17 @@ SMACK64 Used to make access control decisions. In almost all cases the label given to a new filesystem object will be the label of the process that created it. + SMACK64EXEC The Smack label of a process that execs a program file with this attribute set will run with this attribute's value. + SMACK64MMAP Don't allow the file to be mmapped by a process whose Smack label does not allow all of the access permitted to a process with the label contained in this attribute. This is a very specific use case for shared libraries. + SMACK64TRANSMUTE Can only have the value "TRUE". If this attribute is present on a directory when an object is created in the directory and @@ -89,27 +104,29 @@ SMACK64TRANSMUTE gets the label of the directory instead of the label of the creating process. If the object being created is a directory the SMACK64TRANSMUTE attribute is set as well. + SMACK64IPIN This attribute is only available on file descriptors for sockets. Use the Smack label in this attribute for access control decisions on packets being delivered to this socket. + SMACK64IPOUT This attribute is only available on file descriptors for sockets. Use the Smack label in this attribute for access control decisions on packets coming from this socket. -There are multiple ways to set a Smack label on a file: +There are multiple ways to set a Smack label on a file:: # attr -S -s SMACK64 -V "value" path # chsmack -a value path A process can see the Smack label it is running with by -reading /proc/self/attr/current. A process with CAP_MAC_ADMIN +reading ``/proc/self/attr/current``. A process with ``CAP_MAC_ADMIN`` can set the process Smack by writing there. Most Smack configuration is accomplished by writing to files in the smackfs filesystem. This pseudo-filesystem is mounted -on /sys/fs/smackfs. +on ``/sys/fs/smackfs``. access Provided for backward compatibility. The access2 interface @@ -120,6 +137,7 @@ access this file. The next read will indicate whether the access would be permitted. The text will be either "1" indicating access, or "0" indicating denial. + access2 This interface reports whether a subject with the specified Smack label has a particular access to an object with a @@ -127,13 +145,17 @@ access2 this file. The next read will indicate whether the access would be permitted. The text will be either "1" indicating access, or "0" indicating denial. + ambient This contains the Smack label applied to unlabeled network packets. + change-rule This interface allows modification of existing access control rules. - The format accepted on write is: + The format accepted on write is:: + "%s %s %s %s" + where the first string is the subject label, the second the object label, the third the access to allow and the fourth the access to deny. The access strings may contain only the characters @@ -141,47 +163,63 @@ change-rule modified by enabling the permissions in the third string and disabling those in the fourth string. If there is no such rule it will be created using the access specified in the third and the fourth strings. + cipso Provided for backward compatibility. The cipso2 interface is preferred and should be used instead. This interface allows a specific CIPSO header to be assigned - to a Smack label. The format accepted on write is: + to a Smack label. The format accepted on write is:: + "%24s%4d%4d"["%4d"]... + The first string is a fixed Smack label. The first number is the level to use. The second number is the number of categories. - The following numbers are the categories. - "level-3-cats-5-19 3 2 5 19" + The following numbers are the categories:: + + "level-3-cats-5-19 3 2 5 19" + cipso2 This interface allows a specific CIPSO header to be assigned - to a Smack label. The format accepted on write is: - "%s%4d%4d"["%4d"]... + to a Smack label. The format accepted on write is:: + + "%s%4d%4d"["%4d"]... + The first string is a long Smack label. The first number is the level to use. The second number is the number of categories. - The following numbers are the categories. - "level-3-cats-5-19 3 2 5 19" + The following numbers are the categories:: + + "level-3-cats-5-19 3 2 5 19" + direct This contains the CIPSO level used for Smack direct label representation in network packets. + doi This contains the CIPSO domain of interpretation used in network packets. + ipv6host This interface allows specific IPv6 internet addresses to be treated as single label hosts. Packets are sent to single label hosts only from processes that have Smack write access to the host label. All packets received from single label hosts - are given the specified label. The format accepted on write is: + are given the specified label. The format accepted on write is:: + "%h:%h:%h:%h:%h:%h:%h:%h label" or "%h:%h:%h:%h:%h:%h:%h:%h/%d label". + The "::" address shortcut is not supported. If label is "-DELETE" a matched entry will be deleted. + load Provided for backward compatibility. The load2 interface is preferred and should be used instead. This interface allows access control rules in addition to the system defined rules to be specified. The format accepted - on write is: + on write is:: + "%24s%24s%5s" + where the first string is the subject label, the second the object label, and the third the requested access. The access string may contain only the characters "rwxat-", and specifies @@ -189,17 +227,21 @@ load permissions that are not allowed. The string "r-x--" would specify read and execute access. Labels are limited to 23 characters in length. + load2 This interface allows access control rules in addition to the system defined rules to be specified. The format accepted - on write is: + on write is:: + "%s %s %s" + where the first string is the subject label, the second the object label, and the third the requested access. The access string may contain only the characters "rwxat-", and specifies which sort of access is allowed. The "-" is a placeholder for permissions that are not allowed. The string "r-x--" would specify read and execute access. + load-self Provided for backward compatibility. The load-self2 interface is preferred and should be used instead. @@ -208,66 +250,83 @@ load-self otherwise be permitted, and are intended to provide additional restrictions on the process. The format is the same as for the load interface. + load-self2 This interface allows process specific access rules to be defined. These rules are only consulted if access would otherwise be permitted, and are intended to provide additional restrictions on the process. The format is the same as for the load2 interface. + logging This contains the Smack logging state. + mapped This contains the CIPSO level used for Smack mapped label representation in network packets. + netlabel This interface allows specific internet addresses to be treated as single label hosts. Packets are sent to single label hosts without CIPSO headers, but only from processes that have Smack write access to the host label. All packets received from single label hosts are given the specified - label. The format accepted on write is: + label. The format accepted on write is:: + "%d.%d.%d.%d label" or "%d.%d.%d.%d/%d label". + If the label specified is "-CIPSO" the address is treated as a host that supports CIPSO headers. + onlycap This contains labels processes must have for CAP_MAC_ADMIN - and CAP_MAC_OVERRIDE to be effective. If this file is empty + and ``CAP_MAC_OVERRIDE`` to be effective. If this file is empty these capabilities are effective at for processes with any label. The values are set by writing the desired labels, separated by spaces, to the file or cleared by writing "-" to the file. + ptrace This is used to define the current ptrace policy - 0 - default: this is the policy that relies on Smack access rules. - For the PTRACE_READ a subject needs to have a read access on - object. For the PTRACE_ATTACH a read-write access is required. - 1 - exact: this is the policy that limits PTRACE_ATTACH. Attach is + + 0 - default: + this is the policy that relies on Smack access rules. + For the ``PTRACE_READ`` a subject needs to have a read access on + object. For the ``PTRACE_ATTACH`` a read-write access is required. + + 1 - exact: + this is the policy that limits ``PTRACE_ATTACH``. Attach is only allowed when subject's and object's labels are equal. - PTRACE_READ is not affected. Can be overridden with CAP_SYS_PTRACE. - 2 - draconian: this policy behaves like the 'exact' above with an - exception that it can't be overridden with CAP_SYS_PTRACE. + ``PTRACE_READ`` is not affected. Can be overridden with ``CAP_SYS_PTRACE``. + + 2 - draconian: + this policy behaves like the 'exact' above with an + exception that it can't be overridden with ``CAP_SYS_PTRACE``. + revoke-subject Writing a Smack label here sets the access to '-' for all access rules with that subject label. + unconfined - If the kernel is configured with CONFIG_SECURITY_SMACK_BRINGUP - a process with CAP_MAC_ADMIN can write a label into this interface. + If the kernel is configured with ``CONFIG_SECURITY_SMACK_BRINGUP`` + a process with ``CAP_MAC_ADMIN`` can write a label into this interface. Thereafter, accesses that involve that label will be logged and the access permitted if it wouldn't be otherwise. Note that this is dangerous and can ruin the proper labeling of your system. It should never be used in production. + relabel-self This interface contains a list of labels to which the process can - transition to, by writing to /proc/self/attr/current. + transition to, by writing to ``/proc/self/attr/current``. Normally a process can change its own label to any legal value, but only - if it has CAP_MAC_ADMIN. This interface allows a process without - CAP_MAC_ADMIN to relabel itself to one of labels from predefined list. - A process without CAP_MAC_ADMIN can change its label only once. When it + if it has ``CAP_MAC_ADMIN``. This interface allows a process without + ``CAP_MAC_ADMIN`` to relabel itself to one of labels from predefined list. + A process without ``CAP_MAC_ADMIN`` can change its label only once. When it does, this list will be cleared. The values are set by writing the desired labels, separated by spaces, to the file or cleared by writing "-" to the file. If you are using the smackload utility -you can add access rules in /etc/smack/accesses. They take the form: +you can add access rules in ``/etc/smack/accesses``. They take the form:: subjectlabel objectlabel access @@ -277,14 +336,14 @@ object with objectlabel. If there is no rule no access is allowed. Look for additional programs on http://schaufler-ca.com -From the Smack Whitepaper: - -The Simplified Mandatory Access Control Kernel +The Simplified Mandatory Access Control Kernel (Whitepaper) +=========================================================== Casey Schaufler casey@schaufler-ca.com Mandatory Access Control +------------------------ Computer systems employ a variety of schemes to constrain how information is shared among the people and services using the machine. Some of these schemes @@ -297,6 +356,7 @@ access control mechanisms because you don't have a choice regarding the users or programs that have access to pieces of data. Bell & LaPadula +--------------- From the middle of the 1980's until the turn of the century Mandatory Access Control (MAC) was very closely associated with the Bell & LaPadula security @@ -306,6 +366,7 @@ within the Capital Beltway and Scandinavian supercomputer centers but was often sited as failing to address general needs. Domain Type Enforcement +----------------------- Around the turn of the century Domain Type Enforcement (DTE) became popular. This scheme organizes users, programs, and data into domains that are @@ -316,6 +377,7 @@ necessary to provide a secure domain mapping leads to the scheme being disabled or used in limited ways in the majority of cases. Smack +----- Smack is a Mandatory Access Control mechanism designed to provide useful MAC while avoiding the pitfalls of its predecessors. The limitations of Bell & @@ -326,46 +388,55 @@ Enforcement and avoided by defining access controls in terms of the access modes already in use. Smack Terminology +----------------- The jargon used to talk about Smack will be familiar to those who have dealt with other MAC systems and shouldn't be too difficult for the uninitiated to pick up. There are four terms that are used in a specific way and that are especially important: - Subject: A subject is an active entity on the computer system. + Subject: + A subject is an active entity on the computer system. On Smack a subject is a task, which is in turn the basic unit of execution. - Object: An object is a passive entity on the computer system. + Object: + An object is a passive entity on the computer system. On Smack files of all types, IPC, and tasks can be objects. - Access: Any attempt by a subject to put information into or get + Access: + Any attempt by a subject to put information into or get information from an object is an access. - Label: Data that identifies the Mandatory Access Control + Label: + Data that identifies the Mandatory Access Control characteristics of a subject or an object. These definitions are consistent with the traditional use in the security community. There are also some terms from Linux that are likely to crop up: - Capability: A task that possesses a capability has permission to + Capability: + A task that possesses a capability has permission to violate an aspect of the system security policy, as identified by the specific capability. A task that possesses one or more capabilities is a privileged task, whereas a task with no capabilities is an unprivileged task. - Privilege: A task that is allowed to violate the system security + Privilege: + A task that is allowed to violate the system security policy is said to have privilege. As of this writing a task can have privilege either by possessing capabilities or by having an effective user of root. Smack Basics +------------ Smack is an extension to a Linux system. It enforces additional restrictions on what subjects can access which objects, based on the labels attached to each of the subject and the object. Labels +~~~~~~ Smack labels are ASCII character strings. They can be up to 255 characters long, but keeping them to twenty-three characters is recommended. @@ -377,7 +448,7 @@ contain unprintable characters, the "/" (slash), the "\" (backslash), the "'" (quote) and '"' (double-quote) characters. Smack labels cannot begin with a '-'. This is reserved for special options. -There are some predefined labels: +There are some predefined labels:: _ Pronounced "floor", a single underscore character. ^ Pronounced "hat", a single circumflex character. @@ -390,14 +461,18 @@ of a process will usually be assigned by the system initialization mechanism. Access Rules +~~~~~~~~~~~~ Smack uses the traditional access modes of Linux. These modes are read, execute, write, and occasionally append. There are a few cases where the access mode may not be obvious. These include: - Signals: A signal is a write operation from the subject task to + Signals: + A signal is a write operation from the subject task to the object task. - Internet Domain IPC: Transmission of a packet is considered a + + Internet Domain IPC: + Transmission of a packet is considered a write operation from the source task to the destination task. Smack restricts access based on the label attached to a subject and the label @@ -417,6 +492,7 @@ order: 7. Any other access is denied. Smack Access Rules +~~~~~~~~~~~~~~~~~~ With the isolation provided by Smack access separation is simple. There are many interesting cases where limited access by subjects to objects with @@ -427,8 +503,9 @@ be "born" highly classified. To accommodate such schemes Smack includes a mechanism for specifying rules allowing access between labels. Access Rule Format +~~~~~~~~~~~~~~~~~~ -The format of an access rule is: +The format of an access rule is:: subject-label object-label access @@ -446,7 +523,7 @@ describe access modes: Uppercase values for the specification letters are allowed as well. Access mode specifications can be in any order. Examples of acceptable rules -are: +are:: TopSecret Secret rx Secret Unclass R @@ -456,7 +533,7 @@ are: New Old rRrRr Closed Off - -Examples of unacceptable rules are: +Examples of unacceptable rules are:: Top Secret Secret rx Ace Ace r @@ -469,6 +546,7 @@ access specifications. The dash is a placeholder, so "a-r" is the same as "ar". A lone dash is used to specify that no access should be allowed. Applying Access Rules +~~~~~~~~~~~~~~~~~~~~~ The developers of Linux rarely define new sorts of things, usually importing schemes and concepts from other systems. Most often, the other systems are @@ -511,6 +589,7 @@ one process to another requires that the sender have write access to the receiver. The receiver is not required to have read access to the sender. Setting Access Rules +~~~~~~~~~~~~~~~~~~~~ The configuration file /etc/smack/accesses contains the rules to be set at system startup. The contents are written to the special file @@ -520,6 +599,7 @@ one rule, with the most recently specified overriding any earlier specification. Task Attribute +~~~~~~~~~~~~~~ The Smack label of a process can be read from /proc/<pid>/attr/current. A process can read its own Smack label from /proc/self/attr/current. A @@ -527,12 +607,14 @@ privileged process can change its own Smack label by writing to /proc/self/attr/current but not the label of another process. File Attribute +~~~~~~~~~~~~~~ The Smack label of a filesystem object is stored as an extended attribute named SMACK64 on the file. This attribute is in the security namespace. It can only be changed by a process with privilege. Privilege +~~~~~~~~~ A process with CAP_MAC_OVERRIDE or CAP_MAC_ADMIN is privileged. CAP_MAC_OVERRIDE allows the process access to objects it would @@ -540,6 +622,7 @@ be denied otherwise. CAP_MAC_ADMIN allows a process to change Smack data, including rules and attributes. Smack Networking +~~~~~~~~~~~~~~~~ As mentioned before, Smack enforces access control on network protocol transmissions. Every packet sent by a Smack process is tagged with its Smack @@ -551,6 +634,7 @@ packet has write access to the receiving process and if that is not the case the packet is dropped. CIPSO Configuration +~~~~~~~~~~~~~~~~~~~ It is normally unnecessary to specify the CIPSO configuration. The default values used by the system handle all internal cases. Smack will compose CIPSO @@ -571,13 +655,13 @@ discarded. The DOI is 3 by default. The value can be read from The label and category set are mapped to a Smack label as defined in /etc/smack/cipso. -A Smack/CIPSO mapping has the form: +A Smack/CIPSO mapping has the form:: smack level [category [category]*] Smack does not expect the level or category sets to be related in any particular way and does not assume or assign accesses based on them. Some -examples of mappings: +examples of mappings:: TopSecret 7 TS:A,B 7 1 2 @@ -597,25 +681,30 @@ value can be read from /sys/fs/smackfs/direct and changed by writing to /sys/fs/smackfs/direct. Socket Attributes +~~~~~~~~~~~~~~~~~ There are two attributes that are associated with sockets. These attributes can only be set by privileged tasks, but any task can read them for their own sockets. - SMACK64IPIN: The Smack label of the task object. A privileged + SMACK64IPIN: + The Smack label of the task object. A privileged program that will enforce policy may set this to the star label. - SMACK64IPOUT: The Smack label transmitted with outgoing packets. + SMACK64IPOUT: + The Smack label transmitted with outgoing packets. A privileged program may set this to match the label of another task with which it hopes to communicate. Smack Netlabel Exceptions +~~~~~~~~~~~~~~~~~~~~~~~~~ You will often find that your labeled application has to talk to the outside, unlabeled world. To do this there's a special file /sys/fs/smackfs/netlabel -where you can add some exceptions in the form of : -@IP1 LABEL1 or -@IP2/MASK LABEL2 +where you can add some exceptions in the form of:: + + @IP1 LABEL1 or + @IP2/MASK LABEL2 It means that your application will have unlabeled access to @IP1 if it has write access on LABEL1, and access to the subnet @IP2/MASK if it has write @@ -624,28 +713,32 @@ access on LABEL2. Entries in the /sys/fs/smackfs/netlabel file are matched by longest mask first, like in classless IPv4 routing. -A special label '@' and an option '-CIPSO' can be used there : -@ means Internet, any application with any label has access to it --CIPSO means standard CIPSO networking +A special label '@' and an option '-CIPSO' can be used there:: -If you don't know what CIPSO is and don't plan to use it, you can just do : -echo 127.0.0.1 -CIPSO > /sys/fs/smackfs/netlabel -echo 0.0.0.0/0 @ > /sys/fs/smackfs/netlabel + @ means Internet, any application with any label has access to it + -CIPSO means standard CIPSO networking + +If you don't know what CIPSO is and don't plan to use it, you can just do:: + + echo 127.0.0.1 -CIPSO > /sys/fs/smackfs/netlabel + echo 0.0.0.0/0 @ > /sys/fs/smackfs/netlabel If you use CIPSO on your 192.168.0.0/16 local network and need also unlabeled -Internet access, you can have : -echo 127.0.0.1 -CIPSO > /sys/fs/smackfs/netlabel -echo 192.168.0.0/16 -CIPSO > /sys/fs/smackfs/netlabel -echo 0.0.0.0/0 @ > /sys/fs/smackfs/netlabel +Internet access, you can have:: + echo 127.0.0.1 -CIPSO > /sys/fs/smackfs/netlabel + echo 192.168.0.0/16 -CIPSO > /sys/fs/smackfs/netlabel + echo 0.0.0.0/0 @ > /sys/fs/smackfs/netlabel Writing Applications for Smack +------------------------------ There are three sorts of applications that will run on a Smack system. How an application interacts with Smack will determine what it will have to do to work properly under Smack. Smack Ignorant Applications +--------------------------- By far the majority of applications have no reason whatever to care about the unique properties of Smack. Since invoking a program has no impact on the @@ -653,12 +746,14 @@ Smack label associated with the process the only concern likely to arise is whether the process has execute access to the program. Smack Relevant Applications +--------------------------- Some programs can be improved by teaching them about Smack, but do not make any security decisions themselves. The utility ls(1) is one example of such a program. Smack Enforcing Applications +---------------------------- These are special programs that not only know about Smack, but participate in the enforcement of system policy. In most cases these are the programs that @@ -666,15 +761,16 @@ set up user sessions. There are also network services that provide information to processes running with various labels. File System Interfaces +---------------------- Smack maintains labels on file system objects using extended attributes. The Smack label of a file, directory, or other file system object can be obtained -using getxattr(2). +using getxattr(2):: len = getxattr("/", "security.SMACK64", value, sizeof (value)); will put the Smack label of the root directory into value. A privileged -process can set the Smack label of a file system object with setxattr(2). +process can set the Smack label of a file system object with setxattr(2):: len = strlen("Rubble"); rc = setxattr("/foo", "security.SMACK64", "Rubble", len, 0); @@ -683,17 +779,18 @@ will set the Smack label of /foo to "Rubble" if the program has appropriate privilege. Socket Interfaces +----------------- The socket attributes can be read using fgetxattr(2). A privileged process can set the Smack label of outgoing packets with -fsetxattr(2). +fsetxattr(2):: len = strlen("Rubble"); rc = fsetxattr(fd, "security.SMACK64IPOUT", "Rubble", len, 0); will set the Smack label "Rubble" on packets going out from the socket if the -program has appropriate privilege. +program has appropriate privilege:: rc = fsetxattr(fd, "security.SMACK64IPIN, "*", strlen("*"), 0); @@ -701,33 +798,40 @@ will set the Smack label "*" as the object label against which incoming packets will be checked if the program has appropriate privilege. Administration +-------------- Smack supports some mount options: - smackfsdef=label: specifies the label to give files that lack + smackfsdef=label: + specifies the label to give files that lack the Smack label extended attribute. - smackfsroot=label: specifies the label to assign the root of the + smackfsroot=label: + specifies the label to assign the root of the file system if it lacks the Smack extended attribute. - smackfshat=label: specifies a label that must have read access to + smackfshat=label: + specifies a label that must have read access to all labels set on the filesystem. Not yet enforced. - smackfsfloor=label: specifies a label to which all labels set on the + smackfsfloor=label: + specifies a label to which all labels set on the filesystem must have read access. Not yet enforced. These mount options apply to all file system types. Smack auditing +-------------- If you want Smack auditing of security events, you need to set CONFIG_AUDIT in your kernel configuration. By default, all denied events will be audited. You can change this behavior by -writing a single character to the /sys/fs/smackfs/logging file : -0 : no logging -1 : log denied (default) -2 : log accepted -3 : log denied & accepted +writing a single character to the /sys/fs/smackfs/logging file:: + + 0 : no logging + 1 : log denied (default) + 2 : log accepted + 3 : log denied & accepted Events are logged as 'key=value' pairs, for each event you at least will get the subject, the object, the rights requested, the action, the kernel function @@ -735,6 +839,7 @@ that triggered the event, plus other pairs depending on the type of event audited. Bringup Mode +------------ Bringup mode provides logging features that can make application configuration and system bringup easier. Configure the kernel with diff --git a/Documentation/security/Yama.txt b/Documentation/admin-guide/LSM/Yama.rst index d9ee7d7a6c7f..13468ea696b7 100644 --- a/Documentation/security/Yama.txt +++ b/Documentation/admin-guide/LSM/Yama.rst @@ -1,13 +1,14 @@ +==== +Yama +==== + Yama is a Linux Security Module that collects system-wide DAC security protections that are not handled by the core kernel itself. This is -selectable at build-time with CONFIG_SECURITY_YAMA, and can be controlled -at run-time through sysctls in /proc/sys/kernel/yama: - -- ptrace_scope +selectable at build-time with ``CONFIG_SECURITY_YAMA``, and can be controlled +at run-time through sysctls in ``/proc/sys/kernel/yama``: -============================================================== - -ptrace_scope: +ptrace_scope +============ As Linux grows in popularity, it will become a larger target for malware. One particularly troubling weakness of the Linux process @@ -25,47 +26,49 @@ exist and remain possible if ptrace is allowed to operate as before. Since ptrace is not commonly used by non-developers and non-admins, system builders should be allowed the option to disable this debugging system. -For a solution, some applications use prctl(PR_SET_DUMPABLE, ...) to +For a solution, some applications use ``prctl(PR_SET_DUMPABLE, ...)`` to specifically disallow such ptrace attachment (e.g. ssh-agent), but many do not. A more general solution is to only allow ptrace directly from a parent to a child process (i.e. direct "gdb EXE" and "strace EXE" still -work), or with CAP_SYS_PTRACE (i.e. "gdb --pid=PID", and "strace -p PID" +work), or with ``CAP_SYS_PTRACE`` (i.e. "gdb --pid=PID", and "strace -p PID" still work as root). In mode 1, software that has defined application-specific relationships between a debugging process and its inferior (crash handlers, etc), -prctl(PR_SET_PTRACER, pid, ...) can be used. An inferior can declare which -other process (and its descendants) are allowed to call PTRACE_ATTACH +``prctl(PR_SET_PTRACER, pid, ...)`` can be used. An inferior can declare which +other process (and its descendants) are allowed to call ``PTRACE_ATTACH`` against it. Only one such declared debugging process can exists for each inferior at a time. For example, this is used by KDE, Chromium, and Firefox's crash handlers, and by Wine for allowing only Wine processes to ptrace each other. If a process wishes to entirely disable these ptrace -restrictions, it can call prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY, ...) +restrictions, it can call ``prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY, ...)`` so that any otherwise allowed process (even those in external pid namespaces) may attach. -The sysctl settings (writable only with CAP_SYS_PTRACE) are: +The sysctl settings (writable only with ``CAP_SYS_PTRACE``) are: -0 - classic ptrace permissions: a process can PTRACE_ATTACH to any other +0 - classic ptrace permissions: + a process can ``PTRACE_ATTACH`` to any other process running under the same uid, as long as it is dumpable (i.e. did not transition uids, start privileged, or have called - prctl(PR_SET_DUMPABLE...) already). Similarly, PTRACE_TRACEME is + ``prctl(PR_SET_DUMPABLE...)`` already). Similarly, ``PTRACE_TRACEME`` is unchanged. -1 - restricted ptrace: a process must have a predefined relationship - with the inferior it wants to call PTRACE_ATTACH on. By default, +1 - restricted ptrace: + a process must have a predefined relationship + with the inferior it wants to call ``PTRACE_ATTACH`` on. By default, this relationship is that of only its descendants when the above classic criteria is also met. To change the relationship, an - inferior can call prctl(PR_SET_PTRACER, debugger, ...) to declare - an allowed debugger PID to call PTRACE_ATTACH on the inferior. - Using PTRACE_TRACEME is unchanged. + inferior can call ``prctl(PR_SET_PTRACER, debugger, ...)`` to declare + an allowed debugger PID to call ``PTRACE_ATTACH`` on the inferior. + Using ``PTRACE_TRACEME`` is unchanged. -2 - admin-only attach: only processes with CAP_SYS_PTRACE may use ptrace - with PTRACE_ATTACH, or through children calling PTRACE_TRACEME. +2 - admin-only attach: + only processes with ``CAP_SYS_PTRACE`` may use ptrace + with ``PTRACE_ATTACH``, or through children calling ``PTRACE_TRACEME``. -3 - no attach: no processes may use ptrace with PTRACE_ATTACH nor via - PTRACE_TRACEME. Once set, this sysctl value cannot be changed. +3 - no attach: + no processes may use ptrace with ``PTRACE_ATTACH`` nor via + ``PTRACE_TRACEME``. Once set, this sysctl value cannot be changed. The original children-only logic was based on the restrictions in grsecurity. - -============================================================== diff --git a/Documentation/security/apparmor.txt b/Documentation/admin-guide/LSM/apparmor.rst index 93c1fd7d0635..3e9734bd0e05 100644 --- a/Documentation/security/apparmor.txt +++ b/Documentation/admin-guide/LSM/apparmor.rst @@ -1,4 +1,9 @@ ---- What is AppArmor? --- +======== +AppArmor +======== + +What is AppArmor? +================= AppArmor is MAC style security extension for the Linux kernel. It implements a task centered policy, with task "profiles" being created and loaded @@ -6,34 +11,41 @@ from user space. Tasks on the system that do not have a profile defined for them run in an unconfined state which is equivalent to standard Linux DAC permissions. ---- How to enable/disable --- +How to enable/disable +===================== + +set ``CONFIG_SECURITY_APPARMOR=y`` -set CONFIG_SECURITY_APPARMOR=y +If AppArmor should be selected as the default security module then set:: -If AppArmor should be selected as the default security module then - set CONFIG_DEFAULT_SECURITY="apparmor" - and CONFIG_SECURITY_APPARMOR_BOOTPARAM_VALUE=1 + CONFIG_DEFAULT_SECURITY="apparmor" + CONFIG_SECURITY_APPARMOR_BOOTPARAM_VALUE=1 Build the kernel If AppArmor is not the default security module it can be enabled by passing -security=apparmor on the kernel's command line. +``security=apparmor`` on the kernel's command line. If AppArmor is the default security module it can be disabled by passing -apparmor=0, security=XXXX (where XXX is valid security module), on the -kernel's command line +``apparmor=0, security=XXXX`` (where ``XXXX`` is valid security module), on the +kernel's command line. For AppArmor to enforce any restrictions beyond standard Linux DAC permissions policy must be loaded into the kernel from user space (see the Documentation and tools links). ---- Documentation --- +Documentation +============= -Documentation can be found on the wiki. +Documentation can be found on the wiki, linked below. ---- Links --- +Links +===== Mailing List - apparmor@lists.ubuntu.com + Wiki - http://apparmor.wiki.kernel.org/ + User space tools - https://launchpad.net/apparmor + Kernel module - git://git.kernel.org/pub/scm/linux/kernel/git/jj/apparmor-dev.git diff --git a/Documentation/security/LSM.txt b/Documentation/admin-guide/LSM/index.rst index c2683f28ed36..c980dfe9abf1 100644 --- a/Documentation/security/LSM.txt +++ b/Documentation/admin-guide/LSM/index.rst @@ -1,12 +1,13 @@ -Linux Security Module framework -------------------------------- +=========================== +Linux Security Module Usage +=========================== The Linux Security Module (LSM) framework provides a mechanism for various security checks to be hooked by new kernel extensions. The name "module" is a bit of a misnomer since these extensions are not actually loadable kernel modules. Instead, they are selectable at build-time via CONFIG_DEFAULT_SECURITY and can be overridden at boot-time via the -"security=..." kernel command line argument, in the case where multiple +``"security=..."`` kernel command line argument, in the case where multiple LSMs were built into a given kernel. The primary users of the LSM interface are Mandatory Access Control @@ -19,23 +20,22 @@ in the core functionality of Linux itself. Without a specific LSM built into the kernel, the default LSM will be the Linux capabilities system. Most LSMs choose to extend the capabilities system, building their checks on top of the defined capability hooks. -For more details on capabilities, see capabilities(7) in the Linux +For more details on capabilities, see ``capabilities(7)`` in the Linux man-pages project. A list of the active security modules can be found by reading -/sys/kernel/security/lsm. This is a comma separated list, and +``/sys/kernel/security/lsm``. This is a comma separated list, and will always include the capability module. The list reflects the order in which checks are made. The capability module will always be first, followed by any "minor" modules (e.g. Yama) and then the one "major" module (e.g. SELinux) if there is one configured. -Based on https://lkml.org/lkml/2007/10/26/215, -a new LSM is accepted into the kernel when its intent (a description of -what it tries to protect against and in what cases one would expect to -use it) has been appropriately documented in Documentation/security/. -This allows an LSM's code to be easily compared to its goals, and so -that end users and distros can make a more informed decision about which -LSMs suit their requirements. +.. toctree:: + :maxdepth: 1 -For extensive documentation on the available LSM hook interfaces, please -see include/linux/security.h. + apparmor + LoadPin + SELinux + Smack + tomoyo + Yama diff --git a/Documentation/security/tomoyo.txt b/Documentation/admin-guide/LSM/tomoyo.rst index 200a2d37cbc8..a5947218fa64 100644 --- a/Documentation/security/tomoyo.txt +++ b/Documentation/admin-guide/LSM/tomoyo.rst @@ -1,21 +1,30 @@ ---- What is TOMOYO? --- +====== +TOMOYO +====== + +What is TOMOYO? +=============== TOMOYO is a name-based MAC extension (LSM module) for the Linux kernel. LiveCD-based tutorials are available at + http://tomoyo.sourceforge.jp/1.7/1st-step/ubuntu10.04-live/ -http://tomoyo.sourceforge.jp/1.7/1st-step/centos5-live/ . +http://tomoyo.sourceforge.jp/1.7/1st-step/centos5-live/ + Though these tutorials use non-LSM version of TOMOYO, they are useful for you to know what TOMOYO is. ---- How to enable TOMOYO? --- +How to enable TOMOYO? +===================== -Build the kernel with CONFIG_SECURITY_TOMOYO=y and pass "security=tomoyo" on +Build the kernel with ``CONFIG_SECURITY_TOMOYO=y`` and pass ``security=tomoyo`` on kernel's command line. Please see http://tomoyo.sourceforge.jp/2.3/ for details. ---- Where is documentation? --- +Where is documentation? +======================= User <-> Kernel interface documentation is available at http://tomoyo.sourceforge.jp/2.3/policy-reference.html . @@ -42,7 +51,8 @@ History of TOMOYO? Realities of Mainlining http://sourceforge.jp/projects/tomoyo/docs/lfj2008.pdf ---- What is future plan? --- +What is future plan? +==================== We believe that inode based security and name based security are complementary and both should be used together. But unfortunately, so far, we cannot enable diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst index b96e80f79e85..b5343c5aa224 100644 --- a/Documentation/admin-guide/README.rst +++ b/Documentation/admin-guide/README.rst @@ -55,12 +55,6 @@ Documentation contains information about the problems, which may result by upgrading your kernel. - - The Documentation/DocBook/ subdirectory contains several guides for - kernel developers and users. These guides can be rendered in a - number of formats: PostScript (.ps), PDF, HTML, & man-pages, among others. - After installation, ``make psdocs``, ``make pdfdocs``, ``make htmldocs``, - or ``make mandocs`` will render the documentation in the requested format. - Installing the kernel source ---------------------------- diff --git a/Documentation/admin-guide/devices.txt b/Documentation/admin-guide/devices.txt index c9cea2e39c21..6b71852dadc2 100644 --- a/Documentation/admin-guide/devices.txt +++ b/Documentation/admin-guide/devices.txt @@ -369,8 +369,10 @@ 237 = /dev/loop-control Loopback control device 238 = /dev/vhost-net Host kernel accelerator for virtio net 239 = /dev/uhid User-space I/O driver support for HID subsystem + 240 = /dev/userio Serio driver testing device + 241 = /dev/vhost-vsock Host kernel driver for virtio vsock - 240-254 Reserved for local use + 242-254 Reserved for local use 255 Reserved for MISC_DYNAMIC_MINOR 11 char Raw keyboard device (Linux/SPARC only) diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index 8c60a8a32a1a..5bb9161dbe6a 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -61,6 +61,8 @@ configure specific aspects of kernel behavior to your liking. java ras pm/index + thunderbolt + LSM/index .. only:: subproject and html diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 7737ab5d04b2..d9c171ce4190 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -649,6 +649,13 @@ /proc/<pid>/coredump_filter. See also Documentation/filesystems/proc.txt. + coresight_cpu_debug.enable + [ARM,ARM64] + Format: <bool> + Enable/disable the CPU sampling based debugging. + 0: default value, disable debugging + 1: enable debugging at boot time + cpuidle.off=1 [CPU_IDLE] disable the cpuidle sub-system @@ -720,7 +727,8 @@ See also Documentation/input/joystick-parport.txt ddebug_query= [KNL,DYNAMIC_DEBUG] Enable debug messages at early boot - time. See Documentation/dynamic-debug-howto.txt for + time. See + Documentation/admin-guide/dynamic-debug-howto.rst for details. Deprecated, see dyndbg. debug [KNL] Enable kernel debugging (events log level). @@ -883,7 +891,8 @@ dyndbg[="val"] [KNL,DYNAMIC_DEBUG] module.dyndbg[="val"] Enable debug messages at boot time. See - Documentation/dynamic-debug-howto.txt for details. + Documentation/admin-guide/dynamic-debug-howto.rst + for details. nompx [X86] Disables Intel Memory Protection Extensions. See Documentation/x86/intel_mpx.txt for more @@ -954,6 +963,12 @@ must already be setup and configured. Options are not yet supported. + owl,<addr> + Start an early, polled-mode console on a serial port + of an Actions Semi SoC, such as S500 or S900, at the + specified address. The serial port must already be + setup and configured. Options are not yet supported. + smh Use ARM semihosting calls for early console. s3c2410,<addr> @@ -1486,12 +1501,21 @@ in crypto/hash_info.h. ima_policy= [IMA] - The builtin measurement policy to load during IMA - setup. Specyfing "tcb" as the value, measures all - programs exec'd, files mmap'd for exec, and all files - opened with the read mode bit set by either the - effective uid (euid=0) or uid=0. - Format: "tcb" + The builtin policies to load during IMA setup. + Format: "tcb | appraise_tcb | secure_boot" + + The "tcb" policy measures all programs exec'd, files + mmap'd for exec, and all files opened with the read + mode bit set by either the effective uid (euid=0) or + uid=0. + + The "appraise_tcb" policy appraises the integrity of + all files owned by root. (This is the equivalent + of ima_appraise_tcb.) + + The "secure_boot" policy appraises the integrity + of files (eg. kexec kernel image, kernel modules, + firmware, policy, etc) based on file signatures. ima_tcb [IMA] Deprecated. Use ima_policy= instead. Load a policy which meets the needs of the Trusted @@ -1838,6 +1862,18 @@ for all guests. Default is 1 (enabled) if in 64-bit or 32-bit PAE mode. + kvm-arm.vgic_v3_group0_trap= + [KVM,ARM] Trap guest accesses to GICv3 group-0 + system registers + + kvm-arm.vgic_v3_group1_trap= + [KVM,ARM] Trap guest accesses to GICv3 group-1 + system registers + + kvm-arm.vgic_v3_common_trap= + [KVM,ARM] Trap guest accesses to GICv3 common + system registers + kvm-intel.ept= [KVM,Intel] Disable extended page tables (virtualized MMU) support on capable Intel chips. Default is 1 (enabled) @@ -2136,6 +2172,12 @@ memmap=nn[KMG]@ss[KMG] [KNL] Force usage of a specific region of memory. Region of memory to be used is from ss to ss+nn. + If @ss[KMG] is omitted, it is equivalent to mem=nn[KMG], + which limits max address to nn[KMG]. + Multiple different regions can be specified, + comma delimited. + Example: + memmap=100M@2G,100M#3G,1G!1024G memmap=nn[KMG]#ss[KMG] [KNL,ACPI] Mark specific memory as ACPI data. @@ -2148,6 +2190,9 @@ memmap=64K$0x18690000 or memmap=0x10000$0x18690000 + Some bootloaders may need an escape character before '$', + like Grub2, otherwise '$' and the following number + will be eaten. memmap=nn[KMG]!ss[KMG] [KNL,X86] Mark specific memory as protected. @@ -2270,8 +2315,11 @@ that the amount of memory usable for all allocations is not too small. - movable_node [KNL] Boot-time switch to enable the effects - of CONFIG_MOVABLE_NODE=y. See mm/Kconfig for details. + movable_node [KNL] Boot-time switch to make hotplugable memory + NUMA nodes to be movable. This means that the memory + of such nodes will be usable only for movable + allocations which rules out almost all kernel + allocations. Use with caution! MTD_Partition= [MTD] Format: <name>,<region-number>,<size>,<offset> @@ -3238,21 +3286,17 @@ rcutree.gp_cleanup_delay= [KNL] Set the number of jiffies to delay each step of - RCU grace-period cleanup. This only has effect - when CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP is set. + RCU grace-period cleanup. rcutree.gp_init_delay= [KNL] Set the number of jiffies to delay each step of - RCU grace-period initialization. This only has - effect when CONFIG_RCU_TORTURE_TEST_SLOW_INIT - is set. + RCU grace-period initialization. rcutree.gp_preinit_delay= [KNL] Set the number of jiffies to delay each step of RCU grace-period pre-initialization, that is, the propagation of recent CPU-hotplug changes up - the rcu_node combining tree. This only has effect - when CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT is set. + the rcu_node combining tree. rcutree.rcu_fanout_exact= [KNL] Disable autobalancing of the rcu_node combining @@ -3328,6 +3372,17 @@ This wake_up() will be accompanied by a WARN_ONCE() splat and an ftrace_dump(). + rcuperf.gp_async= [KNL] + Measure performance of asynchronous + grace-period primitives such as call_rcu(). + + rcuperf.gp_async_max= [KNL] + Specify the maximum number of outstanding + callbacks per writer thread. When a writer + thread exceeds this limit, it invokes the + corresponding flavor of rcu_barrier() to allow + previously posted callbacks to drain. + rcuperf.gp_exp= [KNL] Measure performance of expedited synchronous grace-period primitives. @@ -3355,17 +3410,22 @@ rcuperf.perf_runnable= [BOOT] Start rcuperf running at boot time. + rcuperf.perf_type= [KNL] + Specify the RCU implementation to test. + rcuperf.shutdown= [KNL] Shut the system down after performance tests complete. This is useful for hands-off automated testing. - rcuperf.perf_type= [KNL] - Specify the RCU implementation to test. - rcuperf.verbose= [KNL] Enable additional printk() statements. + rcuperf.writer_holdoff= [KNL] + Write-side holdoff between grace periods, + in microseconds. The default of zero says + no holdoff. + rcutorture.cbflood_inter_holdoff= [KNL] Set holdoff time (jiffies) between successive callback-flood tests. @@ -3715,8 +3775,14 @@ slab_nomerge [MM] Disable merging of slabs with similar size. May be necessary if there is some reason to distinguish - allocs to different slabs. Debug options disable - merging on their own. + allocs to different slabs, especially in hardened + environments where the risk of heap overflows and + layout control by attackers can usually be + frustrated by disabling merging. This will reduce + most of the exposure of a heap attack to a single + cache (risks via metadata attacks are mostly + unchanged). Debug options disable merging on their + own. For more information see Documentation/vm/slub.txt. slab_max_order= [MM, SLAB] @@ -3803,6 +3869,15 @@ spia_pedr= spia_peddr= + srcutree.counter_wrap_check [KNL] + Specifies how frequently to check for + grace-period sequence counter wrap for the + srcu_data structure's ->srcu_gp_seq_needed field. + The greater the number of bits set in this kernel + parameter, the less frequently counter wrap will + be checked for. Note that the bottom two bits + are ignored. + srcutree.exp_holdoff [KNL] Specifies how many nanoseconds must elapse since the end of the last SRCU grace period for diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst index 09aa2e949787..463cf7e73db8 100644 --- a/Documentation/admin-guide/pm/cpufreq.rst +++ b/Documentation/admin-guide/pm/cpufreq.rst @@ -269,16 +269,16 @@ are the following: ``scaling_cur_freq`` Current frequency of all of the CPUs belonging to this policy (in kHz). - For the majority of scaling drivers, this is the frequency of the last - P-state requested by the driver from the hardware using the scaling + In the majority of cases, this is the frequency of the last P-state + requested by the scaling driver from the hardware using the scaling interface provided by it, which may or may not reflect the frequency the CPU is actually running at (due to hardware design and other limitations). - Some scaling drivers (e.g. |intel_pstate|) attempt to provide - information more precisely reflecting the current CPU frequency through - this attribute, but that still may not be the exact current CPU - frequency as seen by the hardware at the moment. + Some architectures (e.g. ``x86``) may attempt to provide information + more precisely reflecting the current CPU frequency through this + attribute, but that still may not be the exact current CPU frequency as + seen by the hardware at the moment. ``scaling_driver`` The scaling driver currently in use. diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst index 33d703989ea8..1d6249825efc 100644 --- a/Documentation/admin-guide/pm/intel_pstate.rst +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -157,10 +157,8 @@ Without HWP, this P-state selection algorithm is always the same regardless of the processor model and platform configuration. It selects the maximum P-state it is allowed to use, subject to limits set via -``sysfs``, every time the P-state selection computations are carried out by the -driver's utilization update callback for the given CPU (that does not happen -more often than every 10 ms), but the hardware configuration will not be changed -if the new P-state is the same as the current one. +``sysfs``, every time the driver configuration for the given CPU is updated +(e.g. via ``sysfs``). This is the default P-state selection algorithm if the :c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option diff --git a/Documentation/admin-guide/ras.rst b/Documentation/admin-guide/ras.rst index 8c7bbf2c88d2..197896718f81 100644 --- a/Documentation/admin-guide/ras.rst +++ b/Documentation/admin-guide/ras.rst @@ -344,9 +344,9 @@ for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory controllers. The following example will assume 2 channels: +------------+-----------------------+ - | Chip | Channels | - | Select +-----------+-----------+ - | rows | ``ch0`` | ``ch1`` | + | CS Rows | Channels | + +------------+-----------+-----------+ + | | ``ch0`` | ``ch1`` | +============+===========+===========+ | ``csrow0`` | DIMM_A0 | DIMM_B0 | +------------+ | | @@ -698,7 +698,7 @@ information indicating that errors have been detected:: The structure of the message is: +---------------------------------------+-------------+ - | Content + Example | + | Content | Example | +=======================================+=============+ | The memory controller | MC0 | +---------------------------------------+-------------+ @@ -713,7 +713,7 @@ The structure of the message is: +---------------------------------------+-------------+ | The error syndrome | 0xb741 | +---------------------------------------+-------------+ - | Memory row | row 0 + + | Memory row | row 0 | +---------------------------------------+-------------+ | Memory channel | channel 1 | +---------------------------------------+-------------+ diff --git a/Documentation/admin-guide/thunderbolt.rst b/Documentation/admin-guide/thunderbolt.rst new file mode 100644 index 000000000000..6a4cd1f159ca --- /dev/null +++ b/Documentation/admin-guide/thunderbolt.rst @@ -0,0 +1,199 @@ +============= + Thunderbolt +============= +The interface presented here is not meant for end users. Instead there +should be a userspace tool that handles all the low-level details, keeps +database of the authorized devices and prompts user for new connections. + +More details about the sysfs interface for Thunderbolt devices can be +found in ``Documentation/ABI/testing/sysfs-bus-thunderbolt``. + +Those users who just want to connect any device without any sort of +manual work, can add following line to +``/etc/udev/rules.d/99-local.rules``:: + + ACTION=="add", SUBSYSTEM=="thunderbolt", ATTR{authorized}=="0", ATTR{authorized}="1" + +This will authorize all devices automatically when they appear. However, +keep in mind that this bypasses the security levels and makes the system +vulnerable to DMA attacks. + +Security levels and how to use them +----------------------------------- +Starting from Intel Falcon Ridge Thunderbolt controller there are 4 +security levels available. The reason for these is the fact that the +connected devices can be DMA masters and thus read contents of the host +memory without CPU and OS knowing about it. There are ways to prevent +this by setting up an IOMMU but it is not always available for various +reasons. + +The security levels are as follows: + + none + All devices are automatically connected by the firmware. No user + approval is needed. In BIOS settings this is typically called + *Legacy mode*. + + user + User is asked whether the device is allowed to be connected. + Based on the device identification information available through + ``/sys/bus/thunderbolt/devices``. user then can do the decision. + In BIOS settings this is typically called *Unique ID*. + + secure + User is asked whether the device is allowed to be connected. In + addition to UUID the device (if it supports secure connect) is sent + a challenge that should match the expected one based on a random key + written to ``key`` sysfs attribute. In BIOS settings this is + typically called *One time saved key*. + + dponly + The firmware automatically creates tunnels for Display Port and + USB. No PCIe tunneling is done. In BIOS settings this is + typically called *Display Port Only*. + +The current security level can be read from +``/sys/bus/thunderbolt/devices/domainX/security`` where ``domainX`` is +the Thunderbolt domain the host controller manages. There is typically +one domain per Thunderbolt host controller. + +If the security level reads as ``user`` or ``secure`` the connected +device must be authorized by the user before PCIe tunnels are created +(e.g the PCIe device appears). + +Each Thunderbolt device plugged in will appear in sysfs under +``/sys/bus/thunderbolt/devices``. The device directory carries +information that can be used to identify the particular device, +including its name and UUID. + +Authorizing devices when security level is ``user`` or ``secure`` +----------------------------------------------------------------- +When a device is plugged in it will appear in sysfs as follows:: + + /sys/bus/thunderbolt/devices/0-1/authorized - 0 + /sys/bus/thunderbolt/devices/0-1/device - 0x8004 + /sys/bus/thunderbolt/devices/0-1/device_name - Thunderbolt to FireWire Adapter + /sys/bus/thunderbolt/devices/0-1/vendor - 0x1 + /sys/bus/thunderbolt/devices/0-1/vendor_name - Apple, Inc. + /sys/bus/thunderbolt/devices/0-1/unique_id - e0376f00-0300-0100-ffff-ffffffffffff + +The ``authorized`` attribute reads 0 which means no PCIe tunnels are +created yet. The user can authorize the device by simply:: + + # echo 1 > /sys/bus/thunderbolt/devices/0-1/authorized + +This will create the PCIe tunnels and the device is now connected. + +If the device supports secure connect, and the domain security level is +set to ``secure``, it has an additional attribute ``key`` which can hold +a random 32 byte value used for authorization and challenging the device in +future connects:: + + /sys/bus/thunderbolt/devices/0-3/authorized - 0 + /sys/bus/thunderbolt/devices/0-3/device - 0x305 + /sys/bus/thunderbolt/devices/0-3/device_name - AKiTiO Thunder3 PCIe Box + /sys/bus/thunderbolt/devices/0-3/key - + /sys/bus/thunderbolt/devices/0-3/vendor - 0x41 + /sys/bus/thunderbolt/devices/0-3/vendor_name - inXtron + /sys/bus/thunderbolt/devices/0-3/unique_id - dc010000-0000-8508-a22d-32ca6421cb16 + +Notice the key is empty by default. + +If the user does not want to use secure connect it can just ``echo 1`` +to the ``authorized`` attribute and the PCIe tunnels will be created in +the same way than in ``user`` security level. + +If the user wants to use secure connect, the first time the device is +plugged a key needs to be created and send to the device:: + + # key=$(openssl rand -hex 32) + # echo $key > /sys/bus/thunderbolt/devices/0-3/key + # echo 1 > /sys/bus/thunderbolt/devices/0-3/authorized + +Now the device is connected (PCIe tunnels are created) and in addition +the key is stored on the device NVM. + +Next time the device is plugged in the user can verify (challenge) the +device using the same key:: + + # echo $key > /sys/bus/thunderbolt/devices/0-3/key + # echo 2 > /sys/bus/thunderbolt/devices/0-3/authorized + +If the challenge the device returns back matches the one we expect based +on the key, the device is connected and the PCIe tunnels are created. +However, if the challenge failed no tunnels are created and error is +returned to the user. + +If the user still wants to connect the device it can either approve +the device without a key or write new key and write 1 to the +``authorized`` file to get the new key stored on the device NVM. + +Upgrading NVM on Thunderbolt device or host +------------------------------------------- +Since most of the functionality is handled in a firmware running on a +host controller or a device, it is important that the firmware can be +upgraded to the latest where possible bugs in it have been fixed. +Typically OEMs provide this firmware from their support site. + +There is also a central site which has links where to download firmwares +for some machines: + + `Thunderbolt Updates <https://thunderbolttechnology.net/updates>`_ + +Before you upgrade firmware on a device or host, please make sure it is +the suitable. Failing to do that may render the device (or host) in a +state where it cannot be used properly anymore without special tools! + +Host NVM upgrade on Apple Macs is not supported. + +Once the NVM image has been downloaded, you need to plug in a +Thunderbolt device so that the host controller appears. It does not +matter which device is connected (unless you are upgrading NVM on a +device - then you need to connect that particular device). + +Note OEM-specific method to power the controller up ("force power") may +be available for your system in which case there is no need to plug in a +Thunderbolt device. + +After that we can write the firmware to the non-active parts of the NVM +of the host or device. As an example here is how Intel NUC6i7KYK (Skull +Canyon) Thunderbolt controller NVM is upgraded:: + + # dd if=KYK_TBT_FW_0018.bin of=/sys/bus/thunderbolt/devices/0-0/nvm_non_active0/nvmem + +Once the operation completes we can trigger NVM authentication and +upgrade process as follows:: + + # echo 1 > /sys/bus/thunderbolt/devices/0-0/nvm_authenticate + +If no errors are returned, the host controller shortly disappears. Once +it comes back the driver notices it and initiates a full power cycle. +After a while the host controller appears again and this time it should +be fully functional. + +We can verify that the new NVM firmware is active by running following +commands:: + + # cat /sys/bus/thunderbolt/devices/0-0/nvm_authenticate + 0x0 + # cat /sys/bus/thunderbolt/devices/0-0/nvm_version + 18.0 + +If ``nvm_authenticate`` contains anything else than 0x0 it is the error +code from the last authentication cycle, which means the authentication +of the NVM image failed. + +Note names of the NVMem devices ``nvm_activeN`` and ``nvm_non_activeN`` +depends on the order they are registered in the NVMem subsystem. N in +the name is the identifier added by the NVMem subsystem. + +Upgrading NVM when host controller is in safe mode +-------------------------------------------------- +If the existing NVM is not properly authenticated (or is missing) the +host controller goes into safe mode which means that only available +functionality is flashing new NVM image. When in this mode the reading +``nvm_version`` fails with ``ENODATA`` and the device identification +information is missing. + +To recover from this mode, one needs to flash a valid NVM image to the +host host controller in the same way it is done in the previous chapter. diff --git a/Documentation/arm/Atmel/README b/Documentation/arm/Atmel/README index 6ca78f818dbf..afb13c15389d 100644 --- a/Documentation/arm/Atmel/README +++ b/Documentation/arm/Atmel/README @@ -16,7 +16,7 @@ git branches/tags and email subject always contain this "at91" sub-string. AT91 SoCs --------- -Documentation and detailled datasheet for each product are available on +Documentation and detailed datasheet for each product are available on the Atmel website: http://www.atmel.com. Flavors: @@ -101,6 +101,42 @@ the Atmel website: http://www.atmel.com. + Datasheet http://www.atmel.com/Images/Atmel-11267-32-bit-Cortex-A5-Microcontroller-SAMA5D2_Datasheet.pdf + * ARM Cortex-M7 MCUs + - sams70 family + - sams70j19 + - sams70j20 + - sams70j21 + - sams70n19 + - sams70n20 + - sams70n21 + - sams70q19 + - sams70q20 + - sams70q21 + + Datasheet + http://www.atmel.com/Images/Atmel-11242-32-bit-Cortex-M7-Microcontroller-SAM-S70Q-SAM-S70N-SAM-S70J_Datasheet.pdf + + - samv70 family + - samv70j19 + - samv70j20 + - samv70n19 + - samv70n20 + - samv70q19 + - samv70q20 + + Datasheet + http://www.atmel.com/Images/Atmel-11297-32-bit-Cortex-M7-Microcontroller-SAM-V70Q-SAM-V70N-SAM-V70J_Datasheet.pdf + + - samv71 family + - samv71j19 + - samv71j20 + - samv71j21 + - samv71n19 + - samv71n20 + - samv71n21 + - samv71q19 + - samv71q20 + - samv71q21 + + Datasheet + http://www.atmel.com/Images/Atmel-44003-32-bit-Cortex-M7-Microcontroller-SAM-V71Q-SAM-V71N-SAM-V71J_Datasheet.pdf Linux kernel information ------------------------ diff --git a/Documentation/arm64/silicon-errata.txt b/Documentation/arm64/silicon-errata.txt index 10f2dddbf449..66e8ce14d23d 100644 --- a/Documentation/arm64/silicon-errata.txt +++ b/Documentation/arm64/silicon-errata.txt @@ -61,11 +61,15 @@ stable kernels. | Cavium | ThunderX ITS | #23144 | CAVIUM_ERRATUM_23144 | | Cavium | ThunderX GICv3 | #23154 | CAVIUM_ERRATUM_23154 | | Cavium | ThunderX Core | #27456 | CAVIUM_ERRATUM_27456 | +| Cavium | ThunderX Core | #30115 | CAVIUM_ERRATUM_30115 | | Cavium | ThunderX SMMUv2 | #27704 | N/A | +| Cavium | ThunderX2 SMMUv3| #74 | N/A | +| Cavium | ThunderX2 SMMUv3| #126 | N/A | | | | | | | Freescale/NXP | LS2080A/LS1043A | A-008585 | FSL_ERRATUM_A008585 | | | | | | | Hisilicon | Hip0{5,6,7} | #161010101 | HISILICON_ERRATUM_161010101 | +| Hisilicon | Hip0{6,7} | #161010701 | N/A | | | | | | | Qualcomm Tech. | Falkor v1 | E1003 | QCOM_FALKOR_ERRATUM_1003 | | Qualcomm Tech. | Falkor v1 | E1009 | QCOM_FALKOR_ERRATUM_1009 | diff --git a/Documentation/bcache.txt b/Documentation/bcache.txt index a9259b562d5c..c0ce64d75bbf 100644 --- a/Documentation/bcache.txt +++ b/Documentation/bcache.txt @@ -1,10 +1,15 @@ +============================ +A block layer cache (bcache) +============================ + Say you've got a big slow raid 6, and an ssd or three. Wouldn't it be nice if you could use them as cache... Hence bcache. Wiki and git repositories are at: - http://bcache.evilpiepirate.org - http://evilpiepirate.org/git/linux-bcache.git - http://evilpiepirate.org/git/bcache-tools.git + + - http://bcache.evilpiepirate.org + - http://evilpiepirate.org/git/linux-bcache.git + - http://evilpiepirate.org/git/bcache-tools.git It's designed around the performance characteristics of SSDs - it only allocates in erase block sized buckets, and it uses a hybrid btree/log to track cached @@ -37,17 +42,19 @@ to be flushed. Getting started: You'll need make-bcache from the bcache-tools repository. Both the cache device -and backing device must be formatted before use. +and backing device must be formatted before use:: + make-bcache -B /dev/sdb make-bcache -C /dev/sdc make-bcache has the ability to format multiple devices at the same time - if you format your backing devices and cache device at the same time, you won't -have to manually attach: +have to manually attach:: + make-bcache -B /dev/sda /dev/sdb -C /dev/sdc bcache-tools now ships udev rules, and bcache devices are known to the kernel -immediately. Without udev, you can manually register devices like this: +immediately. Without udev, you can manually register devices like this:: echo /dev/sdb > /sys/fs/bcache/register echo /dev/sdc > /sys/fs/bcache/register @@ -60,16 +67,16 @@ slow devices as bcache backing devices without a cache, and you can choose to ad a caching device later. See 'ATTACHING' section below. -The devices show up as: +The devices show up as:: /dev/bcache<N> -As well as (with udev): +As well as (with udev):: /dev/bcache/by-uuid/<uuid> /dev/bcache/by-label/<label> -To get started: +To get started:: mkfs.ext4 /dev/bcache0 mount /dev/bcache0 /mnt @@ -81,13 +88,13 @@ Cache devices are managed as sets; multiple caches per set isn't supported yet but will allow for mirroring of metadata and dirty data in the future. Your new cache set shows up as /sys/fs/bcache/<UUID> -ATTACHING +Attaching --------- After your cache device and backing device are registered, the backing device must be attached to your cache set to enable caching. Attaching a backing device to a cache set is done thusly, with the UUID of the cache set in -/sys/fs/bcache: +/sys/fs/bcache:: echo <CSET-UUID> > /sys/block/bcache0/bcache/attach @@ -97,7 +104,7 @@ your bcache devices. If a backing device has data in a cache somewhere, the important if you have writeback caching turned on. If you're booting up and your cache device is gone and never coming back, you -can force run the backing device: +can force run the backing device:: echo 1 > /sys/block/sdb/bcache/running @@ -110,7 +117,7 @@ but all the cached data will be invalidated. If there was dirty data in the cache, don't expect the filesystem to be recoverable - you will have massive filesystem corruption, though ext4's fsck does work miracles. -ERROR HANDLING +Error Handling -------------- Bcache tries to transparently handle IO errors to/from the cache device without @@ -134,25 +141,27 @@ the backing devices to passthrough mode. read some of the dirty data, though. -HOWTO/COOKBOOK +Howto/cookbook -------------- A) Starting a bcache with a missing caching device If registering the backing device doesn't help, it's already there, you just need -to force it to run without the cache: +to force it to run without the cache:: + host:~# echo /dev/sdb1 > /sys/fs/bcache/register [ 119.844831] bcache: register_bcache() error opening /dev/sdb1: device already registered Next, you try to register your caching device if it's present. However if it's absent, or registration fails for some reason, you can still -start your bcache without its cache, like so: +start your bcache without its cache, like so:: + host:/sys/block/sdb/sdb1/bcache# echo 1 > running Note that this may cause data loss if you were running in writeback mode. -B) Bcache does not find its cache +B) Bcache does not find its cache:: host:/sys/block/md5/bcache# echo 0226553a-37cf-41d5-b3ce-8b1e944543a8 > attach [ 1933.455082] bcache: bch_cached_dev_attach() Couldn't find uuid for md5 in set @@ -160,7 +169,8 @@ B) Bcache does not find its cache [ 1933.478179] : cache set not found In this case, the caching device was simply not registered at boot -or disappeared and came back, and needs to be (re-)registered: +or disappeared and came back, and needs to be (re-)registered:: + host:/sys/block/md5/bcache# echo /dev/sdh2 > /sys/fs/bcache/register @@ -180,7 +190,8 @@ device is still available at an 8KiB offset. So either via a loopdev of the backing device created with --offset 8K, or any value defined by --data-offset when you originally formatted bcache with `make-bcache`. -For example: +For example:: + losetup -o 8192 /dev/loop0 /dev/your_bcache_backing_dev This should present your unmodified backing device data in /dev/loop0 @@ -191,33 +202,38 @@ cache device without loosing data. E) Wiping a cache device -host:~# wipefs -a /dev/sdh2 -16 bytes were erased at offset 0x1018 (bcache) -they were: c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81 +:: + + host:~# wipefs -a /dev/sdh2 + 16 bytes were erased at offset 0x1018 (bcache) + they were: c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81 + +After you boot back with bcache enabled, you recreate the cache and attach it:: -After you boot back with bcache enabled, you recreate the cache and attach it: -host:~# make-bcache -C /dev/sdh2 -UUID: 7be7e175-8f4c-4f99-94b2-9c904d227045 -Set UUID: 5bc072a8-ab17-446d-9744-e247949913c1 -version: 0 -nbuckets: 106874 -block_size: 1 -bucket_size: 1024 -nr_in_set: 1 -nr_this_dev: 0 -first_bucket: 1 -[ 650.511912] bcache: run_cache_set() invalidating existing data -[ 650.549228] bcache: register_cache() registered cache device sdh2 + host:~# make-bcache -C /dev/sdh2 + UUID: 7be7e175-8f4c-4f99-94b2-9c904d227045 + Set UUID: 5bc072a8-ab17-446d-9744-e247949913c1 + version: 0 + nbuckets: 106874 + block_size: 1 + bucket_size: 1024 + nr_in_set: 1 + nr_this_dev: 0 + first_bucket: 1 + [ 650.511912] bcache: run_cache_set() invalidating existing data + [ 650.549228] bcache: register_cache() registered cache device sdh2 -start backing device with missing cache: -host:/sys/block/md5/bcache# echo 1 > running +start backing device with missing cache:: -attach new cache: -host:/sys/block/md5/bcache# echo 5bc072a8-ab17-446d-9744-e247949913c1 > attach -[ 865.276616] bcache: bch_cached_dev_attach() Caching md5 as bcache0 on set 5bc072a8-ab17-446d-9744-e247949913c1 + host:/sys/block/md5/bcache# echo 1 > running +attach new cache:: -F) Remove or replace a caching device + host:/sys/block/md5/bcache# echo 5bc072a8-ab17-446d-9744-e247949913c1 > attach + [ 865.276616] bcache: bch_cached_dev_attach() Caching md5 as bcache0 on set 5bc072a8-ab17-446d-9744-e247949913c1 + + +F) Remove or replace a caching device:: host:/sys/block/sda/sda7/bcache# echo 1 > detach [ 695.872542] bcache: cached_dev_detach_finish() Caching disabled for sda7 @@ -226,13 +242,15 @@ F) Remove or replace a caching device wipefs: error: /dev/nvme0n1p4: probing initialization failed: Device or resource busy Ooops, it's disabled, but not unregistered, so it's still protected -We need to go and unregister it: +We need to go and unregister it:: + host:/sys/fs/bcache/b7ba27a1-2398-4649-8ae3-0959f57ba128# ls -l cache0 lrwxrwxrwx 1 root root 0 Feb 25 18:33 cache0 -> ../../../devices/pci0000:00/0000:00:1d.0/0000:70:00.0/nvme/nvme0/nvme0n1/nvme0n1p4/bcache/ host:/sys/fs/bcache/b7ba27a1-2398-4649-8ae3-0959f57ba128# echo 1 > stop kernel: [ 917.041908] bcache: cache_set_free() Cache set b7ba27a1-2398-4649-8ae3-0959f57ba128 unregistered -Now we can wipe it: +Now we can wipe it:: + host:~# wipefs -a /dev/nvme0n1p4 /dev/nvme0n1p4: 16 bytes were erased at offset 0x00001018 (bcache): c6 85 73 f6 4e 1a 45 ca 82 65 f5 7f 48 ba 6d 81 @@ -252,40 +270,44 @@ if there are any active backing or caching devices left on it: 1) Is it present in /dev/bcache* ? (there are times where it won't be) -If so, it's easy: + If so, it's easy:: + host:/sys/block/bcache0/bcache# echo 1 > stop -2) But if your backing device is gone, this won't work: +2) But if your backing device is gone, this won't work:: + host:/sys/block/bcache0# cd bcache bash: cd: bcache: No such file or directory -In this case, you may have to unregister the dmcrypt block device that -references this bcache to free it up: + In this case, you may have to unregister the dmcrypt block device that + references this bcache to free it up:: + host:~# dmsetup remove oldds1 bcache: bcache_device_free() bcache0 stopped bcache: cache_set_free() Cache set 5bc072a8-ab17-446d-9744-e247949913c1 unregistered -This causes the backing bcache to be removed from /sys/fs/bcache and -then it can be reused. This would be true of any block device stacking -where bcache is a lower device. + This causes the backing bcache to be removed from /sys/fs/bcache and + then it can be reused. This would be true of any block device stacking + where bcache is a lower device. + +3) In other cases, you can also look in /sys/fs/bcache/:: -3) In other cases, you can also look in /sys/fs/bcache/: + host:/sys/fs/bcache# ls -l */{cache?,bdev?} + lrwxrwxrwx 1 root root 0 Mar 5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/bdev1 -> ../../../devices/virtual/block/dm-1/bcache/ + lrwxrwxrwx 1 root root 0 Mar 5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/cache0 -> ../../../devices/virtual/block/dm-4/bcache/ + lrwxrwxrwx 1 root root 0 Mar 5 09:39 5bc072a8-ab17-446d-9744-e247949913c1/cache0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/ata10/host9/target9:0:0/9:0:0:0/block/sdl/sdl2/bcache/ -host:/sys/fs/bcache# ls -l */{cache?,bdev?} -lrwxrwxrwx 1 root root 0 Mar 5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/bdev1 -> ../../../devices/virtual/block/dm-1/bcache/ -lrwxrwxrwx 1 root root 0 Mar 5 09:39 0226553a-37cf-41d5-b3ce-8b1e944543a8/cache0 -> ../../../devices/virtual/block/dm-4/bcache/ -lrwxrwxrwx 1 root root 0 Mar 5 09:39 5bc072a8-ab17-446d-9744-e247949913c1/cache0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0/ata10/host9/target9:0:0/9:0:0:0/block/sdl/sdl2/bcache/ + The device names will show which UUID is relevant, cd in that directory + and stop the cache:: -The device names will show which UUID is relevant, cd in that directory -and stop the cache: host:/sys/fs/bcache/5bc072a8-ab17-446d-9744-e247949913c1# echo 1 > stop -This will free up bcache references and let you reuse the partition for -other purposes. + This will free up bcache references and let you reuse the partition for + other purposes. -TROUBLESHOOTING PERFORMANCE +Troubleshooting performance --------------------------- Bcache has a bunch of config options and tunables. The defaults are intended to @@ -301,11 +323,13 @@ want for getting the best possible numbers when benchmarking. raid stripe size to get the disk multiples that you would like. For example: If you have a 64k stripe size, then the following offset - would provide alignment for many common RAID5 data spindle counts: + would provide alignment for many common RAID5 data spindle counts:: + 64k * 2*2*2*3*3*5*7 bytes = 161280k That space is wasted, but for only 157.5MB you can grow your RAID 5 - volume to the following data-spindle counts without re-aligning: + volume to the following data-spindle counts without re-aligning:: + 3,4,5,6,7,8,9,10,12,14,15,18,20,21 ... - Bad write performance @@ -313,9 +337,9 @@ want for getting the best possible numbers when benchmarking. If write performance is not what you expected, you probably wanted to be running in writeback mode, which isn't the default (not due to a lack of maturity, but simply because in writeback mode you'll lose data if something - happens to your SSD) + happens to your SSD):: - # echo writeback > /sys/block/bcache0/bcache/cache_mode + # echo writeback > /sys/block/bcache0/bcache/cache_mode - Bad performance, or traffic not going to the SSD that you'd expect @@ -325,13 +349,13 @@ want for getting the best possible numbers when benchmarking. accessed data out of your cache. But if you want to benchmark reads from cache, and you start out with fio - writing an 8 gigabyte test file - so you want to disable that. + writing an 8 gigabyte test file - so you want to disable that:: - # echo 0 > /sys/block/bcache0/bcache/sequential_cutoff + # echo 0 > /sys/block/bcache0/bcache/sequential_cutoff - To set it back to the default (4 mb), do + To set it back to the default (4 mb), do:: - # echo 4M > /sys/block/bcache0/bcache/sequential_cutoff + # echo 4M > /sys/block/bcache0/bcache/sequential_cutoff - Traffic's still going to the spindle/still getting cache misses @@ -344,10 +368,10 @@ want for getting the best possible numbers when benchmarking. throttles traffic if the latency exceeds a threshold (it does this by cranking down the sequential bypass). - You can disable this if you need to by setting the thresholds to 0: + You can disable this if you need to by setting the thresholds to 0:: - # echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us - # echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us + # echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us + # echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us The default is 2000 us (2 milliseconds) for reads, and 20000 for writes. @@ -369,7 +393,7 @@ want for getting the best possible numbers when benchmarking. a fix for the issue there). -SYSFS - BACKING DEVICE +Sysfs - backing device ---------------------- Available at /sys/block/<bdev>/bcache, /sys/block/bcache*/bcache and @@ -454,7 +478,8 @@ writeback_running still be added to the cache until it is mostly full; only meant for benchmarking. Defaults to on. -SYSFS - BACKING DEVICE STATS: +Sysfs - backing device stats +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ There are directories with these numbers for a running total, as well as versions that decay over the past day, hour and 5 minutes; they're also @@ -463,14 +488,11 @@ aggregated in the cache set directory as well. bypassed Amount of IO (both reads and writes) that has bypassed the cache -cache_hits -cache_misses -cache_hit_ratio +cache_hits, cache_misses, cache_hit_ratio Hits and misses are counted per individual IO as bcache sees them; a partial hit is counted as a miss. -cache_bypass_hits -cache_bypass_misses +cache_bypass_hits, cache_bypass_misses Hits and misses for IO that is intended to skip the cache are still counted, but broken out here. @@ -482,7 +504,8 @@ cache_miss_collisions cache_readaheads Count of times readahead occurred. -SYSFS - CACHE SET: +Sysfs - cache set +~~~~~~~~~~~~~~~~~ Available at /sys/fs/bcache/<cset-uuid> @@ -520,8 +543,7 @@ flash_vol_create Echoing a size to this file (in human readable units, k/M/G) creates a thinly provisioned volume backed by the cache set. -io_error_halflife -io_error_limit +io_error_halflife, io_error_limit These determines how many errors we accept before disabling the cache. Each error is decayed by the half life (in # ios). If the decaying count reaches io_error_limit dirty data is written out and the cache is disabled. @@ -545,7 +567,8 @@ unregister Detaches all backing devices and closes the cache devices; if dirty data is present it will disable writeback caching and wait for it to be flushed. -SYSFS - CACHE SET INTERNAL: +Sysfs - cache set internal +~~~~~~~~~~~~~~~~~~~~~~~~~~ This directory also exposes timings for a number of internal operations, with separate files for average duration, average frequency, last occurrence and max @@ -574,7 +597,8 @@ cache_read_races trigger_gc Writing to this file forces garbage collection to run. -SYSFS - CACHE DEVICE: +Sysfs - Cache device +~~~~~~~~~~~~~~~~~~~~ Available at /sys/block/<cdev>/bcache diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt index 01ddeaf64b0f..9490f2845f06 100644 --- a/Documentation/block/biodoc.txt +++ b/Documentation/block/biodoc.txt @@ -632,7 +632,7 @@ to i/o submission, if the bio fields are likely to be accessed after the i/o is issued (since the bio may otherwise get freed in case i/o completion happens in the meantime). -The bio_clone() routine may be used to duplicate a bio, where the clone +The bio_clone_fast() routine may be used to duplicate a bio, where the clone shares the bio_vec_list with the original bio (i.e. both point to the same bio_vec_list). This would typically be used for splitting i/o requests in lvm or md. diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt index f56ec97f0d14..934c44ea0c57 100644 --- a/Documentation/block/data-integrity.txt +++ b/Documentation/block/data-integrity.txt @@ -192,7 +192,7 @@ will require extra work due to the application tag. supported by the block device. - int bio_integrity_prep(bio); + bool bio_integrity_prep(bio); To generate IMD for WRITE and to set up buffers for READ, the filesystem must call bio_integrity_prep(bio). @@ -201,9 +201,7 @@ will require extra work due to the application tag. sector must be set, and the bio should have all data pages added. It is up to the caller to ensure that the bio does not change while I/O is in progress. - - bio_integrity_prep() should only be called if - bio_integrity_enabled() returned 1. + Complete bio with error if prepare failed for some reson. 5.3 PASSING EXISTING INTEGRITY METADATA diff --git a/Documentation/bt8xxgpio.txt b/Documentation/bt8xxgpio.txt index d8297e4ebd26..a845feb074de 100644 --- a/Documentation/bt8xxgpio.txt +++ b/Documentation/bt8xxgpio.txt @@ -1,12 +1,8 @@ -=============================================================== -== BT8XXGPIO driver == -== == -== A driver for a selfmade cheap BT8xx based PCI GPIO-card == -== == -== For advanced documentation, see == -== http://www.bu3sch.de/btgpio.php == -=============================================================== +=================================================================== +A driver for a selfmade cheap BT8xx based PCI GPIO-card (bt8xxgpio) +=================================================================== +For advanced documentation, see http://www.bu3sch.de/btgpio.php A generic digital 24-port PCI GPIO card can be built out of an ordinary Brooktree bt848, bt849, bt878 or bt879 based analog TV tuner card. The @@ -17,9 +13,8 @@ The bt8xx chip does have 24 digital GPIO ports. These ports are accessible via 24 pins on the SMD chip package. -============================================== -== How to physically access the GPIO pins == -============================================== +How to physically access the GPIO pins +====================================== The are several ways to access these pins. One might unsolder the whole chip and put it on a custom PCI board, or one might only unsolder each individual @@ -27,7 +22,7 @@ GPIO pin and solder that to some tiny wire. As the chip package really is tiny there are some advanced soldering skills needed in any case. The physical pinouts are drawn in the following ASCII art. -The GPIO pins are marked with G00-G23 +The GPIO pins are marked with G00-G23:: G G G G G G G G G G G G G G G G G G 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 diff --git a/Documentation/btmrvl.txt b/Documentation/btmrvl.txt index 34916a46c099..ec57740ead0c 100644 --- a/Documentation/btmrvl.txt +++ b/Documentation/btmrvl.txt @@ -1,18 +1,16 @@ -======================================================================= - README for btmrvl driver -======================================================================= - +============= +btmrvl driver +============= All commands are used via debugfs interface. -===================== -Set/get driver configurations: +Set/get driver configurations +============================= Path: /debug/btmrvl/config/ -gpiogap=[n] -hscfgcmd - These commands are used to configure the host sleep parameters. +gpiogap=[n], hscfgcmd + These commands are used to configure the host sleep parameters:: bit 8:0 -- Gap bit 16:8 -- GPIO @@ -23,7 +21,8 @@ hscfgcmd where Gap is the gap in milli seconds between wakeup signal and wakeup event, or 0xff for special host sleep setting. - Usage: + Usage:: + # Use SDIO interface to wake up the host and set GAP to 0x80: echo 0xff80 > /debug/btmrvl/config/gpiogap echo 1 > /debug/btmrvl/config/hscfgcmd @@ -32,15 +31,16 @@ hscfgcmd echo 0x03ff > /debug/btmrvl/config/gpiogap echo 1 > /debug/btmrvl/config/hscfgcmd -psmode=[n] -pscmd +psmode=[n], pscmd These commands are used to enable/disable auto sleep mode - where the option is: + where the option is:: + 1 -- Enable auto sleep mode 0 -- Disable auto sleep mode - Usage: + Usage:: + # Enable auto sleep mode echo 1 > /debug/btmrvl/config/psmode echo 1 > /debug/btmrvl/config/pscmd @@ -50,15 +50,16 @@ pscmd echo 1 > /debug/btmrvl/config/pscmd -hsmode=[n] -hscmd +hsmode=[n], hscmd These commands are used to enable host sleep or wake up firmware - where the option is: + where the option is:: + 1 -- Enable host sleep 0 -- Wake up firmware - Usage: + Usage:: + # Enable host sleep echo 1 > /debug/btmrvl/config/hsmode echo 1 > /debug/btmrvl/config/hscmd @@ -68,12 +69,13 @@ hscmd echo 1 > /debug/btmrvl/config/hscmd -====================== -Get driver status: +Get driver status +================= Path: /debug/btmrvl/status/ -Usage: +Usage:: + cat /debug/btmrvl/status/<args> where the args are: @@ -90,14 +92,17 @@ hsstate txdnldrdy This command displays the value of Tx download ready flag. - -===================== +Issuing a raw hci command +========================= Use hcitool to issue raw hci command, refer to hcitool manual - Usage: Hcitool cmd <ogf> <ocf> [Parameters] +Usage:: + + Hcitool cmd <ogf> <ocf> [Parameters] + +Interface Control Command:: - Interface Control Command hcitool cmd 0x3f 0x5b 0xf5 0x01 0x00 --Enable All interface hcitool cmd 0x3f 0x5b 0xf5 0x01 0x01 --Enable Wlan interface hcitool cmd 0x3f 0x5b 0xf5 0x01 0x02 --Enable BT interface @@ -105,13 +110,13 @@ Use hcitool to issue raw hci command, refer to hcitool manual hcitool cmd 0x3f 0x5b 0xf5 0x00 0x01 --Disable Wlan interface hcitool cmd 0x3f 0x5b 0xf5 0x00 0x02 --Disable BT interface -======================================================================= - +SD8688 firmware +=============== -SD8688 firmware: +Images: -/lib/firmware/sd8688_helper.bin -/lib/firmware/sd8688.bin +- /lib/firmware/sd8688_helper.bin +- /lib/firmware/sd8688.bin The images can be downloaded from: diff --git a/Documentation/bus-virt-phys-mapping.txt b/Documentation/bus-virt-phys-mapping.txt index 2bc55ff3b4d1..4bb07c2f3e7d 100644 --- a/Documentation/bus-virt-phys-mapping.txt +++ b/Documentation/bus-virt-phys-mapping.txt @@ -1,17 +1,27 @@ -[ NOTE: The virt_to_bus() and bus_to_virt() functions have been +========================================================== +How to access I/O mapped memory from within device drivers +========================================================== + +:Author: Linus + +.. warning:: + + The virt_to_bus() and bus_to_virt() functions have been superseded by the functionality provided by the PCI DMA interface (see Documentation/DMA-API-HOWTO.txt). They continue to be documented below for historical purposes, but new code - must not use them. --davidm 00/12/12 ] + must not use them. --davidm 00/12/12 -[ This is a mail message in response to a query on IO mapping, thus the - strange format for a "document" ] +:: + + [ This is a mail message in response to a query on IO mapping, thus the + strange format for a "document" ] The AHA-1542 is a bus-master device, and your patch makes the driver give the controller the physical address of the buffers, which is correct on x86 (because all bus master devices see the physical memory mappings directly). -However, on many setups, there are actually _three_ different ways of looking +However, on many setups, there are actually **three** different ways of looking at memory addresses, and in this case we actually want the third, the so-called "bus address". @@ -38,7 +48,7 @@ because the memory and the devices share the same address space, and that is not generally necessarily true on other PCI/ISA setups. Now, just as an example, on the PReP (PowerPC Reference Platform), the -CPU sees a memory map something like this (this is from memory): +CPU sees a memory map something like this (this is from memory):: 0-2 GB "real memory" 2 GB-3 GB "system IO" (inb/out and similar accesses on x86) @@ -52,7 +62,7 @@ So when the CPU wants any bus master to write to physical memory 0, it has to give the master address 0x80000000 as the memory address. So, for example, depending on how the kernel is actually mapped on the -PPC, you can end up with a setup like this: +PPC, you can end up with a setup like this:: physical address: 0 virtual address: 0xC0000000 @@ -61,7 +71,7 @@ PPC, you can end up with a setup like this: where all the addresses actually point to the same thing. It's just seen through different translations.. -Similarly, on the Alpha, the normal translation is +Similarly, on the Alpha, the normal translation is:: physical address: 0 virtual address: 0xfffffc0000000000 @@ -70,7 +80,7 @@ Similarly, on the Alpha, the normal translation is (but there are also Alphas where the physical address and the bus address are the same). -Anyway, the way to look up all these translations, you do +Anyway, the way to look up all these translations, you do:: #include <asm/io.h> @@ -81,8 +91,8 @@ Anyway, the way to look up all these translations, you do Now, when do you need these? -You want the _virtual_ address when you are actually going to access that -pointer from the kernel. So you can have something like this: +You want the **virtual** address when you are actually going to access that +pointer from the kernel. So you can have something like this:: /* * this is the hardware "mailbox" we use to communicate with @@ -104,7 +114,7 @@ pointer from the kernel. So you can have something like this: ... on the other hand, you want the bus address when you have a buffer that -you want to give to the controller: +you want to give to the controller:: /* ask the controller to read the sense status into "sense_buffer" */ mbox.bufstart = virt_to_bus(&sense_buffer); @@ -112,7 +122,7 @@ you want to give to the controller: mbox.status = 0; notify_controller(&mbox); -And you generally _never_ want to use the physical address, because you can't +And you generally **never** want to use the physical address, because you can't use that from the CPU (the CPU only uses translated virtual addresses), and you can't use it from the bus master. @@ -124,8 +134,10 @@ be remapped as measured in units of pages, a.k.a. the pfn (the memory management layer doesn't know about devices outside the CPU, so it shouldn't need to know about "bus addresses" etc). -NOTE NOTE NOTE! The above is only one part of the whole equation. The above -only talks about "real memory", that is, CPU memory (RAM). +.. note:: + + The above is only one part of the whole equation. The above + only talks about "real memory", that is, CPU memory (RAM). There is a completely different type of memory too, and that's the "shared memory" on the PCI or ISA bus. That's generally not RAM (although in the case @@ -137,20 +149,22 @@ whatever, and there is only one way to access it: the readb/writeb and related functions. You should never take the address of such memory, because there is really nothing you can do with such an address: it's not conceptually in the same memory space as "real memory" at all, so you cannot -just dereference a pointer. (Sadly, on x86 it _is_ in the same memory space, +just dereference a pointer. (Sadly, on x86 it **is** in the same memory space, so on x86 it actually works to just deference a pointer, but it's not portable). -For such memory, you can do things like +For such memory, you can do things like: + + - reading:: - - reading: /* * read first 32 bits from ISA memory at 0xC0000, aka * C000:0000 in DOS terms */ unsigned int signature = isa_readl(0xC0000); - - remapping and writing: + - remapping and writing:: + /* * remap framebuffer PCI memory area at 0xFC000000, * size 1MB, so that we can access it: We can directly @@ -165,7 +179,8 @@ For such memory, you can do things like /* unmap when we unload the driver */ iounmap(baseptr); - - copying and clearing: + - copying and clearing:: + /* get the 6-byte Ethernet address at ISA address E000:0040 */ memcpy_fromio(kernel_buffer, 0xE0040, 6); /* write a packet to the driver */ @@ -181,10 +196,10 @@ happy that your driver works ;) Note that kernel versions 2.0.x (and earlier) mistakenly called the ioremap() function "vremap()". ioremap() is the proper name, but I didn't think straight when I wrote it originally. People who have to -support both can do something like: +support both can do something like:: /* support old naming silliness */ - #if LINUX_VERSION_CODE < 0x020100 + #if LINUX_VERSION_CODE < 0x020100 #define ioremap vremap #define iounmap vfree #endif @@ -196,13 +211,10 @@ And the above sounds worse than it really is. Most real drivers really don't do all that complex things (or rather: the complexity is not so much in the actual IO accesses as in error handling and timeouts etc). It's generally not hard to fix drivers, and in many cases the code -actually looks better afterwards: +actually looks better afterwards:: unsigned long signature = *(unsigned int *) 0xC0000; vs unsigned long signature = readl(0xC0000); I think the second version actually is more readable, no? - - Linus - diff --git a/Documentation/cachetlb.txt b/Documentation/cachetlb.txt index 3f9f808b5119..6eb9d3f090cd 100644 --- a/Documentation/cachetlb.txt +++ b/Documentation/cachetlb.txt @@ -1,7 +1,8 @@ - Cache and TLB Flushing - Under Linux +================================== +Cache and TLB Flushing Under Linux +================================== - David S. Miller <davem@redhat.com> +:Author: David S. Miller <davem@redhat.com> This document describes the cache/tlb flushing interfaces called by the Linux VM subsystem. It enumerates over each interface, @@ -28,7 +29,7 @@ Therefore when software page table changes occur, the kernel will invoke one of the following flush methods _after_ the page table changes occur: -1) void flush_tlb_all(void) +1) ``void flush_tlb_all(void)`` The most severe flush of all. After this interface runs, any previous page table modification whatsoever will be @@ -37,7 +38,7 @@ changes occur: This is usually invoked when the kernel page tables are changed, since such translations are "global" in nature. -2) void flush_tlb_mm(struct mm_struct *mm) +2) ``void flush_tlb_mm(struct mm_struct *mm)`` This interface flushes an entire user address space from the TLB. After running, this interface must make sure that @@ -49,8 +50,8 @@ changes occur: page table operations such as what happens during fork, and exec. -3) void flush_tlb_range(struct vm_area_struct *vma, - unsigned long start, unsigned long end) +3) ``void flush_tlb_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end)`` Here we are flushing a specific range of (user) virtual address translations from the TLB. After running, this @@ -69,7 +70,7 @@ changes occur: call flush_tlb_page (see below) for each entry which may be modified. -4) void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr) +4) ``void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)`` This time we need to remove the PAGE_SIZE sized translation from the TLB. The 'vma' is the backing structure used by @@ -87,8 +88,8 @@ changes occur: This is used primarily during fault processing. -5) void update_mmu_cache(struct vm_area_struct *vma, - unsigned long address, pte_t *ptep) +5) ``void update_mmu_cache(struct vm_area_struct *vma, + unsigned long address, pte_t *ptep)`` At the end of every page fault, this routine is invoked to tell the architecture specific code that a translation @@ -100,7 +101,7 @@ changes occur: translations for software managed TLB configurations. The sparc64 port currently does this. -6) void tlb_migrate_finish(struct mm_struct *mm) +6) ``void tlb_migrate_finish(struct mm_struct *mm)`` This interface is called at the end of an explicit process migration. This interface provides a hook @@ -112,7 +113,7 @@ changes occur: Next, we have the cache flushing interfaces. In general, when Linux is changing an existing virtual-->physical mapping to a new value, -the sequence will be in one of the following forms: +the sequence will be in one of the following forms:: 1) flush_cache_mm(mm); change_all_page_tables_of(mm); @@ -143,7 +144,7 @@ and have no dependency on translation information. Here are the routines, one by one: -1) void flush_cache_mm(struct mm_struct *mm) +1) ``void flush_cache_mm(struct mm_struct *mm)`` This interface flushes an entire user address space from the caches. That is, after running, there will be no cache @@ -152,7 +153,7 @@ Here are the routines, one by one: This interface is used to handle whole address space page table operations such as what happens during exit and exec. -2) void flush_cache_dup_mm(struct mm_struct *mm) +2) ``void flush_cache_dup_mm(struct mm_struct *mm)`` This interface flushes an entire user address space from the caches. That is, after running, there will be no cache @@ -164,8 +165,8 @@ Here are the routines, one by one: This option is separate from flush_cache_mm to allow some optimizations for VIPT caches. -3) void flush_cache_range(struct vm_area_struct *vma, - unsigned long start, unsigned long end) +3) ``void flush_cache_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end)`` Here we are flushing a specific range of (user) virtual addresses from the cache. After running, there will be no @@ -181,7 +182,7 @@ Here are the routines, one by one: call flush_cache_page (see below) for each entry which may be modified. -4) void flush_cache_page(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn) +4) ``void flush_cache_page(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn)`` This time we need to remove a PAGE_SIZE sized range from the cache. The 'vma' is the backing structure used by @@ -202,7 +203,7 @@ Here are the routines, one by one: This is used primarily during fault processing. -5) void flush_cache_kmaps(void) +5) ``void flush_cache_kmaps(void)`` This routine need only be implemented if the platform utilizes highmem. It will be called right before all of the kmaps @@ -214,8 +215,8 @@ Here are the routines, one by one: This routing should be implemented in asm/highmem.h -6) void flush_cache_vmap(unsigned long start, unsigned long end) - void flush_cache_vunmap(unsigned long start, unsigned long end) +6) ``void flush_cache_vmap(unsigned long start, unsigned long end)`` + ``void flush_cache_vunmap(unsigned long start, unsigned long end)`` Here in these two interfaces we are flushing a specific range of (kernel) virtual addresses from the cache. After running, @@ -243,8 +244,10 @@ size). This setting will force the SYSv IPC layer to only allow user processes to mmap shared memory at address which are a multiple of this value. -NOTE: This does not fix shared mmaps, check out the sparc64 port for -one way to solve this (in particular SPARC_FLAG_MMAPSHARED). +.. note:: + + This does not fix shared mmaps, check out the sparc64 port for + one way to solve this (in particular SPARC_FLAG_MMAPSHARED). Next, you have to solve the D-cache aliasing issue for all other cases. Please keep in mind that fact that, for a given page @@ -255,8 +258,8 @@ physical page into its address space, by implication the D-cache aliasing problem has the potential to exist since the kernel already maps this page at its virtual address. - void copy_user_page(void *to, void *from, unsigned long addr, struct page *page) - void clear_user_page(void *to, unsigned long addr, struct page *page) + ``void copy_user_page(void *to, void *from, unsigned long addr, struct page *page)`` + ``void clear_user_page(void *to, unsigned long addr, struct page *page)`` These two routines store data in user anonymous or COW pages. It allows a port to efficiently avoid D-cache alias @@ -276,14 +279,16 @@ maps this page at its virtual address. If D-cache aliasing is not an issue, these two routines may simply call memcpy/memset directly and do nothing more. - void flush_dcache_page(struct page *page) + ``void flush_dcache_page(struct page *page)`` Any time the kernel writes to a page cache page, _OR_ the kernel is about to read from a page cache page and user space shared/writable mappings of this page potentially exist, this routine is called. - NOTE: This routine need only be called for page cache pages + .. note:: + + This routine need only be called for page cache pages which can potentially ever be mapped into the address space of a user process. So for example, VFS layer code handling vfs symlinks in the page cache need not call @@ -322,18 +327,19 @@ maps this page at its virtual address. made of this flag bit, and if set the flush is done and the flag bit is cleared. - IMPORTANT NOTE: It is often important, if you defer the flush, + .. important:: + + It is often important, if you defer the flush, that the actual flush occurs on the same CPU as did the cpu stores into the page to make it dirty. Again, see sparc64 for examples of how to deal with this. - void copy_to_user_page(struct vm_area_struct *vma, struct page *page, - unsigned long user_vaddr, - void *dst, void *src, int len) - void copy_from_user_page(struct vm_area_struct *vma, struct page *page, - unsigned long user_vaddr, - void *dst, void *src, int len) + ``void copy_to_user_page(struct vm_area_struct *vma, struct page *page, + unsigned long user_vaddr, void *dst, void *src, int len)`` + ``void copy_from_user_page(struct vm_area_struct *vma, struct page *page, + unsigned long user_vaddr, void *dst, void *src, int len)`` + When the kernel needs to copy arbitrary data in and out of arbitrary user pages (f.e. for ptrace()) it will use these two routines. @@ -344,8 +350,9 @@ maps this page at its virtual address. likely that you will need to flush the instruction cache for copy_to_user_page(). - void flush_anon_page(struct vm_area_struct *vma, struct page *page, - unsigned long vmaddr) + ``void flush_anon_page(struct vm_area_struct *vma, struct page *page, + unsigned long vmaddr)`` + When the kernel needs to access the contents of an anonymous page, it calls this function (currently only get_user_pages()). Note: flush_dcache_page() deliberately @@ -354,7 +361,8 @@ maps this page at its virtual address. architectures). For incoherent architectures, it should flush the cache of the page at vmaddr. - void flush_kernel_dcache_page(struct page *page) + ``void flush_kernel_dcache_page(struct page *page)`` + When the kernel needs to modify a user page is has obtained with kmap, it calls this function after all modifications are complete (but before kunmapping it) to bring the underlying @@ -366,14 +374,16 @@ maps this page at its virtual address. the kernel cache for page (using page_address(page)). - void flush_icache_range(unsigned long start, unsigned long end) + ``void flush_icache_range(unsigned long start, unsigned long end)`` + When the kernel stores into addresses that it will execute out of (eg when loading modules), this function is called. If the icache does not snoop stores then this routine will need to flush it. - void flush_icache_page(struct vm_area_struct *vma, struct page *page) + ``void flush_icache_page(struct vm_area_struct *vma, struct page *page)`` + All the functionality of flush_icache_page can be implemented in flush_dcache_page and update_mmu_cache. In the future, the hope is to remove this interface completely. @@ -387,7 +397,8 @@ the kernel trying to do I/O to vmap areas must manually manage coherency. It must do this by flushing the vmap range before doing I/O and invalidating it after the I/O returns. - void flush_kernel_vmap_range(void *vaddr, int size) + ``void flush_kernel_vmap_range(void *vaddr, int size)`` + flushes the kernel cache for a given virtual address range in the vmap area. This is to make sure that any data the kernel modified in the vmap range is made visible to the physical @@ -395,7 +406,8 @@ I/O and invalidating it after the I/O returns. Note that this API does *not* also flush the offset map alias of the area. - void invalidate_kernel_vmap_range(void *vaddr, int size) invalidates + ``void invalidate_kernel_vmap_range(void *vaddr, int size) invalidates`` + the cache for a given virtual address range in the vmap area which prevents the processor from making the cache stale by speculatively reading data while the I/O was occurring to the diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt index 946e69103cdd..cefb63639070 100644 --- a/Documentation/cgroup-v1/memory.txt +++ b/Documentation/cgroup-v1/memory.txt @@ -789,23 +789,46 @@ way to trigger. Applications should do whatever they can to help the system. It might be too late to consult with vmstat or any other statistics, so it's advisable to take an immediate action. -The events are propagated upward until the event is handled, i.e. the -events are not pass-through. Here is what this means: for example you have -three cgroups: A->B->C. Now you set up an event listener on cgroups A, B -and C, and suppose group C experiences some pressure. In this situation, -only group C will receive the notification, i.e. groups A and B will not -receive it. This is done to avoid excessive "broadcasting" of messages, -which disturbs the system and which is especially bad if we are low on -memory or thrashing. So, organize the cgroups wisely, or propagate the -events manually (or, ask us to implement the pass-through events, -explaining why would you need them.) +By default, events are propagated upward until the event is handled, i.e. the +events are not pass-through. For example, you have three cgroups: A->B->C. Now +you set up an event listener on cgroups A, B and C, and suppose group C +experiences some pressure. In this situation, only group C will receive the +notification, i.e. groups A and B will not receive it. This is done to avoid +excessive "broadcasting" of messages, which disturbs the system and which is +especially bad if we are low on memory or thrashing. Group B, will receive +notification only if there are no event listers for group C. + +There are three optional modes that specify different propagation behavior: + + - "default": this is the default behavior specified above. This mode is the + same as omitting the optional mode parameter, preserved by backwards + compatibility. + + - "hierarchy": events always propagate up to the root, similar to the default + behavior, except that propagation continues regardless of whether there are + event listeners at each level, with the "hierarchy" mode. In the above + example, groups A, B, and C will receive notification of memory pressure. + + - "local": events are pass-through, i.e. they only receive notifications when + memory pressure is experienced in the memcg for which the notification is + registered. In the above example, group C will receive notification if + registered for "local" notification and the group experiences memory + pressure. However, group B will never receive notification, regardless if + there is an event listener for group C or not, if group B is registered for + local notification. + +The level and event notification mode ("hierarchy" or "local", if necessary) are +specified by a comma-delimited string, i.e. "low,hierarchy" specifies +hierarchical, pass-through, notification for all ancestor memcgs. Notification +that is the default, non pass-through behavior, does not specify a mode. +"medium,local" specifies pass-through notification for the medium level. The file memory.pressure_level is only used to setup an eventfd. To register a notification, an application must: - create an eventfd using eventfd(2); - open memory.pressure_level; -- write string like "<event_fd> <fd of memory.pressure_level> <level>" +- write string as "<event_fd> <fd of memory.pressure_level> <level[,mode]>" to cgroup.event_control. Application will be notified through eventfd when memory pressure is at @@ -821,7 +844,7 @@ Test: # cd /sys/fs/cgroup/memory/ # mkdir foo # cd foo - # cgroup_event_listener memory.pressure_level low & + # cgroup_event_listener memory.pressure_level low,hierarchy & # echo 8000000 > memory.limit_in_bytes # echo 8000000 > memory.memsw.limit_in_bytes # echo $$ > tasks diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt index dc5e2dcdbef4..bde177103567 100644 --- a/Documentation/cgroup-v2.txt +++ b/Documentation/cgroup-v2.txt @@ -1,7 +1,9 @@ - +================ Control Group v2 +================ -October, 2015 Tejun Heo <tj@kernel.org> +:Date: October, 2015 +:Author: Tejun Heo <tj@kernel.org> This is the authoritative documentation on the design, interface and conventions of cgroup v2. It describes all userland-visible aspects @@ -9,70 +11,72 @@ of cgroup including core and specific controller behaviors. All future changes must be reflected in this document. Documentation for v1 is available under Documentation/cgroup-v1/. -CONTENTS - -1. Introduction - 1-1. Terminology - 1-2. What is cgroup? -2. Basic Operations - 2-1. Mounting - 2-2. Organizing Processes - 2-3. [Un]populated Notification - 2-4. Controlling Controllers - 2-4-1. Enabling and Disabling - 2-4-2. Top-down Constraint - 2-4-3. No Internal Process Constraint - 2-5. Delegation - 2-5-1. Model of Delegation - 2-5-2. Delegation Containment - 2-6. Guidelines - 2-6-1. Organize Once and Control - 2-6-2. Avoid Name Collisions -3. Resource Distribution Models - 3-1. Weights - 3-2. Limits - 3-3. Protections - 3-4. Allocations -4. Interface Files - 4-1. Format - 4-2. Conventions - 4-3. Core Interface Files -5. Controllers - 5-1. CPU - 5-1-1. CPU Interface Files - 5-2. Memory - 5-2-1. Memory Interface Files - 5-2-2. Usage Guidelines - 5-2-3. Memory Ownership - 5-3. IO - 5-3-1. IO Interface Files - 5-3-2. Writeback - 5-4. PID - 5-4-1. PID Interface Files - 5-5. RDMA - 5-5-1. RDMA Interface Files - 5-6. Misc - 5-6-1. perf_event -6. Namespace - 6-1. Basics - 6-2. The Root and Views - 6-3. Migration and setns(2) - 6-4. Interaction with Other Namespaces -P. Information on Kernel Programming - P-1. Filesystem Support for Writeback -D. Deprecated v1 Core Features -R. Issues with v1 and Rationales for v2 - R-1. Multiple Hierarchies - R-2. Thread Granularity - R-3. Competition Between Inner Nodes and Threads - R-4. Other Interface Issues - R-5. Controller Issues and Remedies - R-5-1. Memory - - -1. Introduction - -1-1. Terminology +.. CONTENTS + + 1. Introduction + 1-1. Terminology + 1-2. What is cgroup? + 2. Basic Operations + 2-1. Mounting + 2-2. Organizing Processes + 2-3. [Un]populated Notification + 2-4. Controlling Controllers + 2-4-1. Enabling and Disabling + 2-4-2. Top-down Constraint + 2-4-3. No Internal Process Constraint + 2-5. Delegation + 2-5-1. Model of Delegation + 2-5-2. Delegation Containment + 2-6. Guidelines + 2-6-1. Organize Once and Control + 2-6-2. Avoid Name Collisions + 3. Resource Distribution Models + 3-1. Weights + 3-2. Limits + 3-3. Protections + 3-4. Allocations + 4. Interface Files + 4-1. Format + 4-2. Conventions + 4-3. Core Interface Files + 5. Controllers + 5-1. CPU + 5-1-1. CPU Interface Files + 5-2. Memory + 5-2-1. Memory Interface Files + 5-2-2. Usage Guidelines + 5-2-3. Memory Ownership + 5-3. IO + 5-3-1. IO Interface Files + 5-3-2. Writeback + 5-4. PID + 5-4-1. PID Interface Files + 5-5. RDMA + 5-5-1. RDMA Interface Files + 5-6. Misc + 5-6-1. perf_event + 6. Namespace + 6-1. Basics + 6-2. The Root and Views + 6-3. Migration and setns(2) + 6-4. Interaction with Other Namespaces + P. Information on Kernel Programming + P-1. Filesystem Support for Writeback + D. Deprecated v1 Core Features + R. Issues with v1 and Rationales for v2 + R-1. Multiple Hierarchies + R-2. Thread Granularity + R-3. Competition Between Inner Nodes and Threads + R-4. Other Interface Issues + R-5. Controller Issues and Remedies + R-5-1. Memory + + +Introduction +============ + +Terminology +----------- "cgroup" stands for "control group" and is never capitalized. The singular form is used to designate the whole feature and also as a @@ -80,7 +84,8 @@ qualifier as in "cgroup controllers". When explicitly referring to multiple individual control groups, the plural form "cgroups" is used. -1-2. What is cgroup? +What is cgroup? +--------------- cgroup is a mechanism to organize processes hierarchically and distribute system resources along the hierarchy in a controlled and @@ -110,12 +115,14 @@ restrictions set closer to the root in the hierarchy can not be overridden from further away. -2. Basic Operations +Basic Operations +================ -2-1. Mounting +Mounting +-------- Unlike v1, cgroup v2 has only single hierarchy. The cgroup v2 -hierarchy can be mounted with the following mount command. +hierarchy can be mounted with the following mount command:: # mount -t cgroup2 none $MOUNT_POINT @@ -149,11 +156,22 @@ during boot, before manual intervention is possible. To make testing and experimenting easier, the kernel parameter cgroup_no_v1= allows disabling controllers in v1 and make them always available in v2. +cgroup v2 currently supports the following mount options. + + nsdelegate + + Consider cgroup namespaces as delegation boundaries. This + option is system wide and can only be set on mount or modified + through remount from the init namespace. The mount option is + ignored on non-init namespace mounts. Please refer to the + Delegation section for details. -2-2. Organizing Processes + +Organizing Processes +-------------------- Initially, only the root cgroup exists to which all processes belong. -A child cgroup can be created by creating a sub-directory. +A child cgroup can be created by creating a sub-directory:: # mkdir $CGROUP_NAME @@ -180,28 +198,29 @@ moved to another cgroup. A cgroup which doesn't have any children or live processes can be destroyed by removing the directory. Note that a cgroup which doesn't have any children and is associated only with zombie processes is -considered empty and can be removed. +considered empty and can be removed:: # rmdir $CGROUP_NAME "/proc/$PID/cgroup" lists a process's cgroup membership. If legacy cgroup is in use in the system, this file may contain multiple lines, one for each hierarchy. The entry for cgroup v2 is always in the -format "0::$PATH". +format "0::$PATH":: # cat /proc/842/cgroup ... 0::/test-cgroup/test-cgroup-nested If the process becomes a zombie and the cgroup it was associated with -is removed subsequently, " (deleted)" is appended to the path. +is removed subsequently, " (deleted)" is appended to the path:: # cat /proc/842/cgroup ... 0::/test-cgroup/test-cgroup-nested (deleted) -2-3. [Un]populated Notification +[Un]populated Notification +-------------------------- Each non-root cgroup has a "cgroup.events" file which contains "populated" field indicating whether the cgroup's sub-hierarchy has @@ -212,7 +231,7 @@ example, to start a clean-up operation after all processes of a given sub-hierarchy have exited. The populated state updates and notifications are recursive. Consider the following sub-hierarchy where the numbers in the parentheses represent the numbers of processes -in each cgroup. +in each cgroup:: A(4) - B(0) - C(1) \ D(0) @@ -223,18 +242,20 @@ file modified events will be generated on the "cgroup.events" files of both cgroups. -2-4. Controlling Controllers +Controlling Controllers +----------------------- -2-4-1. Enabling and Disabling +Enabling and Disabling +~~~~~~~~~~~~~~~~~~~~~~ Each cgroup has a "cgroup.controllers" file which lists all -controllers available for the cgroup to enable. +controllers available for the cgroup to enable:: # cat cgroup.controllers cpu io memory No controller is enabled by default. Controllers can be enabled and -disabled by writing to the "cgroup.subtree_control" file. +disabled by writing to the "cgroup.subtree_control" file:: # echo "+cpu +memory -io" > cgroup.subtree_control @@ -246,7 +267,7 @@ are specified, the last one is effective. Enabling a controller in a cgroup indicates that the distribution of the target resource across its immediate children will be controlled. Consider the following sub-hierarchy. The enabled controllers are -listed in parentheses. +listed in parentheses:: A(cpu,memory) - B(memory) - C() \ D() @@ -266,7 +287,8 @@ controller interface files - anything which doesn't start with "cgroup." are owned by the parent rather than the cgroup itself. -2-4-2. Top-down Constraint +Top-down Constraint +~~~~~~~~~~~~~~~~~~~ Resources are distributed top-down and a cgroup can further distribute a resource only if the resource has been distributed to it from the @@ -277,7 +299,8 @@ the parent has the controller enabled and a controller can't be disabled if one or more children have it enabled. -2-4-3. No Internal Process Constraint +No Internal Process Constraint +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Non-root cgroups can only distribute resources to their children when they don't have any processes of their own. In other words, only @@ -304,35 +327,49 @@ children before enabling controllers in its "cgroup.subtree_control" file. -2-5. Delegation +Delegation +---------- + +Model of Delegation +~~~~~~~~~~~~~~~~~~~ -2-5-1. Model of Delegation +A cgroup can be delegated in two ways. First, to a less privileged +user by granting write access of the directory and its "cgroup.procs" +and "cgroup.subtree_control" files to the user. Second, if the +"nsdelegate" mount option is set, automatically to a cgroup namespace +on namespace creation. -A cgroup can be delegated to a less privileged user by granting write -access of the directory and its "cgroup.procs" file to the user. Note -that resource control interface files in a given directory control the -distribution of the parent's resources and thus must not be delegated -along with the directory. +Because the resource control interface files in a given directory +control the distribution of the parent's resources, the delegatee +shouldn't be allowed to write to them. For the first method, this is +achieved by not granting access to these files. For the second, the +kernel rejects writes to all files other than "cgroup.procs" and +"cgroup.subtree_control" on a namespace root from inside the +namespace. -Once delegated, the user can build sub-hierarchy under the directory, -organize processes as it sees fit and further distribute the resources -it received from the parent. The limits and other settings of all -resource controllers are hierarchical and regardless of what happens -in the delegated sub-hierarchy, nothing can escape the resource -restrictions imposed by the parent. +The end results are equivalent for both delegation types. Once +delegated, the user can build sub-hierarchy under the directory, +organize processes inside it as it sees fit and further distribute the +resources it received from the parent. The limits and other settings +of all resource controllers are hierarchical and regardless of what +happens in the delegated sub-hierarchy, nothing can escape the +resource restrictions imposed by the parent. Currently, cgroup doesn't impose any restrictions on the number of cgroups in or nesting depth of a delegated sub-hierarchy; however, this may be limited explicitly in the future. -2-5-2. Delegation Containment +Delegation Containment +~~~~~~~~~~~~~~~~~~~~~~ A delegated sub-hierarchy is contained in the sense that processes -can't be moved into or out of the sub-hierarchy by the delegatee. For -a process with a non-root euid to migrate a target process into a -cgroup by writing its PID to the "cgroup.procs" file, the following -conditions must be met. +can't be moved into or out of the sub-hierarchy by the delegatee. + +For delegations to a less privileged user, this is achieved by +requiring the following conditions for a process with a non-root euid +to migrate a target process into a cgroup by writing its PID to the +"cgroup.procs" file. - The writer must have write access to the "cgroup.procs" file. @@ -345,7 +382,7 @@ in from or push out to outside the sub-hierarchy. For an example, let's assume cgroups C0 and C1 have been delegated to user U0 who created C00, C01 under C0 and C10 under C1 as follows and -all processes under C0 and C1 belong to U0. +all processes under C0 and C1 belong to U0:: ~~~~~~~~~~~~~ - C0 - C00 ~ cgroup ~ \ C01 @@ -359,10 +396,17 @@ destination cgroup C00 is above the points of delegation and U0 would not have write access to its "cgroup.procs" files and thus the write will be denied with -EACCES. +For delegations to namespaces, containment is achieved by requiring +that both the source and destination cgroups are reachable from the +namespace of the process which is attempting the migration. If either +is not reachable, the migration is rejected with -ENOENT. + -2-6. Guidelines +Guidelines +---------- -2-6-1. Organize Once and Control +Organize Once and Control +~~~~~~~~~~~~~~~~~~~~~~~~~ Migrating a process across cgroups is a relatively expensive operation and stateful resources such as memory are not moved together with the @@ -378,7 +422,8 @@ distribution can be made by changing controller configuration through the interface files. -2-6-2. Avoid Name Collisions +Avoid Name Collisions +~~~~~~~~~~~~~~~~~~~~~ Interface files for a cgroup and its children cgroups occupy the same directory and it is possible to create children cgroups which collide @@ -396,14 +441,16 @@ cgroup doesn't do anything to prevent name collisions and it's the user's responsibility to avoid them. -3. Resource Distribution Models +Resource Distribution Models +============================ cgroup controllers implement several resource distribution schemes depending on the resource type and expected use cases. This section describes major schemes in use along with their expected behaviors. -3-1. Weights +Weights +------- A parent's resource is distributed by adding up the weights of all active children and giving each the fraction matching the ratio of its @@ -424,7 +471,8 @@ process migrations. and is an example of this type. -3-2. Limits +Limits +------ A child can only consume upto the configured amount of the resource. Limits can be over-committed - the sum of the limits of children can @@ -440,7 +488,8 @@ process migrations. on an IO device and is an example of this type. -3-3. Protections +Protections +----------- A cgroup is protected to be allocated upto the configured amount of the resource if the usages of all its ancestors are under their @@ -460,7 +509,8 @@ process migrations. example of this type. -3-4. Allocations +Allocations +----------- A cgroup is exclusively allocated a certain amount of a finite resource. Allocations can't be over-committed - the sum of the @@ -479,12 +529,14 @@ may be rejected. type. -4. Interface Files +Interface Files +=============== -4-1. Format +Format +------ All interface files should be in one of the following formats whenever -possible. +possible:: New-line separated values (when only one value can be written at once) @@ -519,7 +571,8 @@ can be written at a time. For nested keyed files, the sub key pairs may be specified in any order and not all pairs have to be specified. -4-2. Conventions +Conventions +----------- - Settings for a single feature should be contained in a single file. @@ -555,25 +608,25 @@ may be specified in any order and not all pairs have to be specified. with "default" as the value must not appear when read. For example, a setting which is keyed by major:minor device numbers - with integer values may look like the following. + with integer values may look like the following:: # cat cgroup-example-interface-file default 150 8:0 300 - The default value can be updated by + The default value can be updated by:: # echo 125 > cgroup-example-interface-file - or + or:: # echo "default 125" > cgroup-example-interface-file - An override can be set by + An override can be set by:: # echo "8:16 170" > cgroup-example-interface-file - and cleared by + and cleared by:: # echo "8:0 default" > cgroup-example-interface-file # cat cgroup-example-interface-file @@ -586,12 +639,12 @@ may be specified in any order and not all pairs have to be specified. generated on the file. -4-3. Core Interface Files +Core Interface Files +-------------------- All cgroup core files are prefixed with "cgroup." cgroup.procs - A read-write new-line separated values file which exists on all cgroups. @@ -617,7 +670,6 @@ All cgroup core files are prefixed with "cgroup." should be granted along with the containing directory. cgroup.controllers - A read-only space separated values file which exists on all cgroups. @@ -625,7 +677,6 @@ All cgroup core files are prefixed with "cgroup." the cgroup. The controllers are not ordered. cgroup.subtree_control - A read-write space separated values file which exists on all cgroups. Starts out empty. @@ -641,23 +692,25 @@ All cgroup core files are prefixed with "cgroup." operations are specified, either all succeed or all fail. cgroup.events - A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. Unless specified otherwise, a value change in this file generates a file modified event. populated - 1 if the cgroup or its descendants contains any live processes; otherwise, 0. -5. Controllers +Controllers +=========== + +CPU +--- -5-1. CPU +.. note:: -[NOTE: The interface for the cpu controller hasn't been merged yet] + The interface for the cpu controller hasn't been merged yet The "cpu" controllers regulates distribution of CPU cycles. This controller implements weight and absolute bandwidth limit models for @@ -665,36 +718,34 @@ normal scheduling policy and absolute bandwidth allocation model for realtime scheduling policy. -5-1-1. CPU Interface Files +CPU Interface Files +~~~~~~~~~~~~~~~~~~~ All time durations are in microseconds. cpu.stat - A read-only flat-keyed file which exists on non-root cgroups. - It reports the following six stats. + It reports the following six stats: - usage_usec - user_usec - system_usec - nr_periods - nr_throttled - throttled_usec + - usage_usec + - user_usec + - system_usec + - nr_periods + - nr_throttled + - throttled_usec cpu.weight - A read-write single value file which exists on non-root cgroups. The default is "100". The weight in the range [1, 10000]. cpu.max - A read-write two value file which exists on non-root cgroups. The default is "max 100000". - The maximum bandwidth limit. It's in the following format. + The maximum bandwidth limit. It's in the following format:: $MAX $PERIOD @@ -703,9 +754,10 @@ All time durations are in microseconds. one number is written, $MAX is updated. cpu.rt.max + .. note:: - [NOTE: The semantics of this file is still under discussion and the - interface hasn't been merged yet] + The semantics of this file is still under discussion and the + interface hasn't been merged yet A read-write two value file which exists on all cgroups. The default is "0 100000". @@ -713,7 +765,7 @@ All time durations are in microseconds. The maximum realtime runtime allocation. Over-committing configurations are disallowed and process migrations are rejected if not enough bandwidth is available. It's in the - following format. + following format:: $MAX $PERIOD @@ -722,7 +774,8 @@ All time durations are in microseconds. updated. -5-2. Memory +Memory +------ The "memory" controller regulates distribution of memory. Memory is stateful and implements both limit and protection models. Due to the @@ -744,14 +797,14 @@ following types of memory usages are tracked. The above list may expand in the future for better coverage. -5-2-1. Memory Interface Files +Memory Interface Files +~~~~~~~~~~~~~~~~~~~~~~ All memory amounts are in bytes. If a value which is not aligned to PAGE_SIZE is written, the value may be rounded up to the closest PAGE_SIZE multiple when read back. memory.current - A read-only single value file which exists on non-root cgroups. @@ -759,7 +812,6 @@ PAGE_SIZE multiple when read back. and its descendants. memory.low - A read-write single value file which exists on non-root cgroups. The default is "0". @@ -772,7 +824,6 @@ PAGE_SIZE multiple when read back. protection is discouraged. memory.high - A read-write single value file which exists on non-root cgroups. The default is "max". @@ -785,7 +836,6 @@ PAGE_SIZE multiple when read back. under extreme conditions the limit may be breached. memory.max - A read-write single value file which exists on non-root cgroups. The default is "max". @@ -800,21 +850,18 @@ PAGE_SIZE multiple when read back. utility is limited to providing the final safety net. memory.events - A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. Unless specified otherwise, a value change in this file generates a file modified event. low - The number of times the cgroup is reclaimed due to high memory pressure even though its usage is under the low boundary. This usually indicates that the low boundary is over-committed. high - The number of times processes of the cgroup are throttled and routed to perform direct memory reclaim because the high memory boundary was exceeded. For a @@ -823,19 +870,27 @@ PAGE_SIZE multiple when read back. occurrences are expected. max - The number of times the cgroup's memory usage was about to go over the max boundary. If direct reclaim - fails to bring it down, the OOM killer is invoked. + fails to bring it down, the cgroup goes to OOM state. oom + The number of time the cgroup's memory usage was + reached the limit and allocation was about to fail. - The number of times the OOM killer has been invoked in - the cgroup. This may not exactly match the number of - processes killed but should generally be close. + Depending on context result could be invocation of OOM + killer and retrying allocation or failing alloction. - memory.stat + Failed allocation in its turn could be returned into + userspace as -ENOMEM or siletly ignored in cases like + disk readahead. For now OOM in memory cgroup kills + tasks iff shortage has happened inside page fault. + oom_kill + The number of processes belonging to this cgroup + killed by any kind of OOM killer. + + memory.stat A read-only flat-keyed file which exists on non-root cgroups. This breaks down the cgroup's memory footprint into different @@ -849,73 +904,55 @@ PAGE_SIZE multiple when read back. fixed position; use the keys to look up specific values! anon - Amount of memory used in anonymous mappings such as brk(), sbrk(), and mmap(MAP_ANONYMOUS) file - Amount of memory used to cache filesystem data, including tmpfs and shared memory. kernel_stack - Amount of memory allocated to kernel stacks. slab - Amount of memory used for storing in-kernel data structures. sock - Amount of memory used in network transmission buffers shmem - Amount of cached filesystem data that is swap-backed, such as tmpfs, shm segments, shared anonymous mmap()s file_mapped - Amount of cached filesystem data mapped with mmap() file_dirty - Amount of cached filesystem data that was modified but not yet written back to disk file_writeback - Amount of cached filesystem data that was modified and is currently being written back to disk - inactive_anon - active_anon - inactive_file - active_file - unevictable - + inactive_anon, active_anon, inactive_file, active_file, unevictable Amount of memory, swap-backed and filesystem-backed, on the internal memory management lists used by the page reclaim algorithm slab_reclaimable - Part of "slab" that might be reclaimed, such as dentries and inodes. slab_unreclaimable - Part of "slab" that cannot be reclaimed on memory pressure. pgfault - Total number of page faults incurred pgmajfault - Number of major page faults incurred workingset_refault @@ -930,8 +967,35 @@ PAGE_SIZE multiple when read back. Number of times a shadow node has been reclaimed - memory.swap.current + pgrefill + + Amount of scanned pages (in an active LRU list) + + pgscan + + Amount of scanned pages (in an inactive LRU list) + + pgsteal + + Amount of reclaimed pages + + pgactivate + + Amount of pages moved to the active LRU list + + pgdeactivate + Amount of pages moved to the inactive LRU lis + + pglazyfree + + Amount of pages postponed to be freed under memory pressure + + pglazyfreed + + Amount of reclaimed lazyfree pages + + memory.swap.current A read-only single value file which exists on non-root cgroups. @@ -939,7 +1003,6 @@ PAGE_SIZE multiple when read back. and its descendants. memory.swap.max - A read-write single value file which exists on non-root cgroups. The default is "max". @@ -947,7 +1010,8 @@ PAGE_SIZE multiple when read back. limit, anonymous meomry of the cgroup will not be swapped out. -5-2-2. Usage Guidelines +Usage Guidelines +~~~~~~~~~~~~~~~~ "memory.high" is the main mechanism to control memory usage. Over-committing on high limit (sum of high limits > available memory) @@ -970,7 +1034,8 @@ memory; unfortunately, memory pressure monitoring mechanism isn't implemented yet. -5-2-3. Memory Ownership +Memory Ownership +~~~~~~~~~~~~~~~~ A memory area is charged to the cgroup which instantiated it and stays charged to the cgroup until the area is released. Migrating a process @@ -988,7 +1053,8 @@ POSIX_FADV_DONTNEED to relinquish the ownership of memory areas belonging to the affected files to ensure correct memory ownership. -5-3. IO +IO +-- The "io" controller regulates the distribution of IO resources. This controller implements both weight based and absolute bandwidth or IOPS @@ -997,28 +1063,29 @@ only if cfq-iosched is in use and neither scheme is available for blk-mq devices. -5-3-1. IO Interface Files +IO Interface Files +~~~~~~~~~~~~~~~~~~ io.stat - A read-only nested-keyed file which exists on non-root cgroups. Lines are keyed by $MAJ:$MIN device numbers and not ordered. The following nested keys are defined. + ====== =================== rbytes Bytes read wbytes Bytes written rios Number of read IOs wios Number of write IOs + ====== =================== - An example read output follows. + An example read output follows: 8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 io.weight - A read-write flat-keyed file which exists on non-root cgroups. The default is "default 100". @@ -1032,14 +1099,13 @@ blk-mq devices. $WEIGHT" or simply "$WEIGHT". Overrides can be set by writing "$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default". - An example read output follows. + An example read output follows:: default 100 8:16 200 8:0 50 io.max - A read-write nested-keyed file which exists on non-root cgroups. @@ -1047,10 +1113,12 @@ blk-mq devices. device numbers and not ordered. The following nested keys are defined. + ===== ================================== rbps Max read bytes per second wbps Max write bytes per second riops Max read IO operations per second wiops Max write IO operations per second + ===== ================================== When writing, any number of nested key-value pairs can be specified in any order. "max" can be specified as the value @@ -1060,24 +1128,25 @@ blk-mq devices. BPS and IOPS are measured in each IO direction and IOs are delayed if limit is reached. Temporary bursts are allowed. - Setting read limit at 2M BPS and write at 120 IOPS for 8:16. + Setting read limit at 2M BPS and write at 120 IOPS for 8:16:: echo "8:16 rbps=2097152 wiops=120" > io.max - Reading returns the following. + Reading returns the following:: 8:16 rbps=2097152 wbps=max riops=max wiops=120 - Write IOPS limit can be removed by writing the following. + Write IOPS limit can be removed by writing the following:: echo "8:16 wiops=max" > io.max - Reading now returns the following. + Reading now returns the following:: 8:16 rbps=2097152 wbps=max riops=max wiops=max -5-3-2. Writeback +Writeback +~~~~~~~~~ Page cache is dirtied through buffered writes and shared mmaps and written asynchronously to the backing filesystem by the writeback @@ -1125,22 +1194,19 @@ patterns. The sysctl knobs which affect writeback behavior are applied to cgroup writeback as follows. - vm.dirty_background_ratio - vm.dirty_ratio - + vm.dirty_background_ratio, vm.dirty_ratio These ratios apply the same to cgroup writeback with the amount of available memory capped by limits imposed by the memory controller and system-wide clean memory. - vm.dirty_background_bytes - vm.dirty_bytes - + vm.dirty_background_bytes, vm.dirty_bytes For cgroup writeback, this is calculated into ratio against total available memory and applied the same way as vm.dirty[_background]_ratio. -5-4. PID +PID +--- The process number controller is used to allow a cgroup to stop any new tasks from being fork()'d or clone()'d after a specified limit is @@ -1155,17 +1221,16 @@ Note that PIDs used in this controller refer to TIDs, process IDs as used by the kernel. -5-4-1. PID Interface Files +PID Interface Files +~~~~~~~~~~~~~~~~~~~ pids.max - A read-write single value file which exists on non-root cgroups. The default is "max". Hard limit of number of processes. pids.current - A read-only single value file which exists on all cgroups. The number of processes currently in the cgroup and its @@ -1180,12 +1245,14 @@ through fork() or clone(). These will return -EAGAIN if the creation of a new process would cause a cgroup policy to be violated. -5-5. RDMA +RDMA +---- The "rdma" controller regulates the distribution and accounting of of RDMA resources. -5-5-1. RDMA Interface Files +RDMA Interface Files +~~~~~~~~~~~~~~~~~~~~ rdma.max A readwrite nested-keyed file that exists for all the cgroups @@ -1198,10 +1265,12 @@ of RDMA resources. The following nested keys are defined. + ========== ============================= hca_handle Maximum number of HCA Handles hca_object Maximum number of HCA Objects + ========== ============================= - An example for mlx4 and ocrdma device follows. + An example for mlx4 and ocrdma device follows:: mlx4_0 hca_handle=2 hca_object=2000 ocrdma1 hca_handle=3 hca_object=max @@ -1210,15 +1279,17 @@ of RDMA resources. A read-only file that describes current resource usage. It exists for all the cgroup except root. - An example for mlx4 and ocrdma device follows. + An example for mlx4 and ocrdma device follows:: mlx4_0 hca_handle=1 hca_object=20 ocrdma1 hca_handle=1 hca_object=23 -5-6. Misc +Misc +---- -5-6-1. perf_event +perf_event +~~~~~~~~~~ perf_event controller, if not mounted on a legacy hierarchy, is automatically enabled on the v2 hierarchy so that perf events can @@ -1226,9 +1297,11 @@ always be filtered by cgroup v2 path. The controller can still be moved to a legacy hierarchy after v2 hierarchy is populated. -6. Namespace +Namespace +========= -6-1. Basics +Basics +------ cgroup namespace provides a mechanism to virtualize the view of the "/proc/$PID/cgroup" file and cgroup mounts. The CLONE_NEWCGROUP clone @@ -1242,7 +1315,7 @@ Without cgroup namespace, the "/proc/$PID/cgroup" file shows the complete path of the cgroup of a process. In a container setup where a set of cgroups and namespaces are intended to isolate processes the "/proc/$PID/cgroup" file may leak potential system level information -to the isolated processes. For Example: +to the isolated processes. For Example:: # cat /proc/self/cgroup 0::/batchjobs/container_id1 @@ -1250,14 +1323,14 @@ to the isolated processes. For Example: The path '/batchjobs/container_id1' can be considered as system-data and undesirable to expose to the isolated processes. cgroup namespace can be used to restrict visibility of this path. For example, before -creating a cgroup namespace, one would see: +creating a cgroup namespace, one would see:: # ls -l /proc/self/ns/cgroup lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835] # cat /proc/self/cgroup 0::/batchjobs/container_id1 -After unsharing a new namespace, the view changes. +After unsharing a new namespace, the view changes:: # ls -l /proc/self/ns/cgroup lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183] @@ -1275,7 +1348,8 @@ namespace is destroyed. The cgroupns root and the actual cgroups remain. -6-2. The Root and Views +The Root and Views +------------------ The 'cgroupns root' for a cgroup namespace is the cgroup in which the process calling unshare(2) is running. For example, if a process in @@ -1284,7 +1358,7 @@ process calling unshare(2) is running. For example, if a process in init_cgroup_ns, this is the real root ('/') cgroup. The cgroupns root cgroup does not change even if the namespace creator -process later moves to a different cgroup. +process later moves to a different cgroup:: # ~/unshare -c # unshare cgroupns in some cgroup # cat /proc/self/cgroup @@ -1298,7 +1372,7 @@ Each process gets its namespace-specific view of "/proc/$PID/cgroup" Processes running inside the cgroup namespace will be able to see cgroup paths (in /proc/self/cgroup) only inside their root cgroup. -From within an unshared cgroupns: +From within an unshared cgroupns:: # sleep 100000 & [1] 7353 @@ -1307,7 +1381,7 @@ From within an unshared cgroupns: 0::/sub_cgrp_1 From the initial cgroup namespace, the real cgroup path will be -visible: +visible:: $ cat /proc/7353/cgroup 0::/batchjobs/container_id1/sub_cgrp_1 @@ -1315,7 +1389,7 @@ visible: From a sibling cgroup namespace (that is, a namespace rooted at a different cgroup), the cgroup path relative to its own cgroup namespace root will be shown. For instance, if PID 7353's cgroup -namespace root is at '/batchjobs/container_id2', then it will see +namespace root is at '/batchjobs/container_id2', then it will see:: # cat /proc/7353/cgroup 0::/../container_id2/sub_cgrp_1 @@ -1324,13 +1398,14 @@ Note that the relative path always starts with '/' to indicate that its relative to the cgroup namespace root of the caller. -6-3. Migration and setns(2) +Migration and setns(2) +---------------------- Processes inside a cgroup namespace can move into and out of the namespace root if they have proper access to external cgroups. For example, from inside a namespace with cgroupns root at /batchjobs/container_id1, and assuming that the global hierarchy is -still accessible inside cgroupns: +still accessible inside cgroupns:: # cat /proc/7353/cgroup 0::/sub_cgrp_1 @@ -1352,10 +1427,11 @@ namespace. It is expected that the someone moves the attaching process under the target cgroup namespace root. -6-4. Interaction with Other Namespaces +Interaction with Other Namespaces +--------------------------------- Namespace specific cgroup hierarchy can be mounted by a process -running inside a non-init cgroup namespace. +running inside a non-init cgroup namespace:: # mount -t cgroup2 none $MOUNT_POINT @@ -1368,27 +1444,27 @@ the view of cgroup hierarchy by namespace-private cgroupfs mount provides a properly isolated cgroup view inside the container. -P. Information on Kernel Programming +Information on Kernel Programming +================================= This section contains kernel programming information in the areas where interacting with cgroup is necessary. cgroup core and controllers are not covered. -P-1. Filesystem Support for Writeback +Filesystem Support for Writeback +-------------------------------- A filesystem can support cgroup writeback by updating address_space_operations->writepage[s]() to annotate bio's using the following two functions. wbc_init_bio(@wbc, @bio) - Should be called for each bio carrying writeback data and associates the bio with the inode's owner cgroup. Can be called anytime between bio allocation and submission. wbc_account_io(@wbc, @page, @bytes) - Should be called for each data segment being written out. While this function doesn't care exactly when it's called during the writeback session, it's the easiest and most @@ -1409,11 +1485,12 @@ cases by skipping wbc_init_bio() or using bio_associate_blkcg() directly. -D. Deprecated v1 Core Features +Deprecated v1 Core Features +=========================== - Multiple hierarchies including named ones are not supported. -- All mount options and remounting are not supported. +- All v1 mount options are not supported. - The "tasks" file is removed and "cgroup.procs" is not sorted. @@ -1423,9 +1500,11 @@ D. Deprecated v1 Core Features at the root instead. -R. Issues with v1 and Rationales for v2 +Issues with v1 and Rationales for v2 +==================================== -R-1. Multiple Hierarchies +Multiple Hierarchies +-------------------- cgroup v1 allowed an arbitrary number of hierarchies and each hierarchy could host any number of controllers. While this seemed to @@ -1477,7 +1556,8 @@ how memory is distributed beyond a certain level while still wanting to control how CPU cycles are distributed. -R-2. Thread Granularity +Thread Granularity +------------------ cgroup v1 allowed threads of a process to belong to different cgroups. This didn't make sense for some controllers and those controllers @@ -1520,7 +1600,8 @@ misbehaving and poorly abstracted interfaces and kernel exposing and locked into constructs inadvertently. -R-3. Competition Between Inner Nodes and Threads +Competition Between Inner Nodes and Threads +------------------------------------------- cgroup v1 allowed threads to be in any cgroups which created an interesting problem where threads belonging to a parent cgroup and its @@ -1539,7 +1620,7 @@ simply weren't available for threads. The io controller implicitly created a hidden leaf node for each cgroup to host the threads. The hidden leaf had its own copies of all -the knobs with "leaf_" prefixed. While this allowed equivalent +the knobs with ``leaf_`` prefixed. While this allowed equivalent control over internal threads, it was with serious drawbacks. It always added an extra layer of nesting which wouldn't be necessary otherwise, made the interface messy and significantly complicated the @@ -1560,7 +1641,8 @@ This clearly is a problem which needs to be addressed from cgroup core in a uniform way. -R-4. Other Interface Issues +Other Interface Issues +---------------------- cgroup v1 grew without oversight and developed a large number of idiosyncrasies and inconsistencies. One issue on the cgroup core side @@ -1588,9 +1670,11 @@ cgroup v2 establishes common conventions where appropriate and updates controllers so that they expose minimal and consistent interfaces. -R-5. Controller Issues and Remedies +Controller Issues and Remedies +------------------------------ -R-5-1. Memory +Memory +~~~~~~ The original lower boundary, the soft limit, is defined as a limit that is per default unset. As a result, the set of cgroups that diff --git a/Documentation/circular-buffers.txt b/Documentation/circular-buffers.txt index 4a824d232472..d4628174b7c5 100644 --- a/Documentation/circular-buffers.txt +++ b/Documentation/circular-buffers.txt @@ -1,9 +1,9 @@ - ================ - CIRCULAR BUFFERS - ================ +================ +Circular Buffers +================ -By: David Howells <dhowells@redhat.com> - Paul E. McKenney <paulmck@linux.vnet.ibm.com> +:Author: David Howells <dhowells@redhat.com> +:Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Linux provides a number of features that can be used to implement circular @@ -20,7 +20,7 @@ producer and just one consumer. It is possible to handle multiple producers by serialising them, and to handle multiple consumers by serialising them. -Contents: +.. Contents: (*) What is a circular buffer? @@ -31,8 +31,8 @@ Contents: - The consumer. -========================== -WHAT IS A CIRCULAR BUFFER? + +What is a circular buffer? ========================== First of all, what is a circular buffer? A circular buffer is a buffer of @@ -60,9 +60,7 @@ buffer, provided that neither index overtakes the other. The implementer must be careful, however, as a region more than one unit in size may wrap the end of the buffer and be broken into two segments. - -============================ -MEASURING POWER-OF-2 BUFFERS +Measuring power-of-2 buffers ============================ Calculation of the occupancy or the remaining capacity of an arbitrarily sized @@ -71,13 +69,13 @@ modulus (divide) instruction. However, if the buffer is of a power-of-2 size, then a much quicker bitwise-AND instruction can be used instead. Linux provides a set of macros for handling power-of-2 circular buffers. These -can be made use of by: +can be made use of by:: #include <linux/circ_buf.h> The macros are: - (*) Measure the remaining capacity of a buffer: + (#) Measure the remaining capacity of a buffer:: CIRC_SPACE(head_index, tail_index, buffer_size); @@ -85,7 +83,7 @@ The macros are: can be inserted. - (*) Measure the maximum consecutive immediate space in a buffer: + (#) Measure the maximum consecutive immediate space in a buffer:: CIRC_SPACE_TO_END(head_index, tail_index, buffer_size); @@ -94,14 +92,14 @@ The macros are: beginning of the buffer. - (*) Measure the occupancy of a buffer: + (#) Measure the occupancy of a buffer:: CIRC_CNT(head_index, tail_index, buffer_size); This returns the number of items currently occupying a buffer[2]. - (*) Measure the non-wrapping occupancy of a buffer: + (#) Measure the non-wrapping occupancy of a buffer:: CIRC_CNT_TO_END(head_index, tail_index, buffer_size); @@ -112,7 +110,7 @@ The macros are: Each of these macros will nominally return a value between 0 and buffer_size-1, however: - [1] CIRC_SPACE*() are intended to be used in the producer. To the producer + (1) CIRC_SPACE*() are intended to be used in the producer. To the producer they will return a lower bound as the producer controls the head index, but the consumer may still be depleting the buffer on another CPU and moving the tail index. @@ -120,7 +118,7 @@ however: To the consumer it will show an upper bound as the producer may be busy depleting the space. - [2] CIRC_CNT*() are intended to be used in the consumer. To the consumer they + (2) CIRC_CNT*() are intended to be used in the consumer. To the consumer they will return a lower bound as the consumer controls the tail index, but the producer may still be filling the buffer on another CPU and moving the head index. @@ -128,14 +126,12 @@ however: To the producer it will show an upper bound as the consumer may be busy emptying the buffer. - [3] To a third party, the order in which the writes to the indices by the + (3) To a third party, the order in which the writes to the indices by the producer and consumer become visible cannot be guaranteed as they are independent and may be made on different CPUs - so the result in such a situation will merely be a guess, and may even be negative. - -=========================================== -USING MEMORY BARRIERS WITH CIRCULAR BUFFERS +Using memory barriers with circular buffers =========================================== By using memory barriers in conjunction with circular buffers, you can avoid @@ -152,10 +148,10 @@ time, and only one thing should be emptying a buffer at any one time, but the two sides can operate simultaneously. -THE PRODUCER +The producer ------------ -The producer will look something like this: +The producer will look something like this:: spin_lock(&producer_lock); @@ -193,10 +189,10 @@ ordering between the read of the index indicating that the consumer has vacated a given element and the write by the producer to that same element. -THE CONSUMER +The Consumer ------------ -The consumer will look something like this: +The consumer will look something like this:: spin_lock(&consumer_lock); @@ -235,8 +231,7 @@ prevents the compiler from tearing the store, and enforces ordering against previous accesses. -=============== -FURTHER READING +Further reading =============== See also Documentation/memory-barriers.txt for a description of Linux's memory diff --git a/Documentation/clk.txt b/Documentation/clk.txt index 22f026aa2f34..be909ed45970 100644 --- a/Documentation/clk.txt +++ b/Documentation/clk.txt @@ -1,12 +1,16 @@ - The Common Clk Framework - Mike Turquette <mturquette@ti.com> +======================== +The Common Clk Framework +======================== + +:Author: Mike Turquette <mturquette@ti.com> This document endeavours to explain the common clk framework details, and how to port a platform over to this framework. It is not yet a detailed explanation of the clock api in include/linux/clk.h, but perhaps someday it will include that information. - Part 1 - introduction and interface split +Introduction and interface split +================================ The common clk framework is an interface to control the clock nodes available on various devices today. This may come in the form of clock @@ -35,10 +39,11 @@ is defined in struct clk_foo and pointed to within struct clk_core. This allows for easy navigation between the two discrete halves of the common clock interface. - Part 2 - common data structures and api +Common data structures and api +============================== Below is the common struct clk_core definition from -drivers/clk/clk.c, modified for brevity: +drivers/clk/clk.c, modified for brevity:: struct clk_core { const char *name; @@ -59,7 +64,7 @@ struct clk. That api is documented in include/linux/clk.h. Platforms and devices utilizing the common struct clk_core use the struct clk_ops pointer in struct clk_core to perform the hardware-specific parts of -the operations defined in clk-provider.h: +the operations defined in clk-provider.h:: struct clk_ops { int (*prepare)(struct clk_hw *hw); @@ -95,19 +100,20 @@ the operations defined in clk-provider.h: struct dentry *dentry); }; - Part 3 - hardware clk implementations +Hardware clk implementations +============================ The strength of the common struct clk_core comes from its .ops and .hw pointers which abstract the details of struct clk from the hardware-specific bits, and vice versa. To illustrate consider the simple gateable clk implementation in -drivers/clk/clk-gate.c: +drivers/clk/clk-gate.c:: -struct clk_gate { - struct clk_hw hw; - void __iomem *reg; - u8 bit_idx; - ... -}; + struct clk_gate { + struct clk_hw hw; + void __iomem *reg; + u8 bit_idx; + ... + }; struct clk_gate contains struct clk_hw hw as well as hardware-specific knowledge about which register and bit controls this clk's gating. @@ -115,7 +121,7 @@ Nothing about clock topology or accounting, such as enable_count or notifier_count, is needed here. That is all handled by the common framework code and struct clk_core. -Let's walk through enabling this clk from driver code: +Let's walk through enabling this clk from driver code:: struct clk *clk; clk = clk_get(NULL, "my_gateable_clk"); @@ -123,70 +129,71 @@ Let's walk through enabling this clk from driver code: clk_prepare(clk); clk_enable(clk); -The call graph for clk_enable is very simple: +The call graph for clk_enable is very simple:: -clk_enable(clk); - clk->ops->enable(clk->hw); - [resolves to...] - clk_gate_enable(hw); - [resolves struct clk gate with to_clk_gate(hw)] - clk_gate_set_bit(gate); + clk_enable(clk); + clk->ops->enable(clk->hw); + [resolves to...] + clk_gate_enable(hw); + [resolves struct clk gate with to_clk_gate(hw)] + clk_gate_set_bit(gate); -And the definition of clk_gate_set_bit: +And the definition of clk_gate_set_bit:: -static void clk_gate_set_bit(struct clk_gate *gate) -{ - u32 reg; + static void clk_gate_set_bit(struct clk_gate *gate) + { + u32 reg; - reg = __raw_readl(gate->reg); - reg |= BIT(gate->bit_idx); - writel(reg, gate->reg); -} + reg = __raw_readl(gate->reg); + reg |= BIT(gate->bit_idx); + writel(reg, gate->reg); + } -Note that to_clk_gate is defined as: +Note that to_clk_gate is defined as:: -#define to_clk_gate(_hw) container_of(_hw, struct clk_gate, hw) + #define to_clk_gate(_hw) container_of(_hw, struct clk_gate, hw) This pattern of abstraction is used for every clock hardware representation. - Part 4 - supporting your own clk hardware +Supporting your own clk hardware +================================ When implementing support for a new type of clock it is only necessary to -include the following header: +include the following header:: -#include <linux/clk-provider.h> + #include <linux/clk-provider.h> To construct a clk hardware structure for your platform you must define -the following: +the following:: -struct clk_foo { - struct clk_hw hw; - ... hardware specific data goes here ... -}; + struct clk_foo { + struct clk_hw hw; + ... hardware specific data goes here ... + }; To take advantage of your data you'll need to support valid operations -for your clk: +for your clk:: -struct clk_ops clk_foo_ops { - .enable = &clk_foo_enable; - .disable = &clk_foo_disable; -}; + struct clk_ops clk_foo_ops { + .enable = &clk_foo_enable; + .disable = &clk_foo_disable; + }; -Implement the above functions using container_of: +Implement the above functions using container_of:: -#define to_clk_foo(_hw) container_of(_hw, struct clk_foo, hw) + #define to_clk_foo(_hw) container_of(_hw, struct clk_foo, hw) -int clk_foo_enable(struct clk_hw *hw) -{ - struct clk_foo *foo; + int clk_foo_enable(struct clk_hw *hw) + { + struct clk_foo *foo; - foo = to_clk_foo(hw); + foo = to_clk_foo(hw); - ... perform magic on foo ... + ... perform magic on foo ... - return 0; -}; + return 0; + }; Below is a matrix detailing which clk_ops are mandatory based upon the hardware capabilities of that clock. A cell marked as "y" means @@ -194,41 +201,56 @@ mandatory, a cell marked as "n" implies that either including that callback is invalid or otherwise unnecessary. Empty cells are either optional or must be evaluated on a case-by-case basis. - clock hardware characteristics - ----------------------------------------------------------- - | gate | change rate | single parent | multiplexer | root | - |------|-------------|---------------|-------------|------| -.prepare | | | | | | -.unprepare | | | | | | - | | | | | | -.enable | y | | | | | -.disable | y | | | | | -.is_enabled | y | | | | | - | | | | | | -.recalc_rate | | y | | | | -.round_rate | | y [1] | | | | -.determine_rate | | y [1] | | | | -.set_rate | | y | | | | - | | | | | | -.set_parent | | | n | y | n | -.get_parent | | | n | y | n | - | | | | | | -.recalc_accuracy| | | | | | - | | | | | | -.init | | | | | | - ----------------------------------------------------------- -[1] either one of round_rate or determine_rate is required. +.. table:: clock hardware characteristics + + +----------------+------+-------------+---------------+-------------+------+ + | | gate | change rate | single parent | multiplexer | root | + +================+======+=============+===============+=============+======+ + |.prepare | | | | | | + +----------------+------+-------------+---------------+-------------+------+ + |.unprepare | | | | | | + +----------------+------+-------------+---------------+-------------+------+ + +----------------+------+-------------+---------------+-------------+------+ + |.enable | y | | | | | + +----------------+------+-------------+---------------+-------------+------+ + |.disable | y | | | | | + +----------------+------+-------------+---------------+-------------+------+ + |.is_enabled | y | | | | | + +----------------+------+-------------+---------------+-------------+------+ + +----------------+------+-------------+---------------+-------------+------+ + |.recalc_rate | | y | | | | + +----------------+------+-------------+---------------+-------------+------+ + |.round_rate | | y [1]_ | | | | + +----------------+------+-------------+---------------+-------------+------+ + |.determine_rate | | y [1]_ | | | | + +----------------+------+-------------+---------------+-------------+------+ + |.set_rate | | y | | | | + +----------------+------+-------------+---------------+-------------+------+ + +----------------+------+-------------+---------------+-------------+------+ + |.set_parent | | | n | y | n | + +----------------+------+-------------+---------------+-------------+------+ + |.get_parent | | | n | y | n | + +----------------+------+-------------+---------------+-------------+------+ + +----------------+------+-------------+---------------+-------------+------+ + |.recalc_accuracy| | | | | | + +----------------+------+-------------+---------------+-------------+------+ + +----------------+------+-------------+---------------+-------------+------+ + |.init | | | | | | + +----------------+------+-------------+---------------+-------------+------+ + +.. [1] either one of round_rate or determine_rate is required. Finally, register your clock at run-time with a hardware-specific registration function. This function simply populates struct clk_foo's data and then passes the common struct clk parameters to the framework -with a call to: +with a call to:: -clk_register(...) + clk_register(...) -See the basic clock types in drivers/clk/clk-*.c for examples. +See the basic clock types in ``drivers/clk/clk-*.c`` for examples. - Part 5 - Disabling clock gating of unused clocks +Disabling clock gating of unused clocks +======================================= Sometimes during development it can be useful to be able to bypass the default disabling of unused clocks. For example, if drivers aren't enabling @@ -239,7 +261,8 @@ are sorted out. To bypass this disabling, include "clk_ignore_unused" in the bootargs to the kernel. - Part 6 - Locking +Locking +======= The common clock framework uses two global locks, the prepare lock and the enable lock. diff --git a/Documentation/conf.py b/Documentation/conf.py index bacf9d337c89..71b032bb44fd 100644 --- a/Documentation/conf.py +++ b/Documentation/conf.py @@ -271,8 +271,7 @@ latex_elements = { # Additional stuff for the LaTeX preamble. 'preamble': ''' - % Adjust margins - \\usepackage[margin=0.5in, top=1in, bottom=1in]{geometry} + \\usepackage{ifthen} % Allow generate some pages in landscape \\usepackage{lscape} @@ -281,6 +280,7 @@ latex_elements = { \\definecolor{NoteColor}{RGB}{204,255,255} \\definecolor{WarningColor}{RGB}{255,204,204} \\definecolor{AttentionColor}{RGB}{255,255,204} + \\definecolor{ImportantColor}{RGB}{192,255,204} \\definecolor{OtherColor}{RGB}{204,204,204} \\newlength{\\mynoticelength} \\makeatletter\\newenvironment{coloredbox}[1]{% @@ -301,7 +301,12 @@ latex_elements = { \\ifthenelse% {\\equal{\\py@noticetype}{attention}}% {\\colorbox{AttentionColor}{\\usebox{\\@tempboxa}}}% - {\\colorbox{OtherColor}{\\usebox{\\@tempboxa}}}% + {% + \\ifthenelse% + {\\equal{\\py@noticetype}{important}}% + {\\colorbox{ImportantColor}{\\usebox{\\@tempboxa}}}% + {\\colorbox{OtherColor}{\\usebox{\\@tempboxa}}}% + }% }% }% }\\makeatother @@ -336,30 +341,51 @@ latex_elements = { if major == 1 and minor > 3: latex_elements['preamble'] += '\\renewcommand*{\\DUrole}[2]{ #2 }\n' +if major == 1 and minor <= 4: + latex_elements['preamble'] += '\\usepackage[margin=0.5in, top=1in, bottom=1in]{geometry}' +elif major == 1 and (minor > 5 or (minor == 5 and patch >= 3)): + latex_elements['sphinxsetup'] = 'hmargin=0.5in, vmargin=0.5in' + + # Grouping the document tree into LaTeX files. List of tuples # (source start file, target name, title, # author, documentclass [howto, manual, or own class]). +# Sorted in alphabetical order latex_documents = [ - ('doc-guide/index', 'kernel-doc-guide.tex', 'Linux Kernel Documentation Guide', - 'The kernel development community', 'manual'), ('admin-guide/index', 'linux-user.tex', 'Linux Kernel User Documentation', 'The kernel development community', 'manual'), ('core-api/index', 'core-api.tex', 'The kernel core API manual', 'The kernel development community', 'manual'), - ('driver-api/index', 'driver-api.tex', 'The kernel driver API manual', + ('crypto/index', 'crypto-api.tex', 'Linux Kernel Crypto API manual', 'The kernel development community', 'manual'), - ('input/index', 'linux-input.tex', 'The Linux input driver subsystem', + ('dev-tools/index', 'dev-tools.tex', 'Development tools for the Kernel', 'The kernel development community', 'manual'), - ('kernel-documentation', 'kernel-documentation.tex', 'The Linux Kernel Documentation', + ('doc-guide/index', 'kernel-doc-guide.tex', 'Linux Kernel Documentation Guide', 'The kernel development community', 'manual'), - ('process/index', 'development-process.tex', 'Linux Kernel Development Documentation', + ('driver-api/index', 'driver-api.tex', 'The kernel driver API manual', + 'The kernel development community', 'manual'), + ('filesystems/index', 'filesystems.tex', 'Linux Filesystems API', 'The kernel development community', 'manual'), ('gpu/index', 'gpu.tex', 'Linux GPU Driver Developer\'s Guide', 'The kernel development community', 'manual'), + ('input/index', 'linux-input.tex', 'The Linux input driver subsystem', + 'The kernel development community', 'manual'), + ('kernel-hacking/index', 'kernel-hacking.tex', 'Unreliable Guide To Hacking The Linux Kernel', + 'The kernel development community', 'manual'), ('media/index', 'media.tex', 'Linux Media Subsystem Documentation', 'The kernel development community', 'manual'), + ('networking/index', 'networking.tex', 'Linux Networking Documentation', + 'The kernel development community', 'manual'), + ('process/index', 'development-process.tex', 'Linux Kernel Development Documentation', + 'The kernel development community', 'manual'), ('security/index', 'security.tex', 'The kernel security subsystem manual', 'The kernel development community', 'manual'), + ('sh/index', 'sh.tex', 'SuperH architecture implementation manual', + 'The kernel development community', 'manual'), + ('sound/index', 'sound.tex', 'Linux Sound Subsystem Documentation', + 'The kernel development community', 'manual'), + ('userspace-api/index', 'userspace-api.tex', 'The Linux kernel user-space API guide', + 'The kernel development community', 'manual'), ] # The name of an image file (relative to this directory) to place at the top of diff --git a/Documentation/core-api/assoc_array.rst b/Documentation/core-api/assoc_array.rst index d83cfff9ea43..8231b915c939 100644 --- a/Documentation/core-api/assoc_array.rst +++ b/Documentation/core-api/assoc_array.rst @@ -10,7 +10,10 @@ properties: 1. Objects are opaque pointers. The implementation does not care where they point (if anywhere) or what they point to (if anything). -.. note:: Pointers to objects _must_ be zero in the least significant bit. + + .. note:: + + Pointers to objects _must_ be zero in the least significant bit. 2. Objects do not need to contain linkage blocks for use by the array. This permits an object to be located in multiple arrays simultaneously. diff --git a/Documentation/core-api/atomic_ops.rst b/Documentation/core-api/atomic_ops.rst index 55e43f1c80de..fce929144ccd 100644 --- a/Documentation/core-api/atomic_ops.rst +++ b/Documentation/core-api/atomic_ops.rst @@ -303,6 +303,11 @@ defined which accomplish this:: void smp_mb__before_atomic(void); void smp_mb__after_atomic(void); +Preceding a non-value-returning read-modify-write atomic operation with +smp_mb__before_atomic() and following it with smp_mb__after_atomic() +provides the same full ordering that is provided by value-returning +read-modify-write atomic operations. + For example, smp_mb__before_atomic() can be used like so:: obj->dead = 1; diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index 62abd36bfffb..0606be3a3111 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -19,6 +19,7 @@ Core utilities workqueue genericirq flexible-arrays + librs Interfaces for kernel debugging =============================== diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst index 9ec8488319dc..17b00914c6ab 100644 --- a/Documentation/core-api/kernel-api.rst +++ b/Documentation/core-api/kernel-api.rst @@ -114,7 +114,7 @@ The Slab Cache User Space Memory Access ------------------------ -.. kernel-doc:: arch/x86/include/asm/uaccess_32.h +.. kernel-doc:: arch/x86/include/asm/uaccess.h :internal: .. kernel-doc:: arch/x86/lib/usercopy_32.c diff --git a/Documentation/core-api/librs.rst b/Documentation/core-api/librs.rst new file mode 100644 index 000000000000..6010f5bc5bf9 --- /dev/null +++ b/Documentation/core-api/librs.rst @@ -0,0 +1,212 @@ +========================================== +Reed-Solomon Library Programming Interface +========================================== + +:Author: Thomas Gleixner + +Introduction +============ + +The generic Reed-Solomon Library provides encoding, decoding and error +correction functions. + +Reed-Solomon codes are used in communication and storage applications to +ensure data integrity. + +This documentation is provided for developers who want to utilize the +functions provided by the library. + +Known Bugs And Assumptions +========================== + +None. + +Usage +===== + +This chapter provides examples of how to use the library. + +Initializing +------------ + +The init function init_rs returns a pointer to an rs decoder structure, +which holds the necessary information for encoding, decoding and error +correction with the given polynomial. It either uses an existing +matching decoder or creates a new one. On creation all the lookup tables +for fast en/decoding are created. The function may take a while, so make +sure not to call it in critical code paths. + +:: + + /* the Reed Solomon control structure */ + static struct rs_control *rs_decoder; + + /* Symbolsize is 10 (bits) + * Primitive polynomial is x^10+x^3+1 + * first consecutive root is 0 + * primitive element to generate roots = 1 + * generator polynomial degree (number of roots) = 6 + */ + rs_decoder = init_rs (10, 0x409, 0, 1, 6); + + +Encoding +-------- + +The encoder calculates the Reed-Solomon code over the given data length +and stores the result in the parity buffer. Note that the parity buffer +must be initialized before calling the encoder. + +The expanded data can be inverted on the fly by providing a non-zero +inversion mask. The expanded data is XOR'ed with the mask. This is used +e.g. for FLASH ECC, where the all 0xFF is inverted to an all 0x00. The +Reed-Solomon code for all 0x00 is all 0x00. The code is inverted before +storing to FLASH so it is 0xFF too. This prevents that reading from an +erased FLASH results in ECC errors. + +The databytes are expanded to the given symbol size on the fly. There is +no support for encoding continuous bitstreams with a symbol size != 8 at +the moment. If it is necessary it should be not a big deal to implement +such functionality. + +:: + + /* Parity buffer. Size = number of roots */ + uint16_t par[6]; + /* Initialize the parity buffer */ + memset(par, 0, sizeof(par)); + /* Encode 512 byte in data8. Store parity in buffer par */ + encode_rs8 (rs_decoder, data8, 512, par, 0); + + +Decoding +-------- + +The decoder calculates the syndrome over the given data length and the +received parity symbols and corrects errors in the data. + +If a syndrome is available from a hardware decoder then the syndrome +calculation is skipped. + +The correction of the data buffer can be suppressed by providing a +correction pattern buffer and an error location buffer to the decoder. +The decoder stores the calculated error location and the correction +bitmask in the given buffers. This is useful for hardware decoders which +use a weird bit ordering scheme. + +The databytes are expanded to the given symbol size on the fly. There is +no support for decoding continuous bitstreams with a symbolsize != 8 at +the moment. If it is necessary it should be not a big deal to implement +such functionality. + +Decoding with syndrome calculation, direct data correction +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + /* Parity buffer. Size = number of roots */ + uint16_t par[6]; + uint8_t data[512]; + int numerr; + /* Receive data */ + ..... + /* Receive parity */ + ..... + /* Decode 512 byte in data8.*/ + numerr = decode_rs8 (rs_decoder, data8, par, 512, NULL, 0, NULL, 0, NULL); + + +Decoding with syndrome given by hardware decoder, direct data correction +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + /* Parity buffer. Size = number of roots */ + uint16_t par[6], syn[6]; + uint8_t data[512]; + int numerr; + /* Receive data */ + ..... + /* Receive parity */ + ..... + /* Get syndrome from hardware decoder */ + ..... + /* Decode 512 byte in data8.*/ + numerr = decode_rs8 (rs_decoder, data8, par, 512, syn, 0, NULL, 0, NULL); + + +Decoding with syndrome given by hardware decoder, no direct data correction. +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Note: It's not necessary to give data and received parity to the +decoder. + +:: + + /* Parity buffer. Size = number of roots */ + uint16_t par[6], syn[6], corr[8]; + uint8_t data[512]; + int numerr, errpos[8]; + /* Receive data */ + ..... + /* Receive parity */ + ..... + /* Get syndrome from hardware decoder */ + ..... + /* Decode 512 byte in data8.*/ + numerr = decode_rs8 (rs_decoder, NULL, NULL, 512, syn, 0, errpos, 0, corr); + for (i = 0; i < numerr; i++) { + do_error_correction_in_your_buffer(errpos[i], corr[i]); + } + + +Cleanup +------- + +The function free_rs frees the allocated resources, if the caller is +the last user of the decoder. + +:: + + /* Release resources */ + free_rs(rs_decoder); + + +Structures +========== + +This chapter contains the autogenerated documentation of the structures +which are used in the Reed-Solomon Library and are relevant for a +developer. + +.. kernel-doc:: include/linux/rslib.h + :internal: + +Public Functions Provided +========================= + +This chapter contains the autogenerated documentation of the +Reed-Solomon functions which are exported. + +.. kernel-doc:: lib/reed_solomon/reed_solomon.c + :export: + +Credits +======= + +The library code for encoding and decoding was written by Phil Karn. + +:: + + Copyright 2002, Phil Karn, KA9Q + May be used under the terms of the GNU General Public License (GPL) + + +The wrapper functions and interfaces are written by Thomas Gleixner. + +Many users have provided bugfixes, improvements and helping hands for +testing. Thanks a lot. + +The following people have contributed to this document: + +Thomas Gleixner\ tglx@linutronix.de diff --git a/Documentation/cpu-load.txt b/Documentation/cpu-load.txt index 287224e57cfc..2d01ce43d2a2 100644 --- a/Documentation/cpu-load.txt +++ b/Documentation/cpu-load.txt @@ -1,9 +1,10 @@ +======== CPU load --------- +======== -Linux exports various bits of information via `/proc/stat' and -`/proc/uptime' that userland tools, such as top(1), use to calculate -the average time system spent in a particular state, for example: +Linux exports various bits of information via ``/proc/stat`` and +``/proc/uptime`` that userland tools, such as top(1), use to calculate +the average time system spent in a particular state, for example:: $ iostat Linux 2.6.18.3-exp (linmac) 02/20/2007 @@ -17,7 +18,7 @@ Here the system thinks that over the default sampling period the system spent 10.01% of the time doing work in user space, 2.92% in the kernel, and was overall 81.63% of the time idle. -In most cases the `/proc/stat' information reflects the reality quite +In most cases the ``/proc/stat`` information reflects the reality quite closely, however due to the nature of how/when the kernel collects this data sometimes it can not be trusted at all. @@ -33,78 +34,78 @@ Example ------- If we imagine the system with one task that periodically burns cycles -in the following manner: +in the following manner:: - time line between two timer interrupts -|--------------------------------------| - ^ ^ - |_ something begins working | - |_ something goes to sleep - (only to be awaken quite soon) + time line between two timer interrupts + |--------------------------------------| + ^ ^ + |_ something begins working | + |_ something goes to sleep + (only to be awaken quite soon) In the above situation the system will be 0% loaded according to the -`/proc/stat' (since the timer interrupt will always happen when the +``/proc/stat`` (since the timer interrupt will always happen when the system is executing the idle handler), but in reality the load is closer to 99%. One can imagine many more situations where this behavior of the kernel -will lead to quite erratic information inside `/proc/stat'. - - -/* gcc -o hog smallhog.c */ -#include <time.h> -#include <limits.h> -#include <signal.h> -#include <sys/time.h> -#define HIST 10 - -static volatile sig_atomic_t stop; - -static void sighandler (int signr) -{ - (void) signr; - stop = 1; -} -static unsigned long hog (unsigned long niters) -{ - stop = 0; - while (!stop && --niters); - return niters; -} -int main (void) -{ - int i; - struct itimerval it = { .it_interval = { .tv_sec = 0, .tv_usec = 1 }, - .it_value = { .tv_sec = 0, .tv_usec = 1 } }; - sigset_t set; - unsigned long v[HIST]; - double tmp = 0.0; - unsigned long n; - signal (SIGALRM, &sighandler); - setitimer (ITIMER_REAL, &it, NULL); - - hog (ULONG_MAX); - for (i = 0; i < HIST; ++i) v[i] = ULONG_MAX - hog (ULONG_MAX); - for (i = 0; i < HIST; ++i) tmp += v[i]; - tmp /= HIST; - n = tmp - (tmp / 3.0); - - sigemptyset (&set); - sigaddset (&set, SIGALRM); - - for (;;) { - hog (n); - sigwait (&set, &i); - } - return 0; -} +will lead to quite erratic information inside ``/proc/stat``:: + + + /* gcc -o hog smallhog.c */ + #include <time.h> + #include <limits.h> + #include <signal.h> + #include <sys/time.h> + #define HIST 10 + + static volatile sig_atomic_t stop; + + static void sighandler (int signr) + { + (void) signr; + stop = 1; + } + static unsigned long hog (unsigned long niters) + { + stop = 0; + while (!stop && --niters); + return niters; + } + int main (void) + { + int i; + struct itimerval it = { .it_interval = { .tv_sec = 0, .tv_usec = 1 }, + .it_value = { .tv_sec = 0, .tv_usec = 1 } }; + sigset_t set; + unsigned long v[HIST]; + double tmp = 0.0; + unsigned long n; + signal (SIGALRM, &sighandler); + setitimer (ITIMER_REAL, &it, NULL); + + hog (ULONG_MAX); + for (i = 0; i < HIST; ++i) v[i] = ULONG_MAX - hog (ULONG_MAX); + for (i = 0; i < HIST; ++i) tmp += v[i]; + tmp /= HIST; + n = tmp - (tmp / 3.0); + + sigemptyset (&set); + sigaddset (&set, SIGALRM); + + for (;;) { + hog (n); + sigwait (&set, &i); + } + return 0; + } References ---------- -http://lkml.org/lkml/2007/2/12/6 -Documentation/filesystems/proc.txt (1.8) +- http://lkml.org/lkml/2007/2/12/6 +- Documentation/filesystems/proc.txt (1.8) Thanks diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt index 127c9d8c2174..c6e7e9196a8b 100644 --- a/Documentation/cputopology.txt +++ b/Documentation/cputopology.txt @@ -1,3 +1,6 @@ +=========================================== +How CPU topology info is exported via sysfs +=========================================== Export CPU topology info via sysfs. Items (attributes) are similar to /proc/cpuinfo output of some architectures: @@ -75,24 +78,26 @@ CONFIG_SCHED_BOOK and CONFIG_DRAWER are currently only used on s390, where they reflect the cpu and cache hierarchy. For an architecture to support this feature, it must define some of -these macros in include/asm-XXX/topology.h: -#define topology_physical_package_id(cpu) -#define topology_core_id(cpu) -#define topology_book_id(cpu) -#define topology_drawer_id(cpu) -#define topology_sibling_cpumask(cpu) -#define topology_core_cpumask(cpu) -#define topology_book_cpumask(cpu) -#define topology_drawer_cpumask(cpu) - -The type of **_id macros is int. -The type of **_cpumask macros is (const) struct cpumask *. The latter -correspond with appropriate **_siblings sysfs attributes (except for +these macros in include/asm-XXX/topology.h:: + + #define topology_physical_package_id(cpu) + #define topology_core_id(cpu) + #define topology_book_id(cpu) + #define topology_drawer_id(cpu) + #define topology_sibling_cpumask(cpu) + #define topology_core_cpumask(cpu) + #define topology_book_cpumask(cpu) + #define topology_drawer_cpumask(cpu) + +The type of ``**_id macros`` is int. +The type of ``**_cpumask macros`` is ``(const) struct cpumask *``. The latter +correspond with appropriate ``**_siblings`` sysfs attributes (except for topology_sibling_cpumask() which corresponds with thread_siblings). To be consistent on all architectures, include/linux/topology.h provides default definitions for any of the above macros that are not defined by include/asm-XXX/topology.h: + 1) physical_package_id: -1 2) core_id: 0 3) sibling_cpumask: just the given CPU @@ -107,6 +112,7 @@ Additionally, CPU topology information is provided under /sys/devices/system/cpu and includes these files. The internal source for the output is in brackets ("[]"). + =========== ========================================================== kernel_max: the maximum CPU index allowed by the kernel configuration. [NR_CPUS-1] @@ -122,6 +128,7 @@ source for the output is in brackets ("[]"). present: CPUs that have been identified as being present in the system. [cpu_present_mask] + =========== ========================================================== The format for the above output is compatible with cpulist_parse() [see <linux/cpumask.h>]. Some examples follow. @@ -129,7 +136,7 @@ The format for the above output is compatible with cpulist_parse() In this example, there are 64 CPUs in the system but cpus 32-63 exceed the kernel max which is limited to 0..31 by the NR_CPUS config option being 32. Note also that CPUs 2 and 4-31 are not online but could be -brought online as they are both present and possible. +brought online as they are both present and possible:: kernel_max: 31 offline: 2,4-31,32-63 @@ -140,7 +147,7 @@ brought online as they are both present and possible. In this example, the NR_CPUS config option is 128, but the kernel was started with possible_cpus=144. There are 4 CPUs in the system and cpu2 was manually taken offline (and is the only CPU that can be brought -online.) +online.):: kernel_max: 127 offline: 2,4-127,128-143 diff --git a/Documentation/crc32.txt b/Documentation/crc32.txt index a08a7dd9d625..8a6860f33b4e 100644 --- a/Documentation/crc32.txt +++ b/Documentation/crc32.txt @@ -1,4 +1,6 @@ -A brief CRC tutorial. +================================= +brief tutorial on CRC computation +================================= A CRC is a long-division remainder. You add the CRC to the message, and the whole thing (message+CRC) is a multiple of the given @@ -8,7 +10,8 @@ remainder computed on the message+CRC is 0. This latter approach is used by a lot of hardware implementations, and is why so many protocols put the end-of-frame flag after the CRC. -It's actually the same long division you learned in school, except that +It's actually the same long division you learned in school, except that: + - We're working in binary, so the digits are only 0 and 1, and - When dividing polynomials, there are no carries. Rather than add and subtract, we just xor. Thus, we tend to get a bit sloppy about @@ -40,11 +43,12 @@ throw the quotient bit away, but subtract the appropriate multiple of the polynomial from the remainder and we're back to where we started, ready to process the next bit. -A big-endian CRC written this way would be coded like: -for (i = 0; i < input_bits; i++) { - multiple = remainder & 0x80000000 ? CRCPOLY : 0; - remainder = (remainder << 1 | next_input_bit()) ^ multiple; -} +A big-endian CRC written this way would be coded like:: + + for (i = 0; i < input_bits; i++) { + multiple = remainder & 0x80000000 ? CRCPOLY : 0; + remainder = (remainder << 1 | next_input_bit()) ^ multiple; + } Notice how, to get at bit 32 of the shifted remainder, we look at bit 31 of the remainder *before* shifting it. @@ -54,25 +58,26 @@ the remainder don't actually affect any decision-making until 32 bits later. Thus, the first 32 cycles of this are pretty boring. Also, to add the CRC to a message, we need a 32-bit-long hole for it at the end, so we have to add 32 extra cycles shifting in zeros at the -end of every message, +end of every message. These details lead to a standard trick: rearrange merging in the next_input_bit() until the moment it's needed. Then the first 32 cycles can be precomputed, and merging in the final 32 zero bits to make room -for the CRC can be skipped entirely. This changes the code to: +for the CRC can be skipped entirely. This changes the code to:: -for (i = 0; i < input_bits; i++) { - remainder ^= next_input_bit() << 31; - multiple = (remainder & 0x80000000) ? CRCPOLY : 0; - remainder = (remainder << 1) ^ multiple; -} + for (i = 0; i < input_bits; i++) { + remainder ^= next_input_bit() << 31; + multiple = (remainder & 0x80000000) ? CRCPOLY : 0; + remainder = (remainder << 1) ^ multiple; + } -With this optimization, the little-endian code is particularly simple: -for (i = 0; i < input_bits; i++) { - remainder ^= next_input_bit(); - multiple = (remainder & 1) ? CRCPOLY : 0; - remainder = (remainder >> 1) ^ multiple; -} +With this optimization, the little-endian code is particularly simple:: + + for (i = 0; i < input_bits; i++) { + remainder ^= next_input_bit(); + multiple = (remainder & 1) ? CRCPOLY : 0; + remainder = (remainder >> 1) ^ multiple; + } The most significant coefficient of the remainder polynomial is stored in the least significant bit of the binary "remainder" variable. @@ -81,23 +86,25 @@ be bit-reversed) and next_input_bit(). As long as next_input_bit is returning the bits in a sensible order, we don't *have* to wait until the last possible moment to merge in additional bits. -We can do it 8 bits at a time rather than 1 bit at a time: -for (i = 0; i < input_bytes; i++) { - remainder ^= next_input_byte() << 24; - for (j = 0; j < 8; j++) { - multiple = (remainder & 0x80000000) ? CRCPOLY : 0; - remainder = (remainder << 1) ^ multiple; +We can do it 8 bits at a time rather than 1 bit at a time:: + + for (i = 0; i < input_bytes; i++) { + remainder ^= next_input_byte() << 24; + for (j = 0; j < 8; j++) { + multiple = (remainder & 0x80000000) ? CRCPOLY : 0; + remainder = (remainder << 1) ^ multiple; + } } -} -Or in little-endian: -for (i = 0; i < input_bytes; i++) { - remainder ^= next_input_byte(); - for (j = 0; j < 8; j++) { - multiple = (remainder & 1) ? CRCPOLY : 0; - remainder = (remainder >> 1) ^ multiple; +Or in little-endian:: + + for (i = 0; i < input_bytes; i++) { + remainder ^= next_input_byte(); + for (j = 0; j < 8; j++) { + multiple = (remainder & 1) ? CRCPOLY : 0; + remainder = (remainder >> 1) ^ multiple; + } } -} If the input is a multiple of 32 bits, you can even XOR in a 32-bit word at a time and increase the inner loop count to 32. diff --git a/Documentation/crypto/api-samples.rst b/Documentation/crypto/api-samples.rst index d021fd96a76d..2531948db89f 100644 --- a/Documentation/crypto/api-samples.rst +++ b/Documentation/crypto/api-samples.rst @@ -155,9 +155,9 @@ Code Example For Use of Operational State Memory With SHASH char ctx[]; }; - static struct sdesc init_sdesc(struct crypto_shash *alg) + static struct sdesc *init_sdesc(struct crypto_shash *alg) { - struct sdesc sdesc; + struct sdesc *sdesc; int size; size = sizeof(struct shash_desc) + crypto_shash_descsize(alg); @@ -169,15 +169,16 @@ Code Example For Use of Operational State Memory With SHASH return sdesc; } - static int calc_hash(struct crypto_shashalg, - const unsigned chardata, unsigned int datalen, - unsigned chardigest) { - struct sdesc sdesc; + static int calc_hash(struct crypto_shash *alg, + const unsigned char *data, unsigned int datalen, + unsigned char *digest) + { + struct sdesc *sdesc; int ret; sdesc = init_sdesc(alg); if (IS_ERR(sdesc)) { - pr_info("trusted_key: can't alloc %s\n", hash_alg); + pr_info("can't alloc sdesc\n"); return PTR_ERR(sdesc); } @@ -186,6 +187,23 @@ Code Example For Use of Operational State Memory With SHASH return ret; } + static int test_hash(const unsigned char *data, unsigned int datalen, + unsigned char *digest) + { + struct crypto_shash *alg; + char *hash_alg_name = "sha1-padlock-nano"; + int ret; + + alg = crypto_alloc_shash(hash_alg_name, CRYPTO_ALG_TYPE_SHASH, 0); + if (IS_ERR(alg)) { + pr_info("can't alloc alg %s\n", hash_alg_name); + return PTR_ERR(alg); + } + ret = calc_hash(alg, data, datalen, digest); + crypto_free_shash(alg); + return ret; + } + Code Example For Random Number Generator Usage ---------------------------------------------- @@ -195,8 +213,8 @@ Code Example For Random Number Generator Usage static int get_random_numbers(u8 *buf, unsigned int len) { - struct crypto_rngrng = NULL; - chardrbg = "drbg_nopr_sha256"; /* Hash DRBG with SHA-256, no PR */ + struct crypto_rng *rng = NULL; + char *drbg = "drbg_nopr_sha256"; /* Hash DRBG with SHA-256, no PR */ int ret; if (!buf || !len) { @@ -207,7 +225,7 @@ Code Example For Random Number Generator Usage rng = crypto_alloc_rng(drbg, 0, 0); if (IS_ERR(rng)) { pr_debug("could not allocate RNG handle for %s\n", drbg); - return -PTR_ERR(rng); + return PTR_ERR(rng); } ret = crypto_rng_get_bytes(rng, buf, len); diff --git a/Documentation/crypto/asymmetric-keys.txt b/Documentation/crypto/asymmetric-keys.txt index 5ad6480e3fb9..5969bf42562a 100644 --- a/Documentation/crypto/asymmetric-keys.txt +++ b/Documentation/crypto/asymmetric-keys.txt @@ -10,6 +10,7 @@ Contents: - Signature verification. - Asymmetric key subtypes. - Instantiation data parsers. + - Keyring link restrictions. ======== @@ -265,7 +266,7 @@ mandatory: The caller passes a pointer to the following struct with all of the fields cleared, except for data, datalen and quotalen [see - Documentation/security/keys.txt]. + Documentation/security/keys/core.rst]. struct key_preparsed_payload { char *description; @@ -318,7 +319,8 @@ KEYRING LINK RESTRICTIONS ========================= Keyrings created from userspace using add_key can be configured to check the -signature of the key being linked. +signature of the key being linked. Keys without a valid signature are not +allowed to link. Several restriction methods are available: @@ -327,9 +329,10 @@ Several restriction methods are available: - Option string used with KEYCTL_RESTRICT_KEYRING: - "builtin_trusted" - The kernel builtin trusted keyring will be searched for the signing - key. The ca_keys kernel parameter also affects which keys are used for - signature verification. + The kernel builtin trusted keyring will be searched for the signing key. + If the builtin trusted keyring is not configured, all links will be + rejected. The ca_keys kernel parameter also affects which keys are used + for signature verification. (2) Restrict using the kernel builtin and secondary trusted keyrings @@ -337,8 +340,10 @@ Several restriction methods are available: - "builtin_and_secondary_trusted" The kernel builtin and secondary trusted keyrings will be searched for the - signing key. The ca_keys kernel parameter also affects which keys are used - for signature verification. + signing key. If the secondary trusted keyring is not configured, this + restriction will behave like the "builtin_trusted" option. The ca_keys + kernel parameter also affects which keys are used for signature + verification. (3) Restrict using a separate key or keyring @@ -346,7 +351,7 @@ Several restriction methods are available: - "key_or_keyring:<key or keyring serial number>[:chain]" Whenever a key link is requested, the link will only succeed if the key - being linked is signed by one of the designated keys. This key may be + being linked is signed by one of the designated keys. This key may be specified directly by providing a serial number for one asymmetric key, or a group of keys may be searched for the signing key by providing the serial number for a keyring. @@ -354,7 +359,51 @@ Several restriction methods are available: When the "chain" option is provided at the end of the string, the keys within the destination keyring will also be searched for signing keys. This allows for verification of certificate chains by adding each - cert in order (starting closest to the root) to one keyring. + certificate in order (starting closest to the root) to a keyring. For + instance, one keyring can be populated with links to a set of root + certificates, with a separate, restricted keyring set up for each + certificate chain to be validated: + + # Create and populate a keyring for root certificates + root_id=`keyctl add keyring root-certs "" @s` + keyctl padd asymmetric "" $root_id < root1.cert + keyctl padd asymmetric "" $root_id < root2.cert + + # Create and restrict a keyring for the certificate chain + chain_id=`keyctl add keyring chain "" @s` + keyctl restrict_keyring $chain_id asymmetric key_or_keyring:$root_id:chain + + # Attempt to add each certificate in the chain, starting with the + # certificate closest to the root. + keyctl padd asymmetric "" $chain_id < intermediateA.cert + keyctl padd asymmetric "" $chain_id < intermediateB.cert + keyctl padd asymmetric "" $chain_id < end-entity.cert + + If the final end-entity certificate is successfully added to the "chain" + keyring, we can be certain that it has a valid signing chain going back to + one of the root certificates. + + A single keyring can be used to verify a chain of signatures by + restricting the keyring after linking the root certificate: + + # Create a keyring for the certificate chain and add the root + chain2_id=`keyctl add keyring chain2 "" @s` + keyctl padd asymmetric "" $chain2_id < root1.cert + + # Restrict the keyring that already has root1.cert linked. The cert + # will remain linked by the keyring. + keyctl restrict_keyring $chain2_id asymmetric key_or_keyring:0:chain + + # Attempt to add each certificate in the chain, starting with the + # certificate closest to the root. + keyctl padd asymmetric "" $chain2_id < intermediateA.cert + keyctl padd asymmetric "" $chain2_id < intermediateB.cert + keyctl padd asymmetric "" $chain2_id < end-entity.cert + + If the final end-entity certificate is successfully added to the "chain2" + keyring, we can be certain that there is a valid signing chain going back + to the root certificate that was added before the keyring was restricted. + In all of these cases, if the signing key is found the signature of the key to be linked will be verified using the signing key. The requested key is added diff --git a/Documentation/crypto/conf.py b/Documentation/crypto/conf.py new file mode 100644 index 000000000000..4335d251ddf3 --- /dev/null +++ b/Documentation/crypto/conf.py @@ -0,0 +1,10 @@ +# -*- coding: utf-8; mode: python -*- + +project = 'Linux Kernel Crypto API' + +tags.add("subproject") + +latex_documents = [ + ('index', 'crypto-api.tex', 'Linux Kernel Crypto API manual', + 'The kernel development community', 'manual'), +] diff --git a/Documentation/crypto/userspace-if.rst b/Documentation/crypto/userspace-if.rst index de5a72e32bc9..ff86befa61e0 100644 --- a/Documentation/crypto/userspace-if.rst +++ b/Documentation/crypto/userspace-if.rst @@ -327,7 +327,7 @@ boundary. Non-aligned data can be used as well, but may require more operations of the kernel which would defeat the speed gains obtained from the zero-copy interface. -The system-interent limit for the size of one zero-copy operation is 16 +The system-inherent limit for the size of one zero-copy operation is 16 pages. If more data is to be sent to AF_ALG, user space must slice the input into segments with a maximum size of 16 pages. diff --git a/Documentation/dcdbas.txt b/Documentation/dcdbas.txt index e1c52e2dc361..309cc57a7c1c 100644 --- a/Documentation/dcdbas.txt +++ b/Documentation/dcdbas.txt @@ -1,4 +1,9 @@ +=================================== +Dell Systems Management Base Driver +=================================== + Overview +======== The Dell Systems Management Base Driver provides a sysfs interface for systems management software such as Dell OpenManage to perform system @@ -17,6 +22,7 @@ more information about the libsmbios project. System Management Interrupt +=========================== On some Dell systems, systems management software must access certain management information via a system management interrupt (SMI). The SMI data @@ -24,12 +30,12 @@ buffer must reside in 32-bit address space, and the physical address of the buffer is required for the SMI. The driver maintains the memory required for the SMI and provides a way for the application to generate the SMI. The driver creates the following sysfs entries for systems management -software to perform these system management interrupts: +software to perform these system management interrupts:: -/sys/devices/platform/dcdbas/smi_data -/sys/devices/platform/dcdbas/smi_data_buf_phys_addr -/sys/devices/platform/dcdbas/smi_data_buf_size -/sys/devices/platform/dcdbas/smi_request + /sys/devices/platform/dcdbas/smi_data + /sys/devices/platform/dcdbas/smi_data_buf_phys_addr + /sys/devices/platform/dcdbas/smi_data_buf_size + /sys/devices/platform/dcdbas/smi_request Systems management software must perform the following steps to execute a SMI using this driver: @@ -43,6 +49,7 @@ a SMI using this driver: Host Control Action +=================== Dell OpenManage supports a host control feature that allows the administrator to perform a power cycle or power off of the system after the OS has finished @@ -69,12 +76,14 @@ power off host control action using this driver: Host Control SMI Type +===================== The following table shows the value to write to host_control_smi_type to perform a power cycle or power off host control action: +=================== ===================== PowerEdge System Host Control SMI Type ----------------- --------------------- +=================== ===================== 300 HC_SMITYPE_TYPE1 1300 HC_SMITYPE_TYPE1 1400 HC_SMITYPE_TYPE2 @@ -87,5 +96,4 @@ PowerEdge System Host Control SMI Type 1655MC HC_SMITYPE_TYPE2 700 HC_SMITYPE_TYPE3 750 HC_SMITYPE_TYPE3 - - +=================== ===================== diff --git a/Documentation/debugging-via-ohci1394.txt b/Documentation/debugging-via-ohci1394.txt index 9ff026d22b75..981ad4f89fd3 100644 --- a/Documentation/debugging-via-ohci1394.txt +++ b/Documentation/debugging-via-ohci1394.txt @@ -1,6 +1,6 @@ - - Using physical DMA provided by OHCI-1394 FireWire controllers for debugging - --------------------------------------------------------------------------- +=========================================================================== +Using physical DMA provided by OHCI-1394 FireWire controllers for debugging +=========================================================================== Introduction ------------ @@ -91,10 +91,10 @@ Step-by-step instructions for using firescope with early OHCI initialization: 1) Verify that your hardware is supported: Load the firewire-ohci module and check your kernel logs. - You should see a line similar to + You should see a line similar to:: - firewire_ohci 0000:15:00.1: added OHCI v1.0 device as card 2, 4 IR + 4 IT - ... contexts, quirks 0x11 + firewire_ohci 0000:15:00.1: added OHCI v1.0 device as card 2, 4 IR + 4 IT + ... contexts, quirks 0x11 when loading the driver. If you have no supported controller, many PCI, CardBus and even some Express cards which are fully compliant to OHCI-1394 @@ -113,9 +113,9 @@ Step-by-step instructions for using firescope with early OHCI initialization: stable connection and has matching connectors (there are small 4-pin and large 6-pin FireWire ports) will do. - If an driver is running on both machines you should see a line like + If an driver is running on both machines you should see a line like:: - firewire_core 0000:15:00.1: created device fw1: GUID 00061b0020105917, S400 + firewire_core 0000:15:00.1: created device fw1: GUID 00061b0020105917, S400 on both machines in the kernel log when the cable is plugged in and connects the two machines. @@ -123,7 +123,7 @@ Step-by-step instructions for using firescope with early OHCI initialization: 3) Test physical DMA using firescope: On the debug host, make sure that /dev/fw* is accessible, - then start firescope: + then start firescope:: $ firescope Port 0 (/dev/fw1) opened, 2 nodes detected @@ -163,7 +163,7 @@ Step-by-step instructions for using firescope with early OHCI initialization: host loaded, reboot the debugged machine, booting the kernel which has CONFIG_PROVIDE_OHCI1394_DMA_INIT enabled, with the option ohci1394_dma=early. - Then, on the debugging host, run firescope, for example by using -A: + Then, on the debugging host, run firescope, for example by using -A:: firescope -A System.map-of-debug-target-kernel @@ -178,6 +178,7 @@ Step-by-step instructions for using firescope with early OHCI initialization: Notes ----- + Documentation and specifications: http://halobates.de/firewire/ FireWire is a trademark of Apple Inc. - for more information please refer to: diff --git a/Documentation/dell_rbu.txt b/Documentation/dell_rbu.txt index d262e22bddec..0fdb6aa2704c 100644 --- a/Documentation/dell_rbu.txt +++ b/Documentation/dell_rbu.txt @@ -1,18 +1,30 @@ -Purpose: -Demonstrate the usage of the new open sourced rbu (Remote BIOS Update) driver +============================================================= +Usage of the new open sourced rbu (Remote BIOS Update) driver +============================================================= + +Purpose +======= + +Document demonstrating the use of the Dell Remote BIOS Update driver. for updating BIOS images on Dell servers and desktops. -Scope: +Scope +===== + This document discusses the functionality of the rbu driver only. It does not cover the support needed from applications to enable the BIOS to update itself with the image downloaded in to the memory. -Overview: +Overview +======== + This driver works with Dell OpenManage or Dell Update Packages for updating the BIOS on Dell servers (starting from servers sold since 1999), desktops and notebooks (starting from those sold in 2005). + Please go to http://support.dell.com register and you can find info on OpenManage and Dell Update packages (DUP). + Libsmbios can also be used to update BIOS on Dell systems go to http://linux.dell.com/libsmbios/ for details. @@ -22,6 +34,7 @@ of physical pages having the BIOS image. In case of packetized the app using the driver breaks the image in to packets of fixed sizes and the driver would place each packet in contiguous physical memory. The driver also maintains a link list of packets for reading them back. + If the dell_rbu driver is unloaded all the allocated memory is freed. The rbu driver needs to have an application (as mentioned above)which will @@ -30,28 +43,33 @@ inform the BIOS to enable the update in the next system reboot. The user should not unload the rbu driver after downloading the BIOS image or updating. -The driver load creates the following directories under the /sys file system. -/sys/class/firmware/dell_rbu/loading -/sys/class/firmware/dell_rbu/data -/sys/devices/platform/dell_rbu/image_type -/sys/devices/platform/dell_rbu/data -/sys/devices/platform/dell_rbu/packet_size +The driver load creates the following directories under the /sys file system:: + + /sys/class/firmware/dell_rbu/loading + /sys/class/firmware/dell_rbu/data + /sys/devices/platform/dell_rbu/image_type + /sys/devices/platform/dell_rbu/data + /sys/devices/platform/dell_rbu/packet_size The driver supports two types of update mechanism; monolithic and packetized. These update mechanism depends upon the BIOS currently running on the system. Most of the Dell systems support a monolithic update where the BIOS image is copied to a single contiguous block of physical memory. + In case of packet mechanism the single memory can be broken in smaller chunks of contiguous memory and the BIOS image is scattered in these packets. By default the driver uses monolithic memory for the update type. This can be changed to packets during the driver load time by specifying the load -parameter image_type=packet. This can also be changed later as below -echo packet > /sys/devices/platform/dell_rbu/image_type +parameter image_type=packet. This can also be changed later as below:: + + echo packet > /sys/devices/platform/dell_rbu/image_type In packet update mode the packet size has to be given before any packets can -be downloaded. It is done as below -echo XXXX > /sys/devices/platform/dell_rbu/packet_size +be downloaded. It is done as below:: + + echo XXXX > /sys/devices/platform/dell_rbu/packet_size + In the packet update mechanism, the user needs to create a new file having packets of data arranged back to back. It can be done as follows The user creates packets header, gets the chunk of the BIOS image and @@ -60,41 +78,54 @@ added together should match the specified packet_size. This makes one packet, the user needs to create more such packets out of the entire BIOS image file and then arrange all these packets back to back in to one single file. + This file is then copied to /sys/class/firmware/dell_rbu/data. Once this file gets to the driver, the driver extracts packet_size data from the file and spreads it across the physical memory in contiguous packet_sized space. + This method makes sure that all the packets get to the driver in a single operation. In monolithic update the user simply get the BIOS image (.hdr file) and copies to the data file as is without any change to the BIOS image itself. Do the steps below to download the BIOS image. + 1) echo 1 > /sys/class/firmware/dell_rbu/loading 2) cp bios_image.hdr /sys/class/firmware/dell_rbu/data 3) echo 0 > /sys/class/firmware/dell_rbu/loading The /sys/class/firmware/dell_rbu/ entries will remain till the following is done. -echo -1 > /sys/class/firmware/dell_rbu/loading + +:: + + echo -1 > /sys/class/firmware/dell_rbu/loading + Until this step is completed the driver cannot be unloaded. + Also echoing either mono, packet or init in to image_type will free up the memory allocated by the driver. If a user by accident executes steps 1 and 3 above without executing step 2; it will make the /sys/class/firmware/dell_rbu/ entries disappear. -The entries can be recreated by doing the following -echo init > /sys/devices/platform/dell_rbu/image_type -NOTE: echoing init in image_type does not change it original value. + +The entries can be recreated by doing the following:: + + echo init > /sys/devices/platform/dell_rbu/image_type + +.. note:: echoing init in image_type does not change it original value. Also the driver provides /sys/devices/platform/dell_rbu/data readonly file to read back the image downloaded. -NOTE: -This driver requires a patch for firmware_class.c which has the modified -request_firmware_nowait function. -Also after updating the BIOS image a user mode application needs to execute -code which sends the BIOS update request to the BIOS. So on the next reboot -the BIOS knows about the new image downloaded and it updates itself. -Also don't unload the rbu driver if the image has to be updated. +.. note:: + + This driver requires a patch for firmware_class.c which has the modified + request_firmware_nowait function. + + Also after updating the BIOS image a user mode application needs to execute + code which sends the BIOS update request to the BIOS. So on the next reboot + the BIOS knows about the new image downloaded and it updates itself. + Also don't unload the rbu driver if the image has to be updated. diff --git a/Documentation/dev-tools/index.rst b/Documentation/dev-tools/index.rst index 07d881147ef3..a81787cd47d7 100644 --- a/Documentation/dev-tools/index.rst +++ b/Documentation/dev-tools/index.rst @@ -23,6 +23,8 @@ whole; patches welcome! kmemleak kmemcheck gdb-kernel-debugging + kgdb + kselftest .. only:: subproject and html diff --git a/Documentation/dev-tools/kgdb.rst b/Documentation/dev-tools/kgdb.rst new file mode 100644 index 000000000000..75273203a35a --- /dev/null +++ b/Documentation/dev-tools/kgdb.rst @@ -0,0 +1,907 @@ +================================================= +Using kgdb, kdb and the kernel debugger internals +================================================= + +:Author: Jason Wessel + +Introduction +============ + +The kernel has two different debugger front ends (kdb and kgdb) which +interface to the debug core. It is possible to use either of the +debugger front ends and dynamically transition between them if you +configure the kernel properly at compile and runtime. + +Kdb is simplistic shell-style interface which you can use on a system +console with a keyboard or serial console. You can use it to inspect +memory, registers, process lists, dmesg, and even set breakpoints to +stop in a certain location. Kdb is not a source level debugger, although +you can set breakpoints and execute some basic kernel run control. Kdb +is mainly aimed at doing some analysis to aid in development or +diagnosing kernel problems. You can access some symbols by name in +kernel built-ins or in kernel modules if the code was built with +``CONFIG_KALLSYMS``. + +Kgdb is intended to be used as a source level debugger for the Linux +kernel. It is used along with gdb to debug a Linux kernel. The +expectation is that gdb can be used to "break in" to the kernel to +inspect memory, variables and look through call stack information +similar to the way an application developer would use gdb to debug an +application. It is possible to place breakpoints in kernel code and +perform some limited execution stepping. + +Two machines are required for using kgdb. One of these machines is a +development machine and the other is the target machine. The kernel to +be debugged runs on the target machine. The development machine runs an +instance of gdb against the vmlinux file which contains the symbols (not +a boot image such as bzImage, zImage, uImage...). In gdb the developer +specifies the connection parameters and connects to kgdb. The type of +connection a developer makes with gdb depends on the availability of +kgdb I/O modules compiled as built-ins or loadable kernel modules in the +test machine's kernel. + +Compiling a kernel +================== + +- In order to enable compilation of kdb, you must first enable kgdb. + +- The kgdb test compile options are described in the kgdb test suite + chapter. + +Kernel config options for kgdb +------------------------------ + +To enable ``CONFIG_KGDB`` you should look under +:menuselection:`Kernel hacking --> Kernel debugging` and select +:menuselection:`KGDB: kernel debugger`. + +While it is not a hard requirement that you have symbols in your vmlinux +file, gdb tends not to be very useful without the symbolic data, so you +will want to turn on ``CONFIG_DEBUG_INFO`` which is called +:menuselection:`Compile the kernel with debug info` in the config menu. + +It is advised, but not required, that you turn on the +``CONFIG_FRAME_POINTER`` kernel option which is called :menuselection:`Compile +the kernel with frame pointers` in the config menu. This option inserts code +to into the compiled executable which saves the frame information in +registers or on the stack at different points which allows a debugger +such as gdb to more accurately construct stack back traces while +debugging the kernel. + +If the architecture that you are using supports the kernel option +``CONFIG_STRICT_KERNEL_RWX``, you should consider turning it off. This +option will prevent the use of software breakpoints because it marks +certain regions of the kernel's memory space as read-only. If kgdb +supports it for the architecture you are using, you can use hardware +breakpoints if you desire to run with the ``CONFIG_STRICT_KERNEL_RWX`` +option turned on, else you need to turn off this option. + +Next you should choose one of more I/O drivers to interconnect debugging +host and debugged target. Early boot debugging requires a KGDB I/O +driver that supports early debugging and the driver must be built into +the kernel directly. Kgdb I/O driver configuration takes place via +kernel or module parameters which you can learn more about in the in the +section that describes the parameter kgdboc. + +Here is an example set of ``.config`` symbols to enable or disable for kgdb:: + + # CONFIG_STRICT_KERNEL_RWX is not set + CONFIG_FRAME_POINTER=y + CONFIG_KGDB=y + CONFIG_KGDB_SERIAL_CONSOLE=y + +Kernel config options for kdb +----------------------------- + +Kdb is quite a bit more complex than the simple gdbstub sitting on top +of the kernel's debug core. Kdb must implement a shell, and also adds +some helper functions in other parts of the kernel, responsible for +printing out interesting data such as what you would see if you ran +``lsmod``, or ``ps``. In order to build kdb into the kernel you follow the +same steps as you would for kgdb. + +The main config option for kdb is ``CONFIG_KGDB_KDB`` which is called +:menuselection:`KGDB_KDB: include kdb frontend for kgdb` in the config menu. +In theory you would have already also selected an I/O driver such as the +``CONFIG_KGDB_SERIAL_CONSOLE`` interface if you plan on using kdb on a +serial port, when you were configuring kgdb. + +If you want to use a PS/2-style keyboard with kdb, you would select +``CONFIG_KDB_KEYBOARD`` which is called :menuselection:`KGDB_KDB: keyboard as +input device` in the config menu. The ``CONFIG_KDB_KEYBOARD`` option is not +used for anything in the gdb interface to kgdb. The ``CONFIG_KDB_KEYBOARD`` +option only works with kdb. + +Here is an example set of ``.config`` symbols to enable/disable kdb:: + + # CONFIG_STRICT_KERNEL_RWX is not set + CONFIG_FRAME_POINTER=y + CONFIG_KGDB=y + CONFIG_KGDB_SERIAL_CONSOLE=y + CONFIG_KGDB_KDB=y + CONFIG_KDB_KEYBOARD=y + +Kernel Debugger Boot Arguments +============================== + +This section describes the various runtime kernel parameters that affect +the configuration of the kernel debugger. The following chapter covers +using kdb and kgdb as well as providing some examples of the +configuration parameters. + +Kernel parameter: kgdboc +------------------------ + +The kgdboc driver was originally an abbreviation meant to stand for +"kgdb over console". Today it is the primary mechanism to configure how +to communicate from gdb to kgdb as well as the devices you want to use +to interact with the kdb shell. + +For kgdb/gdb, kgdboc is designed to work with a single serial port. It +is intended to cover the circumstance where you want to use a serial +console as your primary console as well as using it to perform kernel +debugging. It is also possible to use kgdb on a serial port which is not +designated as a system console. Kgdboc may be configured as a kernel +built-in or a kernel loadable module. You can only make use of +``kgdbwait`` and early debugging if you build kgdboc into the kernel as +a built-in. + +Optionally you can elect to activate kms (Kernel Mode Setting) +integration. When you use kms with kgdboc and you have a video driver +that has atomic mode setting hooks, it is possible to enter the debugger +on the graphics console. When the kernel execution is resumed, the +previous graphics mode will be restored. This integration can serve as a +useful tool to aid in diagnosing crashes or doing analysis of memory +with kdb while allowing the full graphics console applications to run. + +kgdboc arguments +~~~~~~~~~~~~~~~~ + +Usage:: + + kgdboc=[kms][[,]kbd][[,]serial_device][,baud] + +The order listed above must be observed if you use any of the optional +configurations together. + +Abbreviations: + +- kms = Kernel Mode Setting + +- kbd = Keyboard + +You can configure kgdboc to use the keyboard, and/or a serial device +depending on if you are using kdb and/or kgdb, in one of the following +scenarios. The order listed above must be observed if you use any of the +optional configurations together. Using kms + only gdb is generally not +a useful combination. + +Using loadable module or built-in +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +1. As a kernel built-in: + + Use the kernel boot argument:: + + kgdboc=<tty-device>,[baud] + +2. As a kernel loadable module: + + Use the command:: + + modprobe kgdboc kgdboc=<tty-device>,[baud] + + Here are two examples of how you might format the kgdboc string. The + first is for an x86 target using the first serial port. The second + example is for the ARM Versatile AB using the second serial port. + + 1. ``kgdboc=ttyS0,115200`` + + 2. ``kgdboc=ttyAMA1,115200`` + +Configure kgdboc at runtime with sysfs +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +At run time you can enable or disable kgdboc by echoing a parameters +into the sysfs. Here are two examples: + +1. Enable kgdboc on ttyS0:: + + echo ttyS0 > /sys/module/kgdboc/parameters/kgdboc + +2. Disable kgdboc:: + + echo "" > /sys/module/kgdboc/parameters/kgdboc + +.. note:: + + You do not need to specify the baud if you are configuring the + console on tty which is already configured or open. + +More examples +^^^^^^^^^^^^^ + +You can configure kgdboc to use the keyboard, and/or a serial device +depending on if you are using kdb and/or kgdb, in one of the following +scenarios. + +1. kdb and kgdb over only a serial port:: + + kgdboc=<serial_device>[,baud] + + Example:: + + kgdboc=ttyS0,115200 + +2. kdb and kgdb with keyboard and a serial port:: + + kgdboc=kbd,<serial_device>[,baud] + + Example:: + + kgdboc=kbd,ttyS0,115200 + +3. kdb with a keyboard:: + + kgdboc=kbd + +4. kdb with kernel mode setting:: + + kgdboc=kms,kbd + +5. kdb with kernel mode setting and kgdb over a serial port:: + + kgdboc=kms,kbd,ttyS0,115200 + +.. note:: + + Kgdboc does not support interrupting the target via the gdb remote + protocol. You must manually send a :kbd:`SysRq-G` unless you have a proxy + that splits console output to a terminal program. A console proxy has a + separate TCP port for the debugger and a separate TCP port for the + "human" console. The proxy can take care of sending the :kbd:`SysRq-G` + for you. + +When using kgdboc with no debugger proxy, you can end up connecting the +debugger at one of two entry points. If an exception occurs after you +have loaded kgdboc, a message should print on the console stating it is +waiting for the debugger. In this case you disconnect your terminal +program and then connect the debugger in its place. If you want to +interrupt the target system and forcibly enter a debug session you have +to issue a :kbd:`Sysrq` sequence and then type the letter :kbd:`g`. Then you +disconnect the terminal session and connect gdb. Your options if you +don't like this are to hack gdb to send the :kbd:`SysRq-G` for you as well as +on the initial connect, or to use a debugger proxy that allows an +unmodified gdb to do the debugging. + +Kernel parameter: ``kgdbwait`` +------------------------------ + +The Kernel command line option ``kgdbwait`` makes kgdb wait for a +debugger connection during booting of a kernel. You can only use this +option if you compiled a kgdb I/O driver into the kernel and you +specified the I/O driver configuration as a kernel command line option. +The kgdbwait parameter should always follow the configuration parameter +for the kgdb I/O driver in the kernel command line else the I/O driver +will not be configured prior to asking the kernel to use it to wait. + +The kernel will stop and wait as early as the I/O driver and +architecture allows when you use this option. If you build the kgdb I/O +driver as a loadable kernel module kgdbwait will not do anything. + +Kernel parameter: ``kgdbcon`` +----------------------------- + +The ``kgdbcon`` feature allows you to see :c:func:`printk` messages inside gdb +while gdb is connected to the kernel. Kdb does not make use of the kgdbcon +feature. + +Kgdb supports using the gdb serial protocol to send console messages to +the debugger when the debugger is connected and running. There are two +ways to activate this feature. + +1. Activate with the kernel command line option:: + + kgdbcon + +2. Use sysfs before configuring an I/O driver:: + + echo 1 > /sys/module/kgdb/parameters/kgdb_use_con + +.. note:: + + If you do this after you configure the kgdb I/O driver, the + setting will not take effect until the next point the I/O is + reconfigured. + +.. important:: + + You cannot use kgdboc + kgdbcon on a tty that is an + active system console. An example of incorrect usage is:: + + console=ttyS0,115200 kgdboc=ttyS0 kgdbcon + +It is possible to use this option with kgdboc on a tty that is not a +system console. + +Run time parameter: ``kgdbreboot`` +---------------------------------- + +The kgdbreboot feature allows you to change how the debugger deals with +the reboot notification. You have 3 choices for the behavior. The +default behavior is always set to 0. + +.. tabularcolumns:: |p{0.4cm}|p{11.5cm}|p{5.6cm}| + +.. flat-table:: + :widths: 1 10 8 + + * - 1 + - ``echo -1 > /sys/module/debug_core/parameters/kgdbreboot`` + - Ignore the reboot notification entirely. + + * - 2 + - ``echo 0 > /sys/module/debug_core/parameters/kgdbreboot`` + - Send the detach message to any attached debugger client. + + * - 3 + - ``echo 1 > /sys/module/debug_core/parameters/kgdbreboot`` + - Enter the debugger on reboot notify. + +Using kdb +========= + +Quick start for kdb on a serial port +------------------------------------ + +This is a quick example of how to use kdb. + +1. Configure kgdboc at boot using kernel parameters:: + + console=ttyS0,115200 kgdboc=ttyS0,115200 + + OR + + Configure kgdboc after the kernel has booted; assuming you are using + a serial port console:: + + echo ttyS0 > /sys/module/kgdboc/parameters/kgdboc + +2. Enter the kernel debugger manually or by waiting for an oops or + fault. There are several ways you can enter the kernel debugger + manually; all involve using the :kbd:`SysRq-G`, which means you must have + enabled ``CONFIG_MAGIC_SysRq=y`` in your kernel config. + + - When logged in as root or with a super user session you can run:: + + echo g > /proc/sysrq-trigger + + - Example using minicom 2.2 + + Press: :kbd:`CTRL-A` :kbd:`f` :kbd:`g` + + - When you have telneted to a terminal server that supports sending + a remote break + + Press: :kbd:`CTRL-]` + + Type in: ``send break`` + + Press: :kbd:`Enter` :kbd:`g` + +3. From the kdb prompt you can run the ``help`` command to see a complete + list of the commands that are available. + + Some useful commands in kdb include: + + =========== ================================================================= + ``lsmod`` Shows where kernel modules are loaded + ``ps`` Displays only the active processes + ``ps A`` Shows all the processes + ``summary`` Shows kernel version info and memory usage + ``bt`` Get a backtrace of the current process using :c:func:`dump_stack` + ``dmesg`` View the kernel syslog buffer + ``go`` Continue the system + =========== ================================================================= + +4. When you are done using kdb you need to consider rebooting the system + or using the ``go`` command to resuming normal kernel execution. If you + have paused the kernel for a lengthy period of time, applications + that rely on timely networking or anything to do with real wall clock + time could be adversely affected, so you should take this into + consideration when using the kernel debugger. + +Quick start for kdb using a keyboard connected console +------------------------------------------------------ + +This is a quick example of how to use kdb with a keyboard. + +1. Configure kgdboc at boot using kernel parameters:: + + kgdboc=kbd + + OR + + Configure kgdboc after the kernel has booted:: + + echo kbd > /sys/module/kgdboc/parameters/kgdboc + +2. Enter the kernel debugger manually or by waiting for an oops or + fault. There are several ways you can enter the kernel debugger + manually; all involve using the :kbd:`SysRq-G`, which means you must have + enabled ``CONFIG_MAGIC_SysRq=y`` in your kernel config. + + - When logged in as root or with a super user session you can run:: + + echo g > /proc/sysrq-trigger + + - Example using a laptop keyboard: + + Press and hold down: :kbd:`Alt` + + Press and hold down: :kbd:`Fn` + + Press and release the key with the label: :kbd:`SysRq` + + Release: :kbd:`Fn` + + Press and release: :kbd:`g` + + Release: :kbd:`Alt` + + - Example using a PS/2 101-key keyboard + + Press and hold down: :kbd:`Alt` + + Press and release the key with the label: :kbd:`SysRq` + + Press and release: :kbd:`g` + + Release: :kbd:`Alt` + +3. Now type in a kdb command such as ``help``, ``dmesg``, ``bt`` or ``go`` to + continue kernel execution. + +Using kgdb / gdb +================ + +In order to use kgdb you must activate it by passing configuration +information to one of the kgdb I/O drivers. If you do not pass any +configuration information kgdb will not do anything at all. Kgdb will +only actively hook up to the kernel trap hooks if a kgdb I/O driver is +loaded and configured. If you unconfigure a kgdb I/O driver, kgdb will +unregister all the kernel hook points. + +All kgdb I/O drivers can be reconfigured at run time, if +``CONFIG_SYSFS`` and ``CONFIG_MODULES`` are enabled, by echo'ing a new +config string to ``/sys/module/<driver>/parameter/<option>``. The driver +can be unconfigured by passing an empty string. You cannot change the +configuration while the debugger is attached. Make sure to detach the +debugger with the ``detach`` command prior to trying to unconfigure a +kgdb I/O driver. + +Connecting with gdb to a serial port +------------------------------------ + +1. Configure kgdboc + + Configure kgdboc at boot using kernel parameters:: + + kgdboc=ttyS0,115200 + + OR + + Configure kgdboc after the kernel has booted:: + + echo ttyS0 > /sys/module/kgdboc/parameters/kgdboc + +2. Stop kernel execution (break into the debugger) + + In order to connect to gdb via kgdboc, the kernel must first be + stopped. There are several ways to stop the kernel which include + using kgdbwait as a boot argument, via a :kbd:`SysRq-G`, or running the + kernel until it takes an exception where it waits for the debugger to + attach. + + - When logged in as root or with a super user session you can run:: + + echo g > /proc/sysrq-trigger + + - Example using minicom 2.2 + + Press: :kbd:`CTRL-A` :kbd:`f` :kbd:`g` + + - When you have telneted to a terminal server that supports sending + a remote break + + Press: :kbd:`CTRL-]` + + Type in: ``send break`` + + Press: :kbd:`Enter` :kbd:`g` + +3. Connect from gdb + + Example (using a directly connected port):: + + % gdb ./vmlinux + (gdb) set remotebaud 115200 + (gdb) target remote /dev/ttyS0 + + + Example (kgdb to a terminal server on TCP port 2012):: + + % gdb ./vmlinux + (gdb) target remote 192.168.2.2:2012 + + + Once connected, you can debug a kernel the way you would debug an + application program. + + If you are having problems connecting or something is going seriously + wrong while debugging, it will most often be the case that you want + to enable gdb to be verbose about its target communications. You do + this prior to issuing the ``target remote`` command by typing in:: + + set debug remote 1 + +Remember if you continue in gdb, and need to "break in" again, you need +to issue an other :kbd:`SysRq-G`. It is easy to create a simple entry point by +putting a breakpoint at ``sys_sync`` and then you can run ``sync`` from a +shell or script to break into the debugger. + +kgdb and kdb interoperability +============================= + +It is possible to transition between kdb and kgdb dynamically. The debug +core will remember which you used the last time and automatically start +in the same mode. + +Switching between kdb and kgdb +------------------------------ + +Switching from kgdb to kdb +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +There are two ways to switch from kgdb to kdb: you can use gdb to issue +a maintenance packet, or you can blindly type the command ``$3#33``. +Whenever the kernel debugger stops in kgdb mode it will print the +message ``KGDB or $3#33 for KDB``. It is important to note that you have +to type the sequence correctly in one pass. You cannot type a backspace +or delete because kgdb will interpret that as part of the debug stream. + +1. Change from kgdb to kdb by blindly typing:: + + $3#33 + +2. Change from kgdb to kdb with gdb:: + + maintenance packet 3 + + .. note:: + + Now you must kill gdb. Typically you press :kbd:`CTRL-Z` and issue + the command:: + + kill -9 % + +Change from kdb to kgdb +~~~~~~~~~~~~~~~~~~~~~~~ + +There are two ways you can change from kdb to kgdb. You can manually +enter kgdb mode by issuing the kgdb command from the kdb shell prompt, +or you can connect gdb while the kdb shell prompt is active. The kdb +shell looks for the typical first commands that gdb would issue with the +gdb remote protocol and if it sees one of those commands it +automatically changes into kgdb mode. + +1. From kdb issue the command:: + + kgdb + + Now disconnect your terminal program and connect gdb in its place + +2. At the kdb prompt, disconnect the terminal program and connect gdb in + its place. + +Running kdb commands from gdb +----------------------------- + +It is possible to run a limited set of kdb commands from gdb, using the +gdb monitor command. You don't want to execute any of the run control or +breakpoint operations, because it can disrupt the state of the kernel +debugger. You should be using gdb for breakpoints and run control +operations if you have gdb connected. The more useful commands to run +are things like lsmod, dmesg, ps or possibly some of the memory +information commands. To see all the kdb commands you can run +``monitor help``. + +Example:: + + (gdb) monitor ps + 1 idle process (state I) and + 27 sleeping system daemon (state M) processes suppressed, + use 'ps A' to see all. + Task Addr Pid Parent [*] cpu State Thread Command + + 0xc78291d0 1 0 0 0 S 0xc7829404 init + 0xc7954150 942 1 0 0 S 0xc7954384 dropbear + 0xc78789c0 944 1 0 0 S 0xc7878bf4 sh + (gdb) + +kgdb Test Suite +=============== + +When kgdb is enabled in the kernel config you can also elect to enable +the config parameter ``KGDB_TESTS``. Turning this on will enable a special +kgdb I/O module which is designed to test the kgdb internal functions. + +The kgdb tests are mainly intended for developers to test the kgdb +internals as well as a tool for developing a new kgdb architecture +specific implementation. These tests are not really for end users of the +Linux kernel. The primary source of documentation would be to look in +the ``drivers/misc/kgdbts.c`` file. + +The kgdb test suite can also be configured at compile time to run the +core set of tests by setting the kernel config parameter +``KGDB_TESTS_ON_BOOT``. This particular option is aimed at automated +regression testing and does not require modifying the kernel boot config +arguments. If this is turned on, the kgdb test suite can be disabled by +specifying ``kgdbts=`` as a kernel boot argument. + +Kernel Debugger Internals +========================= + +Architecture Specifics +---------------------- + +The kernel debugger is organized into a number of components: + +1. The debug core + + The debug core is found in ``kernel/debugger/debug_core.c``. It + contains: + + - A generic OS exception handler which includes sync'ing the + processors into a stopped state on an multi-CPU system. + + - The API to talk to the kgdb I/O drivers + + - The API to make calls to the arch-specific kgdb implementation + + - The logic to perform safe memory reads and writes to memory while + using the debugger + + - A full implementation for software breakpoints unless overridden + by the arch + + - The API to invoke either the kdb or kgdb frontend to the debug + core. + + - The structures and callback API for atomic kernel mode setting. + + .. note:: kgdboc is where the kms callbacks are invoked. + +2. kgdb arch-specific implementation + + This implementation is generally found in ``arch/*/kernel/kgdb.c``. As + an example, ``arch/x86/kernel/kgdb.c`` contains the specifics to + implement HW breakpoint as well as the initialization to dynamically + register and unregister for the trap handlers on this architecture. + The arch-specific portion implements: + + - contains an arch-specific trap catcher which invokes + :c:func:`kgdb_handle_exception` to start kgdb about doing its work + + - translation to and from gdb specific packet format to :c:type:`pt_regs` + + - Registration and unregistration of architecture specific trap + hooks + + - Any special exception handling and cleanup + + - NMI exception handling and cleanup + + - (optional) HW breakpoints + +3. gdbstub frontend (aka kgdb) + + The gdbstub is located in ``kernel/debug/gdbstub.c``. It contains: + + - All the logic to implement the gdb serial protocol + +4. kdb frontend + + The kdb debugger shell is broken down into a number of components. + The kdb core is located in kernel/debug/kdb. There are a number of + helper functions in some of the other kernel components to make it + possible for kdb to examine and report information about the kernel + without taking locks that could cause a kernel deadlock. The kdb core + contains implements the following functionality. + + - A simple shell + + - The kdb core command set + + - A registration API to register additional kdb shell commands. + + - A good example of a self-contained kdb module is the ``ftdump`` + command for dumping the ftrace buffer. See: + ``kernel/trace/trace_kdb.c`` + + - For an example of how to dynamically register a new kdb command + you can build the kdb_hello.ko kernel module from + ``samples/kdb/kdb_hello.c``. To build this example you can set + ``CONFIG_SAMPLES=y`` and ``CONFIG_SAMPLE_KDB=m`` in your kernel + config. Later run ``modprobe kdb_hello`` and the next time you + enter the kdb shell, you can run the ``hello`` command. + + - The implementation for :c:func:`kdb_printf` which emits messages directly + to I/O drivers, bypassing the kernel log. + + - SW / HW breakpoint management for the kdb shell + +5. kgdb I/O driver + + Each kgdb I/O driver has to provide an implementation for the + following: + + - configuration via built-in or module + + - dynamic configuration and kgdb hook registration calls + + - read and write character interface + + - A cleanup handler for unconfiguring from the kgdb core + + - (optional) Early debug methodology + + Any given kgdb I/O driver has to operate very closely with the + hardware and must do it in such a way that does not enable interrupts + or change other parts of the system context without completely + restoring them. The kgdb core will repeatedly "poll" a kgdb I/O + driver for characters when it needs input. The I/O driver is expected + to return immediately if there is no data available. Doing so allows + for the future possibility to touch watchdog hardware in such a way + as to have a target system not reset when these are enabled. + +If you are intent on adding kgdb architecture specific support for a new +architecture, the architecture should define ``HAVE_ARCH_KGDB`` in the +architecture specific Kconfig file. This will enable kgdb for the +architecture, and at that point you must create an architecture specific +kgdb implementation. + +There are a few flags which must be set on every architecture in their +``asm/kgdb.h`` file. These are: + +- ``NUMREGBYTES``: + The size in bytes of all of the registers, so that we + can ensure they will all fit into a packet. + +- ``BUFMAX``: + The size in bytes of the buffer GDB will read into. This must + be larger than NUMREGBYTES. + +- ``CACHE_FLUSH_IS_SAFE``: + Set to 1 if it is always safe to call + flush_cache_range or flush_icache_range. On some architectures, + these functions may not be safe to call on SMP since we keep other + CPUs in a holding pattern. + +There are also the following functions for the common backend, found in +``kernel/kgdb.c``, that must be supplied by the architecture-specific +backend unless marked as (optional), in which case a default function +maybe used if the architecture does not need to provide a specific +implementation. + +.. kernel-doc:: include/linux/kgdb.h + :internal: + +kgdboc internals +---------------- + +kgdboc and uarts +~~~~~~~~~~~~~~~~ + +The kgdboc driver is actually a very thin driver that relies on the +underlying low level to the hardware driver having "polling hooks" to +which the tty driver is attached. In the initial implementation of +kgdboc the serial_core was changed to expose a low level UART hook for +doing polled mode reading and writing of a single character while in an +atomic context. When kgdb makes an I/O request to the debugger, kgdboc +invokes a callback in the serial core which in turn uses the callback in +the UART driver. + +When using kgdboc with a UART, the UART driver must implement two +callbacks in the :c:type:`struct uart_ops <uart_ops>`. +Example from ``drivers/8250.c``:: + + + #ifdef CONFIG_CONSOLE_POLL + .poll_get_char = serial8250_get_poll_char, + .poll_put_char = serial8250_put_poll_char, + #endif + + +Any implementation specifics around creating a polling driver use the +``#ifdef CONFIG_CONSOLE_POLL``, as shown above. Keep in mind that +polling hooks have to be implemented in such a way that they can be +called from an atomic context and have to restore the state of the UART +chip on return such that the system can return to normal when the +debugger detaches. You need to be very careful with any kind of lock you +consider, because failing here is most likely going to mean pressing the +reset button. + +kgdboc and keyboards +~~~~~~~~~~~~~~~~~~~~~~~~ + +The kgdboc driver contains logic to configure communications with an +attached keyboard. The keyboard infrastructure is only compiled into the +kernel when ``CONFIG_KDB_KEYBOARD=y`` is set in the kernel configuration. + +The core polled keyboard driver driver for PS/2 type keyboards is in +``drivers/char/kdb_keyboard.c``. This driver is hooked into the debug core +when kgdboc populates the callback in the array called +:c:type:`kdb_poll_funcs[]`. The :c:func:`kdb_get_kbd_char` is the top-level +function which polls hardware for single character input. + +kgdboc and kms +~~~~~~~~~~~~~~~~~~ + +The kgdboc driver contains logic to request the graphics display to +switch to a text context when you are using ``kgdboc=kms,kbd``, provided +that you have a video driver which has a frame buffer console and atomic +kernel mode setting support. + +Every time the kernel debugger is entered it calls +:c:func:`kgdboc_pre_exp_handler` which in turn calls :c:func:`con_debug_enter` +in the virtual console layer. On resuming kernel execution, the kernel +debugger calls :c:func:`kgdboc_post_exp_handler` which in turn calls +:c:func:`con_debug_leave`. + +Any video driver that wants to be compatible with the kernel debugger +and the atomic kms callbacks must implement the ``mode_set_base_atomic``, +``fb_debug_enter`` and ``fb_debug_leave operations``. For the +``fb_debug_enter`` and ``fb_debug_leave`` the option exists to use the +generic drm fb helper functions or implement something custom for the +hardware. The following example shows the initialization of the +.mode_set_base_atomic operation in +drivers/gpu/drm/i915/intel_display.c:: + + + static const struct drm_crtc_helper_funcs intel_helper_funcs = { + [...] + .mode_set_base_atomic = intel_pipe_set_base_atomic, + [...] + }; + + +Here is an example of how the i915 driver initializes the +fb_debug_enter and fb_debug_leave functions to use the generic drm +helpers in ``drivers/gpu/drm/i915/intel_fb.c``:: + + + static struct fb_ops intelfb_ops = { + [...] + .fb_debug_enter = drm_fb_helper_debug_enter, + .fb_debug_leave = drm_fb_helper_debug_leave, + [...] + }; + + +Credits +======= + +The following people have contributed to this document: + +1. Amit Kale <amitkale@linsyssoft.com> + +2. Tom Rini <trini@kernel.crashing.org> + +In March 2008 this document was completely rewritten by: + +- Jason Wessel <jason.wessel@windriver.com> + +In Jan 2010 this document was updated to include kdb. + +- Jason Wessel <jason.wessel@windriver.com> diff --git a/Documentation/dev-tools/kmemleak.rst b/Documentation/dev-tools/kmemleak.rst index b2391b829169..cb8862659178 100644 --- a/Documentation/dev-tools/kmemleak.rst +++ b/Documentation/dev-tools/kmemleak.rst @@ -150,6 +150,7 @@ See the include/linux/kmemleak.h header for the functions prototype. - ``kmemleak_init`` - initialize kmemleak - ``kmemleak_alloc`` - notify of a memory block allocation - ``kmemleak_alloc_percpu`` - notify of a percpu memory block allocation +- ``kmemleak_vmalloc`` - notify of a vmalloc() memory allocation - ``kmemleak_free`` - notify of a memory block freeing - ``kmemleak_free_part`` - notify of a partial memory block freeing - ``kmemleak_free_percpu`` - notify of a percpu memory block freeing diff --git a/Documentation/kselftest.txt b/Documentation/dev-tools/kselftest.rst index 5bd590335839..ebd03d11d2c2 100644 --- a/Documentation/kselftest.txt +++ b/Documentation/dev-tools/kselftest.rst @@ -1,4 +1,6 @@ +====================== Linux Kernel Selftests +====================== The kernel contains a set of "self tests" under the tools/testing/selftests/ directory. These are intended to be small tests to exercise individual code @@ -15,28 +17,33 @@ hotplug test is run on 2% of hotplug capable memory instead of 10%. Running the selftests (hotplug tests are run in limited mode) ============================================================= -To build the tests: +To build the tests:: + $ make -C tools/testing/selftests +To run the tests:: -To run the tests: $ make -C tools/testing/selftests run_tests -To build and run the tests with a single command, use: +To build and run the tests with a single command, use:: + $ make kselftest -- note that some tests will require root privileges. +Note that some tests will require root privileges. Running a subset of selftests -======================================== +============================= + You can use the "TARGETS" variable on the make command line to specify single test to run, or a list of tests to run. -To run only tests targeted for a single subsystem: - $ make -C tools/testing/selftests TARGETS=ptrace run_tests +To run only tests targeted for a single subsystem:: + + $ make -C tools/testing/selftests TARGETS=ptrace run_tests + +You can specify multiple tests to build and run:: -You can specify multiple tests to build and run: $ make TARGETS="size timers" kselftest See the top-level tools/testing/selftests/Makefile for the list of all @@ -46,13 +53,15 @@ possible targets. Running the full range hotplug selftests ======================================== -To build the hotplug tests: +To build the hotplug tests:: + $ make -C tools/testing/selftests hotplug -To run the hotplug tests: +To run the hotplug tests:: + $ make -C tools/testing/selftests run_hotplug -- note that some tests will require root privileges. +Note that some tests will require root privileges. Install selftests @@ -62,11 +71,13 @@ You can use kselftest_install.sh tool installs selftests in default location which is tools/testing/selftests/kselftest or a user specified location. -To install selftests in default location: +To install selftests in default location:: + $ cd tools/testing/selftests $ ./kselftest_install.sh -To install selftests in a user specified location: +To install selftests in a user specified location:: + $ cd tools/testing/selftests $ ./kselftest_install.sh install_dir @@ -77,10 +88,10 @@ Kselftest install as well as the Kselftest tarball provide a script named "run_kselftest.sh" to run the tests. You can simply do the following to run the installed Kselftests.ย Please -note some tests will require root privileges. +note some tests will require root privileges:: -cd kselftest -./run_kselftest.sh + $ cd kselftest + $ ./run_kselftest.sh Contributing new tests ====================== @@ -96,14 +107,49 @@ In general, the rules for selftests are * Don't cause the top-level "make run_tests" to fail if your feature is unconfigured. -Contributing new tests(details) -=============================== +Contributing new tests (details) +================================ * Use TEST_GEN_XXX if such binaries or files are generated during compiling. + TEST_PROGS, TEST_GEN_PROGS mean it is the excutable tested by default. + TEST_PROGS_EXTENDED, TEST_GEN_PROGS_EXTENDED mean it is the executable which is not tested by default. TEST_FILES, TEST_GEN_FILES mean it is the file which is used by test. + +Test Harness +============ + +The kselftest_harness.h file contains useful helpers to build tests. The tests +from tools/testing/selftests/seccomp/seccomp_bpf.c can be used as example. + +Example +------- + +.. kernel-doc:: tools/testing/selftests/kselftest_harness.h + :doc: example + + +Helpers +------- + +.. kernel-doc:: tools/testing/selftests/kselftest_harness.h + :functions: TH_LOG TEST TEST_SIGNAL FIXTURE FIXTURE_DATA FIXTURE_SETUP + FIXTURE_TEARDOWN TEST_F TEST_HARNESS_MAIN + +Operators +--------- + +.. kernel-doc:: tools/testing/selftests/kselftest_harness.h + :doc: operators + +.. kernel-doc:: tools/testing/selftests/kselftest_harness.h + :functions: ASSERT_EQ ASSERT_NE ASSERT_LT ASSERT_LE ASSERT_GT ASSERT_GE + ASSERT_NULL ASSERT_TRUE ASSERT_NULL ASSERT_TRUE ASSERT_FALSE + ASSERT_STREQ ASSERT_STRNE EXPECT_EQ EXPECT_NE EXPECT_LT + EXPECT_LE EXPECT_GT EXPECT_GE EXPECT_NULL EXPECT_TRUE + EXPECT_FALSE EXPECT_STREQ EXPECT_STRNE diff --git a/Documentation/dev-tools/sparse.rst b/Documentation/dev-tools/sparse.rst index ffdcc97f6f5a..78aa00a604a0 100644 --- a/Documentation/dev-tools/sparse.rst +++ b/Documentation/dev-tools/sparse.rst @@ -103,9 +103,3 @@ have already built it. The optional make variable CF can be used to pass arguments to sparse. The build system passes -Wbitwise to sparse automatically. - -Checking RCU annotations -~~~~~~~~~~~~~~~~~~~~~~~~ - -RCU annotations are not checked by default. To enable RCU annotation -checks, include -DCONFIG_SPARSE_RCU_POINTER in your CF flags. diff --git a/Documentation/device-mapper/dm-zoned.txt b/Documentation/device-mapper/dm-zoned.txt new file mode 100644 index 000000000000..736fcc78d193 --- /dev/null +++ b/Documentation/device-mapper/dm-zoned.txt @@ -0,0 +1,144 @@ +dm-zoned +======== + +The dm-zoned device mapper target exposes a zoned block device (ZBC and +ZAC compliant devices) as a regular block device without any write +pattern constraints. In effect, it implements a drive-managed zoned +block device which hides from the user (a file system or an application +doing raw block device accesses) the sequential write constraints of +host-managed zoned block devices and can mitigate the potential +device-side performance degradation due to excessive random writes on +host-aware zoned block devices. + +For a more detailed description of the zoned block device models and +their constraints see (for SCSI devices): + +http://www.t10.org/drafts.htm#ZBC_Family + +and (for ATA devices): + +http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf + +The dm-zoned implementation is simple and minimizes system overhead (CPU +and memory usage as well as storage capacity loss). For a 10TB +host-managed disk with 256 MB zones, dm-zoned memory usage per disk +instance is at most 4.5 MB and as little as 5 zones will be used +internally for storing metadata and performaing reclaim operations. + +dm-zoned target devices are formatted and checked using the dmzadm +utility available at: + +https://github.com/hgst/dm-zoned-tools + +Algorithm +========= + +dm-zoned implements an on-disk buffering scheme to handle non-sequential +write accesses to the sequential zones of a zoned block device. +Conventional zones are used for caching as well as for storing internal +metadata. + +The zones of the device are separated into 2 types: + +1) Metadata zones: these are conventional zones used to store metadata. +Metadata zones are not reported as useable capacity to the user. + +2) Data zones: all remaining zones, the vast majority of which will be +sequential zones used exclusively to store user data. The conventional +zones of the device may be used also for buffering user random writes. +Data in these zones may be directly mapped to the conventional zone, but +later moved to a sequential zone so that the conventional zone can be +reused for buffering incoming random writes. + +dm-zoned exposes a logical device with a sector size of 4096 bytes, +irrespective of the physical sector size of the backend zoned block +device being used. This allows reducing the amount of metadata needed to +manage valid blocks (blocks written). + +The on-disk metadata format is as follows: + +1) The first block of the first conventional zone found contains the +super block which describes the on disk amount and position of metadata +blocks. + +2) Following the super block, a set of blocks is used to describe the +mapping of the logical device blocks. The mapping is done per chunk of +blocks, with the chunk size equal to the zoned block device size. The +mapping table is indexed by chunk number and each mapping entry +indicates the zone number of the device storing the chunk of data. Each +mapping entry may also indicate if the zone number of a conventional +zone used to buffer random modification to the data zone. + +3) A set of blocks used to store bitmaps indicating the validity of +blocks in the data zones follows the mapping table. A valid block is +defined as a block that was written and not discarded. For a buffered +data chunk, a block is always valid only in the data zone mapping the +chunk or in the buffer zone of the chunk. + +For a logical chunk mapped to a conventional zone, all write operations +are processed by directly writing to the zone. If the mapping zone is a +sequential zone, the write operation is processed directly only if the +write offset within the logical chunk is equal to the write pointer +offset within of the sequential data zone (i.e. the write operation is +aligned on the zone write pointer). Otherwise, write operations are +processed indirectly using a buffer zone. In that case, an unused +conventional zone is allocated and assigned to the chunk being +accessed. Writing a block to the buffer zone of a chunk will +automatically invalidate the same block in the sequential zone mapping +the chunk. If all blocks of the sequential zone become invalid, the zone +is freed and the chunk buffer zone becomes the primary zone mapping the +chunk, resulting in native random write performance similar to a regular +block device. + +Read operations are processed according to the block validity +information provided by the bitmaps. Valid blocks are read either from +the sequential zone mapping a chunk, or if the chunk is buffered, from +the buffer zone assigned. If the accessed chunk has no mapping, or the +accessed blocks are invalid, the read buffer is zeroed and the read +operation terminated. + +After some time, the limited number of convnetional zones available may +be exhausted (all used to map chunks or buffer sequential zones) and +unaligned writes to unbuffered chunks become impossible. To avoid this +situation, a reclaim process regularly scans used conventional zones and +tries to reclaim the least recently used zones by copying the valid +blocks of the buffer zone to a free sequential zone. Once the copy +completes, the chunk mapping is updated to point to the sequential zone +and the buffer zone freed for reuse. + +Metadata Protection +=================== + +To protect metadata against corruption in case of sudden power loss or +system crash, 2 sets of metadata zones are used. One set, the primary +set, is used as the main metadata region, while the secondary set is +used as a staging area. Modified metadata is first written to the +secondary set and validated by updating the super block in the secondary +set, a generation counter is used to indicate that this set contains the +newest metadata. Once this operation completes, in place of metadata +block updates can be done in the primary metadata set. This ensures that +one of the set is always consistent (all modifications committed or none +at all). Flush operations are used as a commit point. Upon reception of +a flush request, metadata modification activity is temporarily blocked +(for both incoming BIO processing and reclaim process) and all dirty +metadata blocks are staged and updated. Normal operation is then +resumed. Flushing metadata thus only temporarily delays write and +discard requests. Read requests can be processed concurrently while +metadata flush is being executed. + +Usage +===== + +A zoned block device must first be formatted using the dmzadm tool. This +will analyze the device zone configuration, determine where to place the +metadata sets on the device and initialize the metadata sets. + +Ex: + +dmzadm --format /dev/sdxx + +For a formatted device, the target can be created normally with the +dmsetup utility. The only parameter that dm-zoned requires is the +underlying zoned block device name. Ex: + +echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | dmsetup create dmz-`basename ${dev}` diff --git a/Documentation/devicetree/bindings/arm/actions.txt b/Documentation/devicetree/bindings/arm/actions.txt new file mode 100644 index 000000000000..3bc7ea575564 --- /dev/null +++ b/Documentation/devicetree/bindings/arm/actions.txt @@ -0,0 +1,39 @@ +Actions Semi platforms device tree bindings +------------------------------------------- + + +S500 SoC +======== + +Required root node properties: + + - compatible : must contain "actions,s500" + + +Modules: + +Root node property compatible must contain, depending on module: + + - LeMaker Guitar: "lemaker,guitar" + + +Boards: + +Root node property compatible must contain, depending on board: + + - LeMaker Guitar Base Board rev. B: "lemaker,guitar-bb-rev-b", "lemaker,guitar" + + +S900 SoC +======== + +Required root node properties: + +- compatible : must contain "actions,s900" + + +Boards: + +Root node property compatible must contain, depending on board: + + - uCRobotics Bubblegum-96: "ucrobotics,bubblegum-96" diff --git a/Documentation/devicetree/bindings/arm/amlogic.txt b/Documentation/devicetree/bindings/arm/amlogic.txt index bfd5b558477d..0fff40a6330d 100644 --- a/Documentation/devicetree/bindings/arm/amlogic.txt +++ b/Documentation/devicetree/bindings/arm/amlogic.txt @@ -29,26 +29,35 @@ Boards with the Amlogic Meson GXM S912 SoC shall have the following properties: Required root node property: compatible: "amlogic,s912", "amlogic,meson-gxm"; -Board compatible values: +Board compatible values (alphabetically, grouped by SoC): + - "geniatech,atv1200" (Meson6) + - "minix,neo-x8" (Meson8) - - "tronfy,mxq" (Meson8b) + - "hardkernel,odroid-c1" (Meson8b) + - "tronfy,mxq" (Meson8b) + + - "amlogic,p200" (Meson gxbb) + - "amlogic,p201" (Meson gxbb) + - "friendlyarm,nanopi-k2" (Meson gxbb) + - "hardkernel,odroid-c2" (Meson gxbb) + - "nexbox,a95x" (Meson gxbb or Meson gxl s905x) - "tronsmart,vega-s95-pro", "tronsmart,vega-s95" (Meson gxbb) - "tronsmart,vega-s95-meta", "tronsmart,vega-s95" (Meson gxbb) - "tronsmart,vega-s95-telos", "tronsmart,vega-s95" (Meson gxbb) - - "hardkernel,odroid-c2" (Meson gxbb) - - "amlogic,p200" (Meson gxbb) - - "amlogic,p201" (Meson gxbb) - "wetek,hub" (Meson gxbb) - "wetek,play2" (Meson gxbb) + - "amlogic,p212" (Meson gxl s905x) + - "hwacom,amazetv" (Meson gxl s905x) - "khadas,vim" (Meson gxl s905x) + - "libretech,cc" (Meson gxl s905x) - "amlogic,p230" (Meson gxl s905d) - "amlogic,p231" (Meson gxl s905d) - - "hwacom,amazetv" (Meson gxl s905x) + - "amlogic,q200" (Meson gxm s912) - "amlogic,q201" (Meson gxm s912) - - "nexbox,a95x" (Meson gxbb or Meson gxl s905x) + - "kingnovel,r-box-pro" (Meson gxm S912) - "nexbox,a1" (Meson gxm s912) diff --git a/Documentation/devicetree/bindings/arm/atmel-at91.txt b/Documentation/devicetree/bindings/arm/atmel-at91.txt index 799af90dd75b..91cb8e4f2a4f 100644 --- a/Documentation/devicetree/bindings/arm/atmel-at91.txt +++ b/Documentation/devicetree/bindings/arm/atmel-at91.txt @@ -41,6 +41,36 @@ compatible: must be one of: - "atmel,sama5d43" - "atmel,sama5d44" + * "atmel,samv7" for MCUs using a Cortex-M7, shall be extended with the specific + SoC family: + o "atmel,sams70" shall be extended with the specific MCU compatible: + - "atmel,sams70j19" + - "atmel,sams70j20" + - "atmel,sams70j21" + - "atmel,sams70n19" + - "atmel,sams70n20" + - "atmel,sams70n21" + - "atmel,sams70q19" + - "atmel,sams70q20" + - "atmel,sams70q21" + o "atmel,samv70" shall be extended with the specific MCU compatible: + - "atmel,samv70j19" + - "atmel,samv70j20" + - "atmel,samv70n19" + - "atmel,samv70n20" + - "atmel,samv70q19" + - "atmel,samv70q20" + o "atmel,samv71" shall be extended with the specific MCU compatible: + - "atmel,samv71j19" + - "atmel,samv71j20" + - "atmel,samv71j21" + - "atmel,samv71n19" + - "atmel,samv71n20" + - "atmel,samv71n21" + - "atmel,samv71q19" + - "atmel,samv71q20" + - "atmel,samv71q21" + Chipid required properties: - compatible: Should be "atmel,sama5d2-chipid" - reg : Should contain registers location and length diff --git a/Documentation/devicetree/bindings/arm/bcm/brcm,stingray.txt b/Documentation/devicetree/bindings/arm/bcm/brcm,stingray.txt new file mode 100644 index 000000000000..23a02178dd44 --- /dev/null +++ b/Documentation/devicetree/bindings/arm/bcm/brcm,stingray.txt @@ -0,0 +1,12 @@ +Broadcom Stingray device tree bindings +------------------------------------------------ + +Boards with Stingray shall have the following properties: + +Required root node property: + +Stingray Combo SVK board +compatible = "brcm,bcm958742k", "brcm,stingray"; + +Stingray SST100 board +compatible = "brcm,bcm958742t", "brcm,stingray"; diff --git a/Documentation/devicetree/bindings/arm/cci.txt b/Documentation/devicetree/bindings/arm/cci.txt index 0f2153e8fa7e..9600761f2d5b 100644 --- a/Documentation/devicetree/bindings/arm/cci.txt +++ b/Documentation/devicetree/bindings/arm/cci.txt @@ -11,13 +11,6 @@ clusters, through memory mapped interface, with a global control register space and multiple sets of interface control registers, one per slave interface. -Bindings for the CCI node follow the ePAPR standard, available from: - -www.power.org/documentation/epapr-version-1-1/ - -with the addition of the bindings described in this document which are -specific to ARM. - * CCI interconnect node Description: Describes a CCI cache coherent Interconnect component @@ -50,10 +43,10 @@ specific to ARM. as a tuple of cells, containing child address, parent address and the size of the region in the child address space. - Definition: A standard property. Follow rules in the ePAPR for - hierarchical bus addressing. CCI interfaces - addresses refer to the parent node addressing - scheme to declare their register bases. + Definition: A standard property. Follow rules in the Devicetree + Specification for hierarchical bus addressing. CCI + interfaces addresses refer to the parent node + addressing scheme to declare their register bases. CCI interconnect node can define the following child nodes: diff --git a/Documentation/devicetree/bindings/arm/ccn.txt b/Documentation/devicetree/bindings/arm/ccn.txt index b100d3847d88..29801456c9ee 100644 --- a/Documentation/devicetree/bindings/arm/ccn.txt +++ b/Documentation/devicetree/bindings/arm/ccn.txt @@ -3,6 +3,7 @@ Required properties: - compatible: (standard compatible string) should be one of: + "arm,ccn-502" "arm,ccn-504" "arm,ccn-508" diff --git a/Documentation/devicetree/bindings/arm/coresight-cpu-debug.txt b/Documentation/devicetree/bindings/arm/coresight-cpu-debug.txt new file mode 100644 index 000000000000..298291211ea4 --- /dev/null +++ b/Documentation/devicetree/bindings/arm/coresight-cpu-debug.txt @@ -0,0 +1,49 @@ +* CoreSight CPU Debug Component: + +CoreSight CPU debug component are compliant with the ARMv8 architecture +reference manual (ARM DDI 0487A.k) Chapter 'Part H: External debug'. The +external debug module is mainly used for two modes: self-hosted debug and +external debug, and it can be accessed from mmio region from Coresight +and eventually the debug module connects with CPU for debugging. And the +debug module provides sample-based profiling extension, which can be used +to sample CPU program counter, secure state and exception level, etc; +usually every CPU has one dedicated debug module to be connected. + +Required properties: + +- compatible : should be "arm,coresight-cpu-debug"; supplemented with + "arm,primecell" since this driver is using the AMBA bus + interface. + +- reg : physical base address and length of the register set. + +- clocks : the clock associated to this component. + +- clock-names : the name of the clock referenced by the code. Since we are + using the AMBA framework, the name of the clock providing + the interconnect should be "apb_pclk" and the clock is + mandatory. The interface between the debug logic and the + processor core is clocked by the internal CPU clock, so it + is enabled with CPU clock by default. + +- cpu : the CPU phandle the debug module is affined to. When omitted + the module is considered to belong to CPU0. + +Optional properties: + +- power-domains: a phandle to the debug power domain. We use "power-domains" + binding to turn on the debug logic if it has own dedicated + power domain and if necessary to use "cpuidle.off=1" or + "nohlt" in the kernel command line or sysfs node to + constrain idle states to ensure registers in the CPU power + domain are accessible. + +Example: + + debug@f6590000 { + compatible = "arm,coresight-cpu-debug","arm,primecell"; + reg = <0 0xf6590000 0 0x1000>; + clocks = <&sys_ctrl HI6220_DAPB_CLK>; + clock-names = "apb_pclk"; + cpu = <&cpu0>; + }; diff --git a/Documentation/devicetree/bindings/arm/cpus.txt b/Documentation/devicetree/bindings/arm/cpus.txt index 1030f5f50207..a44253cad269 100644 --- a/Documentation/devicetree/bindings/arm/cpus.txt +++ b/Documentation/devicetree/bindings/arm/cpus.txt @@ -6,9 +6,9 @@ The device tree allows to describe the layout of CPUs in a system through the "cpus" node, which in turn contains a number of subnodes (ie "cpu") defining properties for every cpu. -Bindings for CPU nodes follow the ePAPR v1.1 standard, available from: +Bindings for CPU nodes follow the Devicetree Specification, available from: -https://www.power.org/documentation/epapr-version-1-1/ +https://www.devicetree.org/specifications/ with updates for 32-bit and 64-bit ARM systems provided in this document. @@ -16,8 +16,8 @@ with updates for 32-bit and 64-bit ARM systems provided in this document. Convention used in this document ================================ -This document follows the conventions described in the ePAPR v1.1, with -the addition: +This document follows the conventions described in the Devicetree +Specification, with the addition: - square brackets define bitfields, eg reg[7:0] value of the bitfield in the reg property contained in bits 7 down to 0 @@ -26,8 +26,9 @@ the addition: cpus and cpu node bindings definition ===================================== -The ARM architecture, in accordance with the ePAPR, requires the cpus and cpu -nodes to be present and contain the properties described below. +The ARM architecture, in accordance with the Devicetree Specification, +requires the cpus and cpu nodes to be present and contain the properties +described below. - cpus node @@ -193,6 +194,7 @@ nodes to be present and contain the properties described below. "spin-table" # On ARM 32-bit systems this property is optional and can be one of: + "actions,s500-smp" "allwinner,sun6i-a31" "allwinner,sun8i-a23" "arm,realview-smp" @@ -249,7 +251,7 @@ nodes to be present and contain the properties described below. Usage: Optional Value type: <u32> Definition: - # u32 value representing CPU capacity [3] in + # u32 value representing CPU capacity [4] in DMIPS/MHz, relative to highest capacity-dmips-mhz in the system. @@ -476,5 +478,5 @@ cpus { [2] arm/msm/qcom,kpss-acc.txt [3] ARM Linux kernel documentation - idle states bindings Documentation/devicetree/bindings/arm/idle-states.txt -[3] ARM Linux kernel documentation - cpu capacity bindings +[4] ARM Linux kernel documentation - cpu capacity bindings Documentation/devicetree/bindings/arm/cpu-capacity.txt diff --git a/Documentation/devicetree/bindings/arm/gemini.txt b/Documentation/devicetree/bindings/arm/gemini.txt index 0041eb031116..55bf7ce96c44 100644 --- a/Documentation/devicetree/bindings/arm/gemini.txt +++ b/Documentation/devicetree/bindings/arm/gemini.txt @@ -24,6 +24,19 @@ Required nodes: global control registers, with the compatible string "cortina,gemini-syscon", "syscon"; + Required properties on the syscon: + - reg: syscon register location and size. + - #clock-cells: should be set to <1> - the system controller is also a + clock provider. + - #reset-cells: should be set to <1> - the system controller is also a + reset line provider. + + The clock sources have shorthand defines in the include file: + <dt-bindings/clock/cortina,gemini-clock.h> + + The reset lines have shorthand defines in the include file: + <dt-bindings/reset/cortina,gemini-reset.h> + - timer: the soc bus node must have a timer node pointing to the SoC timer block, with the compatible string "cortina,gemini-timer" See: clocksource/cortina,gemini-timer.txt @@ -56,12 +69,15 @@ Example: syscon: syscon@40000000 { compatible = "cortina,gemini-syscon", "syscon"; reg = <0x40000000 0x1000>; + #clock-cells = <1>; + #reset-cells = <1>; }; uart0: serial@42000000 { compatible = "ns16550a"; reg = <0x42000000 0x100>; - clock-frequency = <48000000>; + resets = <&syscon GEMINI_RESET_UART>; + clocks = <&syscon GEMINI_CLK_UART>; interrupts = <18 IRQ_TYPE_LEVEL_HIGH>; reg-shift = <2>; }; @@ -73,12 +89,18 @@ Example: interrupts = <14 IRQ_TYPE_EDGE_FALLING>, /* Timer 1 */ <15 IRQ_TYPE_EDGE_FALLING>, /* Timer 2 */ <16 IRQ_TYPE_EDGE_FALLING>; /* Timer 3 */ + resets = <&syscon GEMINI_RESET_TIMER>; + /* APB clock or RTC clock */ + clocks = <&syscon GEMINI_CLK_APB>, + <&syscon GEMINI_CLK_RTC>; + clock-names = "PCLK", "EXTCLK"; syscon = <&syscon>; }; intcon: interrupt-controller@48000000 { compatible = "cortina,gemini-interrupt-controller"; reg = <0x48000000 0x1000>; + resets = <&syscon GEMINI_RESET_INTCON0>; interrupt-controller; #interrupt-cells = <2>; }; diff --git a/Documentation/devicetree/bindings/arm/hisilicon/hisilicon.txt b/Documentation/devicetree/bindings/arm/hisilicon/hisilicon.txt index 2e732152064b..7111fbc82a4e 100644 --- a/Documentation/devicetree/bindings/arm/hisilicon/hisilicon.txt +++ b/Documentation/devicetree/bindings/arm/hisilicon/hisilicon.txt @@ -4,6 +4,10 @@ Hi3660 SoC Required root node properties: - compatible = "hisilicon,hi3660"; +HiKey960 Board +Required root node properties: + - compatible = "hisilicon,hi3660-hikey960", "hisilicon,hi3660"; + Hi3798cv200 SoC Required root node properties: - compatible = "hisilicon,hi3798cv200"; diff --git a/Documentation/devicetree/bindings/arm/idle-states.txt b/Documentation/devicetree/bindings/arm/idle-states.txt index b8e41c148a3c..7a591333f2b1 100644 --- a/Documentation/devicetree/bindings/arm/idle-states.txt +++ b/Documentation/devicetree/bindings/arm/idle-states.txt @@ -695,5 +695,5 @@ cpus { [4] ARM Architecture Reference Manuals http://infocenter.arm.com/help/index.jsp -[5] ePAPR standard - https://www.power.org/documentation/epapr-version-1-1/ +[5] Devicetree Specification + https://www.devicetree.org/specifications/ diff --git a/Documentation/devicetree/bindings/arm/keystone/keystone.txt b/Documentation/devicetree/bindings/arm/keystone/keystone.txt index 48f6703a28c8..f310bad04483 100644 --- a/Documentation/devicetree/bindings/arm/keystone/keystone.txt +++ b/Documentation/devicetree/bindings/arm/keystone/keystone.txt @@ -37,3 +37,6 @@ Boards: - K2G EVM compatible = "ti,k2g-evm", "ti,k2g", "ti-keystone" + +- K2G Industrial Communication Engine EVM + compatible = "ti,k2g-ice", "ti,k2g", "ti-keystone" diff --git a/Documentation/devicetree/bindings/arm/l2c2x0.txt b/Documentation/devicetree/bindings/arm/l2c2x0.txt index d9650c1788f4..fbe6cb21f4cf 100644 --- a/Documentation/devicetree/bindings/arm/l2c2x0.txt +++ b/Documentation/devicetree/bindings/arm/l2c2x0.txt @@ -4,8 +4,8 @@ ARM cores often have a separate L2C210/L2C220/L2C310 (also known as PL210/PL220/ PL310 and variants) based level 2 cache controller. All these various implementations of the L2 cache controller have compatible programming models (Note 1). Some of the properties that are just prefixed "cache-*" are taken from section -3.7.3 of the ePAPR v1.1 specification which can be found at: -https://www.power.org/wp-content/uploads/2012/06/Power_ePAPR_APPROVED_v1.1.pdf +3.7.3 of the Devicetree Specification which can be found at: +https://www.devicetree.org/specifications/ The ARM L2 cache representation in the device tree should be done as follows: diff --git a/Documentation/devicetree/bindings/arm/marvell/ap806-system-controller.txt b/Documentation/devicetree/bindings/arm/marvell/ap806-system-controller.txt index 8968371d84e2..0b887440e08a 100644 --- a/Documentation/devicetree/bindings/arm/marvell/ap806-system-controller.txt +++ b/Documentation/devicetree/bindings/arm/marvell/ap806-system-controller.txt @@ -7,6 +7,14 @@ registers giving access to numerous features: clocks, pin-muxing and many other SoC configuration items. This DT binding allows to describe this system controller. +For the top level node: + - compatible: must be: "syscon", "simple-mfd"; + - reg: register area of the AP806 system controller + +Clocks: +------- + + The Device Tree node representing the AP806 system controller provides a number of clocks: @@ -17,19 +25,76 @@ a number of clocks: Required properties: - - compatible: must be: - "marvell,ap806-system-controller", "syscon" - - reg: register area of the AP806 system controller + - compatible: must be: "marvell,ap806-clock" - #clock-cells: must be set to 1 - - clock-output-names: must be defined to: - "ap-cpu-cluster-0", "ap-cpu-cluster-1", "ap-fixed", "ap-mss" + +Pinctrl: +-------- + +For common binding part and usage, refer to +Documentation/devicetree/bindings/pinctrl/marvell,mvebu-pinctrl.txt. + +Required properties: +- compatible must be "marvell,ap806-pinctrl", + +Available mpp pins/groups and functions: +Note: brackets (x) are not part of the mpp name for marvell,function and given +only for more detailed description in this document. + +name pins functions +================================================================================ +mpp0 0 gpio, sdio(clk), spi0(clk) +mpp1 1 gpio, sdio(cmd), spi0(miso) +mpp2 2 gpio, sdio(d0), spi0(mosi) +mpp3 3 gpio, sdio(d1), spi0(cs0n) +mpp4 4 gpio, sdio(d2), i2c0(sda) +mpp5 5 gpio, sdio(d3), i2c0(sdk) +mpp6 6 gpio, sdio(ds) +mpp7 7 gpio, sdio(d4), uart1(rxd) +mpp8 8 gpio, sdio(d5), uart1(txd) +mpp9 9 gpio, sdio(d6), spi0(cs1n) +mpp10 10 gpio, sdio(d7) +mpp11 11 gpio, uart0(txd) +mpp12 12 gpio, sdio(pw_off), sdio(hw_rst) +mpp13 13 gpio +mpp14 14 gpio +mpp15 15 gpio +mpp16 16 gpio +mpp17 17 gpio +mpp18 18 gpio +mpp19 19 gpio, uart0(rxd), sdio(pw_off) + +GPIO: +----- +For common binding part and usage, refer to +Documentation/devicetree/bindings/gpio/gpio-mvebu.txt. + +Required properties: + +- compatible: "marvell,armada-8k-gpio" + +- offset: offset address inside the syscon block Example: +ap_syscon: system-controller@6f4000 { + compatible = "syscon", "simple-mfd"; + reg = <0x6f4000 0x1000>; - syscon: system-controller@6f4000 { - compatible = "marvell,ap806-system-controller", "syscon"; + ap_clk: clock { + compatible = "marvell,ap806-clock"; #clock-cells = <1>; - clock-output-names = "ap-cpu-cluster-0", "ap-cpu-cluster-1", - "ap-fixed", "ap-mss"; - reg = <0x6f4000 0x1000>; }; + + ap_pinctrl: pinctrl { + compatible = "marvell,ap806-pinctrl"; + }; + + ap_gpio: gpio { + compatible = "marvell,armada-8k-gpio"; + offset = <0x1040>; + ngpios = <19>; + gpio-controller; + #gpio-cells = <2>; + gpio-ranges = <&ap_pinctrl 0 0 19>; + }; +}; diff --git a/Documentation/devicetree/bindings/arm/marvell/cp110-system-controller0.txt b/Documentation/devicetree/bindings/arm/marvell/cp110-system-controller0.txt index 07dbb358182c..171d02cadea4 100644 --- a/Documentation/devicetree/bindings/arm/marvell/cp110-system-controller0.txt +++ b/Documentation/devicetree/bindings/arm/marvell/cp110-system-controller0.txt @@ -7,6 +7,13 @@ Controller 0 and System Controller 1. This Device Tree binding allows to describe the first system controller, which provides registers to configure various aspects of the SoC. +For the top level node: + - compatible: must be: "syscon", "simple-mfd"; + - reg: register area of the CP110 system controller 0 + +Clocks: +------- + The Device Tree node representing this System Controller 0 provides a number of clocks: @@ -27,6 +34,7 @@ The following clocks are available: - 0 2 EIP - 0 3 Core - 0 4 NAND core + - 0 5 SDIO core - Gatable clocks - 1 0 Audio - 1 1 Comm Unit @@ -56,28 +64,126 @@ The following clocks are available: Required properties: - compatible: must be: - "marvell,cp110-system-controller0", "syscon"; - - reg: register area of the CP110 system controller 0 + "marvell,cp110-clock" - #clock-cells: must be set to 2 - - core-clock-output-names must be set to: - "cpm-apll", "cpm-ppv2-core", "cpm-eip", "cpm-core", "cpm-nand-core" - - gate-clock-output-names must be set to: - "cpm-audio", "cpm-communit", "cpm-nand", "cpm-ppv2", "cpm-sdio", - "cpm-mg-domain", "cpm-mg-core", "cpm-xor1", "cpm-xor0", "cpm-gop-dp", "none", - "cpm-pcie_x10", "cpm-pcie_x11", "cpm-pcie_x4", "cpm-pcie-xor", "cpm-sata", - "cpm-sata-usb", "cpm-main", "cpm-sd-mmc-gop", "none", "none", "cpm-slow-io", - "cpm-usb3h0", "cpm-usb3h1", "cpm-usb3dev", "cpm-eip150", "cpm-eip197"; + +Pinctrl: +-------- + +For common binding part and usage, refer to the file +Documentation/devicetree/bindings/pinctrl/marvell,mvebu-pinctrl.txt. + +Required properties: + +- compatible: "marvell,armada-7k-pinctrl", + "marvell,armada-8k-cpm-pinctrl" or "marvell,armada-8k-cps-pinctrl" + depending on the specific variant of the SoC being used. + +Available mpp pins/groups and functions: +Note: brackets (x) are not part of the mpp name for marvell,function and given +only for more detailed description in this document. + +name pins functions +================================================================================ +mpp0 0 gpio, dev(ale1), au(i2smclk), ge0(rxd3), tdm(pclk), ptp(pulse), mss_i2c(sda), uart0(rxd), sata0(present_act), ge(mdio) +mpp1 1 gpio, dev(ale0), au(i2sdo_spdifo), ge0(rxd2), tdm(drx), ptp(clk), mss_i2c(sck), uart0(txd), sata1(present_act), ge(mdc) +mpp2 2 gpio, dev(ad15), au(i2sextclk), ge0(rxd1), tdm(dtx), mss_uart(rxd), ptp(pclk_out), i2c1(sck), uart1(rxd), sata0(present_act), xg(mdc) +mpp3 3 gpio, dev(ad14), au(i2slrclk), ge0(rxd0), tdm(fsync), mss_uart(txd), pcie(rstoutn), i2c1(sda), uart1(txd), sata1(present_act), xg(mdio) +mpp4 4 gpio, dev(ad13), au(i2sbclk), ge0(rxctl), tdm(rstn), mss_uart(rxd), uart1(cts), pcie0(clkreq), uart3(rxd), ge(mdc) +mpp5 5 gpio, dev(ad12), au(i2sdi), ge0(rxclk), tdm(intn), mss_uart(txd), uart1(rts), pcie1(clkreq), uart3(txd), ge(mdio) +mpp6 6 gpio, dev(ad11), ge0(txd3), spi0(csn2), au(i2sextclk), sata1(present_act), pcie2(clkreq), uart0(rxd), ptp(pulse) +mpp7 7 gpio, dev(ad10), ge0(txd2), spi0(csn1), spi1(csn1), sata0(present_act), led(data), uart0(txd), ptp(clk) +mpp8 8 gpio, dev(ad9), ge0(txd1), spi0(csn0), spi1(csn0), uart0(cts), led(stb), uart2(rxd), ptp(pclk_out), synce1(clk) +mpp9 9 gpio, dev(ad8), ge0(txd0), spi0(mosi), spi1(mosi), pcie(rstoutn), synce2(clk) +mpp10 10 gpio, dev(readyn), ge0(txctl), spi0(miso), spi1(miso), uart0(cts), sata1(present_act) +mpp11 11 gpio, dev(wen1), ge0(txclkout), spi0(clk), spi1(clk), uart0(rts), led(clk), uart2(txd), sata0(present_act) +mpp12 12 gpio, dev(clk_out), nf(rbn1), spi1(csn1), ge0(rxclk) +mpp13 13 gpio, dev(burstn), nf(rbn0), spi1(miso), ge0(rxctl), mss_spi(miso) +mpp14 14 gpio, dev(bootcsn), dev(csn0), spi1(csn0), spi0(csn3), au(i2sextclk), spi0(miso), sata0(present_act), mss_spi(csn) +mpp15 15 gpio, dev(ad7), spi1(mosi), spi0(mosi), mss_spi(mosi), ptp(pulse_cp2cp) +mpp16 16 gpio, dev(ad6), spi1(clk), mss_spi(clk) +mpp17 17 gpio, dev(ad5), ge0(txd3) +mpp18 18 gpio, dev(ad4), ge0(txd2), ptp(clk_cp2cp) +mpp19 19 gpio, dev(ad3), ge0(txd1), wakeup(out_cp2cp) +mpp20 20 gpio, dev(ad2), ge0(txd0) +mpp21 21 gpio, dev(ad1), ge0(txctl), sei(in_cp2cp) +mpp22 22 gpio, dev(ad0), ge0(txclkout), wakeup(in_cp2cp) +mpp23 23 gpio, dev(a1), au(i2smclk), link(rd_in_cp2cp) +mpp24 24 gpio, dev(a0), au(i2slrclk) +mpp25 25 gpio, dev(oen), au(i2sdo_spdifo) +mpp26 26 gpio, dev(wen0), au(i2sbclk) +mpp27 27 gpio, dev(csn0), spi1(miso), mss_gpio4, ge0(rxd3), spi0(csn4), ge(mdio), sata0(present_act), uart0(rts), rei(in_cp2cp) +mpp28 28 gpio, dev(csn1), spi1(csn0), mss_gpio5, ge0(rxd2), spi0(csn5), pcie2(clkreq), ptp(pulse), ge(mdc), sata1(present_act), uart0(cts), led(data) +mpp29 29 gpio, dev(csn2), spi1(mosi), mss_gpio6, ge0(rxd1), spi0(csn6), pcie1(clkreq), ptp(clk), mss_i2c(sda), sata0(present_act), uart0(rxd), led(stb) +mpp30 30 gpio, dev(csn3), spi1(clk), mss_gpio7, ge0(rxd0), spi0(csn7), pcie0(clkreq), ptp(pclk_out), mss_i2c(sck), sata1(present_act), uart0(txd), led(clk) +mpp31 31 gpio, dev(a2), mss_gpio4, pcie(rstoutn), ge(mdc) +mpp32 32 gpio, mii(col), mii(txerr), mss_spi(miso), tdm(drx), au(i2sextclk), au(i2sdi), ge(mdio), sdio(v18_en), pcie1(clkreq), mss_gpio0 +mpp33 33 gpio, mii(txclk), sdio(pwr10), mss_spi(csn), tdm(fsync), au(i2smclk), sdio(bus_pwr), xg(mdio), pcie2(clkreq), mss_gpio1 +mpp34 34 gpio, mii(rxerr), sdio(pwr11), mss_spi(mosi), tdm(dtx), au(i2slrclk), sdio(wr_protect), ge(mdc), pcie0(clkreq), mss_gpio2 +mpp35 35 gpio, sata1(present_act), i2c1(sda), mss_spi(clk), tdm(pclk), au(i2sdo_spdifo), sdio(card_detect), xg(mdio), ge(mdio), pcie(rstoutn), mss_gpio3 +mpp36 36 gpio, synce2(clk), i2c1(sck), ptp(clk), synce1(clk), au(i2sbclk), sata0(present_act), xg(mdc), ge(mdc), pcie2(clkreq), mss_gpio5 +mpp37 37 gpio, uart2(rxd), i2c0(sck), ptp(pclk_out), tdm(intn), mss_i2c(sck), sata1(present_act), ge(mdc), xg(mdc), pcie1(clkreq), mss_gpio6, link(rd_out_cp2cp) +mpp38 38 gpio, uart2(txd), i2c0(sda), ptp(pulse), tdm(rstn), mss_i2c(sda), sata0(present_act), ge(mdio), xg(mdio), au(i2sextclk), mss_gpio7, ptp(pulse_cp2cp) +mpp39 39 gpio, sdio(wr_protect), au(i2sbclk), ptp(clk), spi0(csn1), sata1(present_act), mss_gpio0 +mpp40 40 gpio, sdio(pwr11), synce1(clk), mss_i2c(sda), au(i2sdo_spdifo), ptp(pclk_out), spi0(clk), uart1(txd), ge(mdio), sata0(present_act), mss_gpio1 +mpp41 41 gpio, sdio(pwr10), sdio(bus_pwr), mss_i2c(sck), au(i2slrclk), ptp(pulse), spi0(mosi), uart1(rxd), ge(mdc), sata1(present_act), mss_gpio2, rei(out_cp2cp) +mpp42 42 gpio, sdio(v18_en), sdio(wr_protect), synce2(clk), au(i2smclk), mss_uart(txd), spi0(miso), uart1(cts), xg(mdc), sata0(present_act), mss_gpio4 +mpp43 43 gpio, sdio(card_detect), synce1(clk), au(i2sextclk), mss_uart(rxd), spi0(csn0), uart1(rts), xg(mdio), sata1(present_act), mss_gpio5, wakeup(out_cp2cp) +mpp44 44 gpio, ge1(txd2), uart0(rts), ptp(clk_cp2cp) +mpp45 45 gpio, ge1(txd3), uart0(txd), pcie(rstoutn) +mpp46 46 gpio, ge1(txd1), uart1(rts) +mpp47 47 gpio, ge1(txd0), spi1(clk), uart1(txd), ge(mdc) +mpp48 48 gpio, ge1(txctl_txen), spi1(mosi), xg(mdc), wakeup(in_cp2cp) +mpp49 49 gpio, ge1(txclkout), mii(crs), spi1(miso), uart1(rxd), ge(mdio), pcie0(clkreq), sdio(v18_en), sei(out_cp2cp) +mpp50 50 gpio, ge1(rxclk), mss_i2c(sda), spi1(csn0), uart2(txd), uart0(rxd), xg(mdio), sdio(pwr11) +mpp51 51 gpio, ge1(rxd0), mss_i2c(sck), spi1(csn1), uart2(rxd), uart0(cts), sdio(pwr10) +mpp52 52 gpio, ge1(rxd1), synce1(clk), synce2(clk), spi1(csn2), uart1(cts), led(clk), pcie(rstoutn), pcie0(clkreq) +mpp53 53 gpio, ge1(rxd2), ptp(clk), spi1(csn3), uart1(rxd), led(stb), sdio(led) +mpp54 54 gpio, ge1(rxd3), synce2(clk), ptp(pclk_out), synce1(clk), led(data), sdio(hw_rst), sdio(wr_protect) +mpp55 55 gpio, ge1(rxctl_rxdv), ptp(pulse), sdio(led), sdio(card_detect) +mpp56 56 gpio, tdm(drx), au(i2sdo_spdifo), spi0(clk), uart1(rxd), sata1(present_act), sdio(clk) +mpp57 57 gpio, mss_i2c(sda), ptp(pclk_out), tdm(intn), au(i2sbclk), spi0(mosi), uart1(txd), sata0(present_act), sdio(cmd) +mpp58 58 gpio, mss_i2c(sck), ptp(clk), tdm(rstn), au(i2sdi), spi0(miso), uart1(cts), led(clk), sdio(d0) +mpp59 59 gpio, mss_gpio7, synce2(clk), tdm(fsync), au(i2slrclk), spi0(csn0), uart0(cts), led(stb), uart1(txd), sdio(d1) +mpp60 60 gpio, mss_gpio6, ptp(pulse), tdm(dtx), au(i2smclk), spi0(csn1), uart0(rts), led(data), uart1(rxd), sdio(d2) +mpp61 61 gpio, mss_gpio5, ptp(clk), tdm(pclk), au(i2sextclk), spi0(csn2), uart0(txd), uart2(txd), sata1(present_act), ge(mdio), sdio(d3) +mpp62 62 gpio, mss_gpio4, synce1(clk), ptp(pclk_out), sata1(present_act), spi0(csn3), uart0(rxd), uart2(rxd), sata0(present_act), ge(mdc) + +GPIO: +----- + +For common binding part and usage, refer to +Documentation/devicetree/bindings/gpio/gpio-mvebu.txt. + +Required properties: + +- compatible: "marvell,armada-8k-gpio" + +- offset: offset address inside the syscon block Example: - cpm_syscon0: system-controller@440000 { - compatible = "marvell,cp110-system-controller0", "syscon"; - reg = <0x440000 0x1000>; +cpm_syscon0: system-controller@440000 { + compatible = "syscon", "simple-mfd"; + reg = <0x440000 0x1000>; + + cpm_clk: clock { + compatible = "marvell,cp110-clock"; #clock-cells = <2>; - core-clock-output-names = "cpm-apll", "cpm-ppv2-core", "cpm-eip", "cpm-core", "cpm-nand-core"; - gate-clock-output-names = "cpm-audio", "cpm-communit", "cpm-nand", "cpm-ppv2", "cpm-sdio", - "cpm-mg-domain", "cpm-mg-core", "cpm-xor1", "cpm-xor0", "cpm-gop-dp", "none", - "cpm-pcie_x10", "cpm-pcie_x11", "cpm-pcie_x4", "cpm-pcie-xor", "cpm-sata", - "cpm-sata-usb", "cpm-main", "cpm-sd-mmc-gop", "none", "none", "cpm-slow-io", - "cpm-usb3h0", "cpm-usb3h1", "cpm-usb3dev", "cpm-eip150", "cpm-eip197"; }; + + cpm_pinctrl: pinctrl { + compatible = "marvell,armada-8k-cpm-pinctrl"; + }; + + cpm_gpio1: gpio@100 { + compatible = "marvell,armada-8k-gpio"; + offset = <0x100>; + ngpios = <32>; + gpio-controller; + #gpio-cells = <2>; + gpio-ranges = <&cpm_pinctrl 0 0 32>; + status = "disabled"; + }; + +}; diff --git a/Documentation/devicetree/bindings/arm/mediatek.txt b/Documentation/devicetree/bindings/arm/mediatek.txt index c860b245d8c8..da7bd138e6f2 100644 --- a/Documentation/devicetree/bindings/arm/mediatek.txt +++ b/Documentation/devicetree/bindings/arm/mediatek.txt @@ -12,6 +12,8 @@ compatible: Must contain one of "mediatek,mt6592" "mediatek,mt6755" "mediatek,mt6795" + "mediatek,mt6797" + "mediatek,mt7622" "mediatek,mt7623" "mediatek,mt8127" "mediatek,mt8135" @@ -38,6 +40,12 @@ Supported boards: - Evaluation board for MT6795(Helio X10): Required root node properties: - compatible = "mediatek,mt6795-evb", "mediatek,mt6795"; +- Evaluation board for MT6797(Helio X20): + Required root node properties: + - compatible = "mediatek,mt6797-evb", "mediatek,mt6797"; +- Reference board variant 1 for MT7622: + Required root node properties: + - compatible = "mediatek,mt7622-rfb1", "mediatek,mt7622"; - Evaluation board for MT7623: Required root node properties: - compatible = "mediatek,mt7623-evb", "mediatek,mt7623"; diff --git a/Documentation/devicetree/bindings/arm/realtek.txt b/Documentation/devicetree/bindings/arm/realtek.txt new file mode 100644 index 000000000000..13d755787b4f --- /dev/null +++ b/Documentation/devicetree/bindings/arm/realtek.txt @@ -0,0 +1,20 @@ +Realtek platforms device tree bindings +-------------------------------------- + + +RTD1295 SoC +=========== + +Required root node properties: + + - compatible : must contain "realtek,rtd1295" + + +Root node property compatible must contain, depending on board: + + - Zidoo X9S: "zidoo,x9s" + + +Example: + + compatible = "zidoo,x9s", "realtek,rtd1295"; diff --git a/Documentation/devicetree/bindings/arm/rockchip.txt b/Documentation/devicetree/bindings/arm/rockchip.txt index c965d99e86c2..11c0ac4a2d56 100644 --- a/Documentation/devicetree/bindings/arm/rockchip.txt +++ b/Documentation/devicetree/bindings/arm/rockchip.txt @@ -42,6 +42,10 @@ Rockchip platforms device tree bindings Required root node properties: - compatible = "firefly,firefly-rk3288-reload", "rockchip,rk3288"; +- Firefly Firefly-RK3399 board: + Required root node properties: + - compatible = "firefly,firefly-rk3399", "rockchip,rk3399"; + - ChipSPARK PopMetal-RK3288 board: Required root node properties: - compatible = "chipspark,popmetal-rk3288", "rockchip,rk3288"; @@ -138,9 +142,9 @@ Rockchip platforms device tree bindings Required root node properties: - compatible = "rockchip,px5-evb", "rockchip,px5", "rockchip,rk3368"; -- Rockchip RK1108 Evaluation board +- Rockchip RV1108 Evaluation board Required root node properties: - - compatible = "rockchip,rk1108-evb", "rockchip,rk1108"; + - compatible = "rockchip,rv1108-evb", "rockchip,rv1108"; - Rockchip RK3368 evb: Required root node properties: diff --git a/Documentation/devicetree/bindings/arm/shmobile.txt b/Documentation/devicetree/bindings/arm/shmobile.txt index 170fe0562c63..1a671e329864 100644 --- a/Documentation/devicetree/bindings/arm/shmobile.txt +++ b/Documentation/devicetree/bindings/arm/shmobile.txt @@ -55,12 +55,19 @@ Boards: compatible = "renesas,bockw", "renesas,r8a7778" - Genmai (RTK772100BC00000BR) compatible = "renesas,genmai", "renesas,r7s72100" + - GR-Peach (X28A-M01-E/F) + compatible = "renesas,gr-peach", "renesas,r7s72100" - Gose (RTP0RC7793SEB00010S) compatible = "renesas,gose", "renesas,r8a7793" - - H3ULCB (R-Car Starter Kit Premier, RTP0RC7795SKB00010S) + - H3ULCB (R-Car Starter Kit Premier, RTP0RC7795SKBX0010SA00 (H3 ES1.1)) + H3ULCB (R-Car Starter Kit Premier, RTP0RC77951SKBX010SA00 (H3 ES2.0)) compatible = "renesas,h3ulcb", "renesas,r8a7795"; - Henninger compatible = "renesas,henninger", "renesas,r8a7791" + - iWave Systems RZ/G1M Qseven Development Platform (iW-RainboW-G20D-Qseven) + compatible = "iwave,g20d", "iwave,g20m", "renesas,r8a7743" + - iWave Systems RZ/G1M Qseven System On Module (iW-RainboW-G20M-Qseven) + compatible = "iwave,g20m", "renesas,r8a7743" - Koelsch (RTP0RC7791SEB00010S) compatible = "renesas,koelsch", "renesas,r8a7791" - Kyoto Microcomputer Co. KZM-A9-Dual @@ -69,7 +76,7 @@ Boards: compatible = "renesas,kzm9g", "renesas,sh73a0" - Lager (RTP0RC7790SEB00010S) compatible = "renesas,lager", "renesas,r8a7790" - - M3ULCB (R-Car Starter Kit Pro, RTP0RC7796SKB00010S) + - M3ULCB (R-Car Starter Kit Pro, RTP0RC7796SKBX0010SA09 (M3 ES1.0)) compatible = "renesas,m3ulcb", "renesas,r8a7796"; - Marzen (R0P7779A00010S) compatible = "renesas,marzen", "renesas,r8a7779" @@ -81,6 +88,8 @@ Boards: compatible = "renesas,salvator-x", "renesas,r8a7795"; - Salvator-X (RTP0RC7796SIPB0011S) compatible = "renesas,salvator-x", "renesas,r8a7796"; + - Salvator-XS (Salvator-X 2nd version, RTP0RC7795SIPB0012S) + compatible = "renesas,salvator-xs", "renesas,r8a7795"; - SILK (RTP0RC7794LCB00011S) compatible = "renesas,silk", "renesas,r8a7794" - SK-RZG1E (YR8A77450S000BE) diff --git a/Documentation/devicetree/bindings/arm/tegra.txt b/Documentation/devicetree/bindings/arm/tegra.txt index b5a4342c1d46..7f1411bbabf7 100644 --- a/Documentation/devicetree/bindings/arm/tegra.txt +++ b/Documentation/devicetree/bindings/arm/tegra.txt @@ -29,7 +29,6 @@ board-specific compatible values: nvidia,harmony nvidia,seaboard nvidia,ventana - nvidia,whistler toradex,apalis_t30 toradex,apalis_t30-eval toradex,apalis-tk1 diff --git a/Documentation/devicetree/bindings/arm/topology.txt b/Documentation/devicetree/bindings/arm/topology.txt index 1061faf5f602..de9eb0486630 100644 --- a/Documentation/devicetree/bindings/arm/topology.txt +++ b/Documentation/devicetree/bindings/arm/topology.txt @@ -29,9 +29,9 @@ corresponding to the system hierarchy; syntactically they are defined as device tree nodes. The remainder of this document provides the topology bindings for ARM, based -on the ePAPR standard, available from: +on the Devicetree Specification, available from: -http://www.power.org/documentation/epapr-version-1-1/ +https://www.devicetree.org/specifications/ If not stated otherwise, whenever a reference to a cpu node phandle is made its value must point to a cpu node compliant with the cpu node bindings as diff --git a/Documentation/devicetree/bindings/ata/ahci-fsl-qoriq.txt b/Documentation/devicetree/bindings/ata/ahci-fsl-qoriq.txt index fc33ca01e9ba..7c3ca0e13de0 100644 --- a/Documentation/devicetree/bindings/ata/ahci-fsl-qoriq.txt +++ b/Documentation/devicetree/bindings/ata/ahci-fsl-qoriq.txt @@ -3,7 +3,7 @@ Binding for Freescale QorIQ AHCI SATA Controller Required properties: - reg: Physical base address and size of the controller's register area. - compatible: Compatibility string. Must be 'fsl,<chip>-ahci', where - chip could be ls1021a, ls1043a, ls1046a, ls2080a etc. + chip could be ls1021a, ls1043a, ls1046a, ls1088a, ls2080a etc. - clocks: Input clock specifier. Refer to common clock bindings. - interrupts: Interrupt specifier. Refer to interrupt binding. diff --git a/Documentation/devicetree/bindings/ata/cortina,gemini-sata-bridge.txt b/Documentation/devicetree/bindings/ata/cortina,gemini-sata-bridge.txt new file mode 100644 index 000000000000..1c3d3cc70051 --- /dev/null +++ b/Documentation/devicetree/bindings/ata/cortina,gemini-sata-bridge.txt @@ -0,0 +1,55 @@ +* Cortina Systems Gemini SATA Bridge + +The Gemini SATA bridge in a SoC-internal PATA to SATA bridge that +takes two Faraday Technology FTIDE010 PATA controllers and bridges +them in different configurations to two SATA ports. + +Required properties: +- compatible: should be + "cortina,gemini-sata-bridge" +- reg: registers and size for the block +- resets: phandles to the reset lines for both SATA bridges +- reset-names: must be "sata0", "sata1" +- clocks: phandles to the compulsory peripheral clocks +- clock-names: must be "SATA0_PCLK", "SATA1_PCLK" +- syscon: a phandle to the global Gemini system controller +- cortina,gemini-ata-muxmode: tell the desired multiplexing mode for + the ATA controller and SATA bridges. Values 0..3: + Mode 0: ata0 master <-> sata0 + ata1 master <-> sata1 + ata0 slave interface brought out on IDE pads + Mode 1: ata0 master <-> sata0 + ata1 master <-> sata1 + ata1 slave interface brought out on IDE pads + Mode 2: ata1 master <-> sata1 + ata1 slave <-> sata0 + ata0 master and slave interfaces brought out + on IDE pads + Mode 3: ata0 master <-> sata0 + ata0 slave <-> sata1 + ata1 master and slave interfaces brought out + on IDE pads + +Optional boolean properties: +- cortina,gemini-enable-ide-pins: enables the PATA to IDE connection. + The muxmode setting decides whether ATA0 or ATA1 is brought out, + and whether master, slave or both interfaces get brought out. +- cortina,gemini-enable-sata-bridge: enables the PATA to SATA bridge + inside the Gemnini SoC. The Muxmode decides what PATA blocks will + be muxed out and how. + +Example: + +sata: sata@46000000 { + compatible = "cortina,gemini-sata-bridge"; + reg = <0x46000000 0x100>; + resets = <&rcon 26>, <&rcon 27>; + reset-names = "sata0", "sata1"; + clocks = <&gcc GEMINI_CLK_GATE_SATA0>, + <&gcc GEMINI_CLK_GATE_SATA1>; + clock-names = "SATA0_PCLK", "SATA1_PCLK"; + syscon = <&syscon>; + cortina,gemini-ata-muxmode = <3>; + cortina,gemini-enable-ide-pins; + cortina,gemini-enable-sata-bridge; +}; diff --git a/Documentation/devicetree/bindings/ata/faraday,ftide010.txt b/Documentation/devicetree/bindings/ata/faraday,ftide010.txt new file mode 100644 index 000000000000..a0c64a29104d --- /dev/null +++ b/Documentation/devicetree/bindings/ata/faraday,ftide010.txt @@ -0,0 +1,38 @@ +* Faraday Technology FTIDE010 PATA controller + +This controller is the first Faraday IDE interface block, used in the +StorLink SL2312 and SL3516, later known as the Cortina Systems Gemini +platform. The controller can do PIO modes 0 through 4, Multi-word DMA +(MWDM)modes 0 through 2 and Ultra DMA modes 0 through 6. + +On the Gemini platform, this PATA block is accompanied by a PATA to +SATA bridge in order to support SATA. This is why a phandle to that +controller is compulsory on that platform. + +The timing properties are unique per-SoC, not per-board. + +Required properties: +- compatible: should be one of + "cortina,gemini-pata", "faraday,ftide010" + "faraday,ftide010" +- interrupts: interrupt for the block +- reg: registers and size for the block + +Optional properties: +- clocks: a SoC clock running the peripheral. +- clock-names: should be set to "PCLK" for the peripheral clock. + +Required properties for "cortina,gemini-pata" compatible: +- sata: a phande to the Gemini PATA to SATA bridge, see + cortina,gemini-sata-bridge.txt for details. + +Example: + +ata@63000000 { + compatible = "cortina,gemini-pata", "faraday,ftide010"; + reg = <0x63000000 0x100>; + interrupts = <4 IRQ_TYPE_EDGE_RISING>; + clocks = <&gcc GEMINI_CLK_GATE_IDE>; + clock-names = "PCLK"; + sata = <&sata>; +}; diff --git a/Documentation/devicetree/bindings/bus/brcm,gisb-arb.txt b/Documentation/devicetree/bindings/bus/brcm,gisb-arb.txt index 1eceefb20f01..8a6c3c2e58fe 100644 --- a/Documentation/devicetree/bindings/bus/brcm,gisb-arb.txt +++ b/Documentation/devicetree/bindings/bus/brcm,gisb-arb.txt @@ -3,7 +3,8 @@ Broadcom GISB bus Arbiter controller Required properties: - compatible: - "brcm,gisb-arb" or "brcm,bcm7445-gisb-arb" for 28nm chips + "brcm,bcm7278-gisb-arb" for V7 28nm chips + "brcm,gisb-arb" or "brcm,bcm7445-gisb-arb" for other 28nm chips "brcm,bcm7435-gisb-arb" for newer 40nm chips "brcm,bcm7400-gisb-arb" for older 40nm chips and all 65nm chips "brcm,bcm7038-gisb-arb" for 130nm chips diff --git a/Documentation/devicetree/bindings/bus/simple-pm-bus.txt b/Documentation/devicetree/bindings/bus/simple-pm-bus.txt index d032237512c2..6f15037131ed 100644 --- a/Documentation/devicetree/bindings/bus/simple-pm-bus.txt +++ b/Documentation/devicetree/bindings/bus/simple-pm-bus.txt @@ -10,7 +10,7 @@ enabled for child devices connected to the bus (either on-SoC or externally) to function. While "simple-pm-bus" follows the "simple-bus" set of properties, as specified -in ePAPR, it is not an extension of "simple-bus". +in the Devicetree Specification, it is not an extension of "simple-bus". Required properties: diff --git a/Documentation/devicetree/bindings/chosen.txt b/Documentation/devicetree/bindings/chosen.txt index b5e39af4ddc0..dee3f5d9df26 100644 --- a/Documentation/devicetree/bindings/chosen.txt +++ b/Documentation/devicetree/bindings/chosen.txt @@ -10,7 +10,8 @@ stdout-path property -------------------- Device trees may specify the device to be used for boot console output -with a stdout-path property under /chosen, as described in ePAPR, e.g. +with a stdout-path property under /chosen, as described in the Devicetree +Specification, e.g. / { chosen { diff --git a/Documentation/devicetree/bindings/clock/amlogic,meson8b-clkc.txt b/Documentation/devicetree/bindings/clock/amlogic,meson8b-clkc.txt index 2b7b3fa588d7..606da38c0959 100644 --- a/Documentation/devicetree/bindings/clock/amlogic,meson8b-clkc.txt +++ b/Documentation/devicetree/bindings/clock/amlogic,meson8b-clkc.txt @@ -1,11 +1,14 @@ -* Amlogic Meson8b Clock and Reset Unit +* Amlogic Meson8, Meson8b and Meson8m2 Clock and Reset Unit -The Amlogic Meson8b clock controller generates and supplies clock to various -controllers within the SoC. +The Amlogic Meson8 / Meson8b / Meson8m2 clock controller generates and +supplies clock to various controllers within the SoC. Required Properties: -- compatible: should be "amlogic,meson8b-clkc" +- compatible: must be one of: + - "amlogic,meson8-clkc" for Meson8 (S802) SoCs + - "amlogic,meson8b-clkc" for Meson8 (S805) SoCs + - "amlogic,meson8m2-clkc" for Meson8m2 (S812) SoCs - reg: it must be composed by two tuples: 0) physical base address of the xtal register and length of memory mapped region. diff --git a/Documentation/devicetree/bindings/clock/brcm,iproc-clocks.txt b/Documentation/devicetree/bindings/clock/brcm,iproc-clocks.txt index 6f66e9aa354c..f2c5f0e4a363 100644 --- a/Documentation/devicetree/bindings/clock/brcm,iproc-clocks.txt +++ b/Documentation/devicetree/bindings/clock/brcm,iproc-clocks.txt @@ -219,3 +219,79 @@ BCM63138 -------- PLL and leaf clock compatible strings for BCM63138 are: "brcm,bcm63138-armpll" + +Stingray +----------- +PLL and leaf clock compatible strings for Stingray are: + "brcm,sr-genpll0" + "brcm,sr-genpll1" + "brcm,sr-genpll2" + "brcm,sr-genpll3" + "brcm,sr-genpll4" + "brcm,sr-genpll5" + "brcm,sr-genpll6" + + "brcm,sr-lcpll0" + "brcm,sr-lcpll1" + "brcm,sr-lcpll-pcie" + + +The following table defines the set of PLL/clock index and ID for Stingray. +These clock IDs are defined in: + "include/dt-bindings/clock/bcm-sr.h" + + Clock Source Index ID + --- ----- ----- --------- + crystal N/A N/A N/A + crmu_ref25m crystal N/A N/A + + genpll0 crystal 0 BCM_SR_GENPLL0 + clk_125m genpll0 1 BCM_SR_GENPLL0_125M_CLK + clk_scr genpll0 2 BCM_SR_GENPLL0_SCR_CLK + clk_250 genpll0 3 BCM_SR_GENPLL0_250M_CLK + clk_pcie_axi genpll0 4 BCM_SR_GENPLL0_PCIE_AXI_CLK + clk_paxc_axi_x2 genpll0 5 BCM_SR_GENPLL0_PAXC_AXI_X2_CLK + clk_paxc_axi genpll0 6 BCM_SR_GENPLL0_PAXC_AXI_CLK + + genpll1 crystal 0 BCM_SR_GENPLL1 + clk_pcie_tl genpll1 1 BCM_SR_GENPLL1_PCIE_TL_CLK + clk_mhb_apb genpll1 2 BCM_SR_GENPLL1_MHB_APB_CLK + + genpll2 crystal 0 BCM_SR_GENPLL2 + clk_nic genpll2 1 BCM_SR_GENPLL2_NIC_CLK + clk_ts_500_ref genpll2 2 BCM_SR_GENPLL2_TS_500_REF_CLK + clk_125_nitro genpll2 3 BCM_SR_GENPLL2_125_NITRO_CLK + clk_chimp genpll2 4 BCM_SR_GENPLL2_CHIMP_CLK + clk_nic_flash genpll2 5 BCM_SR_GENPLL2_NIC_FLASH + + genpll3 crystal 0 BCM_SR_GENPLL3 + clk_hsls genpll3 1 BCM_SR_GENPLL3_HSLS_CLK + clk_sdio genpll3 2 BCM_SR_GENPLL3_SDIO_CLK + + genpll4 crystal 0 BCM_SR_GENPLL4 + ccn genpll4 1 BCM_SR_GENPLL4_CCN_CLK + clk_tpiu_pll genpll4 2 BCM_SR_GENPLL4_TPIU_PLL_CLK + noc_clk genpll4 3 BCM_SR_GENPLL4_NOC_CLK + clk_chclk_fs4 genpll4 4 BCM_SR_GENPLL4_CHCLK_FS4_CLK + clk_bridge_fscpu genpll4 5 BCM_SR_GENPLL4_BRIDGE_FSCPU_CLK + + + genpll5 crystal 0 BCM_SR_GENPLL5 + fs4_hf_clk genpll5 1 BCM_SR_GENPLL5_FS4_HF_CLK + crypto_ae_clk genpll5 2 BCM_SR_GENPLL5_CRYPTO_AE_CLK + raid_ae_clk genpll5 3 BCM_SR_GENPLL5_RAID_AE_CLK + + genpll6 crystal 0 BCM_SR_GENPLL6 + 48_usb genpll6 1 BCM_SR_GENPLL6_48_USB_CLK + + lcpll0 crystal 0 BCM_SR_LCPLL0 + clk_sata_refp lcpll0 1 BCM_SR_LCPLL0_SATA_REFP_CLK + clk_sata_refn lcpll0 2 BCM_SR_LCPLL0_SATA_REFN_CLK + clk_usb_ref lcpll0 3 BCM_SR_LCPLL0_USB_REF_CLK + sata_refpn lcpll0 3 BCM_SR_LCPLL0_SATA_REFPN_CLK + + lcpll1 crystal 0 BCM_SR_LCPLL1 + wan lcpll1 1 BCM_SR_LCPLL0_WAN_CLK + + lcpll_pcie crystal 0 BCM_SR_LCPLL_PCIE + pcie_phy_ref lcpll1 1 BCM_SR_LCPLL_PCIE_PHY_REF_CLK diff --git a/Documentation/devicetree/bindings/clock/hi6220-clock.txt b/Documentation/devicetree/bindings/clock/hi6220-clock.txt index e4d5feaebc29..ef3deb7b86ea 100644 --- a/Documentation/devicetree/bindings/clock/hi6220-clock.txt +++ b/Documentation/devicetree/bindings/clock/hi6220-clock.txt @@ -11,6 +11,7 @@ Required Properties: - compatible: the compatible should be one of the following strings to indicate the clock controller functionality. + - "hisilicon,hi6220-acpu-sctrl" - "hisilicon,hi6220-aoctrl" - "hisilicon,hi6220-sysctrl" - "hisilicon,hi6220-mediactrl" diff --git a/Documentation/devicetree/bindings/clock/img,boston-clock.txt b/Documentation/devicetree/bindings/clock/img,boston-clock.txt new file mode 100644 index 000000000000..7bc5e9ffb624 --- /dev/null +++ b/Documentation/devicetree/bindings/clock/img,boston-clock.txt @@ -0,0 +1,31 @@ +Binding for Imagination Technologies MIPS Boston clock sources. + +This binding uses the common clock binding[1]. + +[1] Documentation/devicetree/bindings/clock/clock-bindings.txt + +The device node must be a child node of the syscon node corresponding to the +Boston system's platform registers. + +Required properties: +- compatible : Should be "img,boston-clock". +- #clock-cells : Should be set to 1. + Values available for clock consumers can be found in the header file: + <dt-bindings/clock/boston-clock.h> + +Example: + + system-controller@17ffd000 { + compatible = "img,boston-platform-regs", "syscon"; + reg = <0x17ffd000 0x1000>; + + clk_boston: clock { + compatible = "img,boston-clock"; + #clock-cells = <1>; + }; + }; + + uart0: uart@17ffe000 { + /* ... */ + clocks = <&clk_boston BOSTON_CLK_SYS>; + }; diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc.txt b/Documentation/devicetree/bindings/clock/qcom,gcc.txt index 5b4dfc1ea54f..551d03be9665 100644 --- a/Documentation/devicetree/bindings/clock/qcom,gcc.txt +++ b/Documentation/devicetree/bindings/clock/qcom,gcc.txt @@ -8,6 +8,7 @@ Required properties : "qcom,gcc-apq8084" "qcom,gcc-ipq8064" "qcom,gcc-ipq4019" + "qcom,gcc-ipq8074" "qcom,gcc-msm8660" "qcom,gcc-msm8916" "qcom,gcc-msm8960" diff --git a/Documentation/devicetree/bindings/clock/qoriq-clock.txt b/Documentation/devicetree/bindings/clock/qoriq-clock.txt index 6ed469c66b32..6498e1fdbb33 100644 --- a/Documentation/devicetree/bindings/clock/qoriq-clock.txt +++ b/Documentation/devicetree/bindings/clock/qoriq-clock.txt @@ -57,6 +57,11 @@ Optional properties: - clocks: If clock-frequency is not specified, sysclk may be provided as an input clock. Either clock-frequency or clocks must be provided. + A second input clock, called "coreclk", may be provided if + core PLLs are based on a different input clock from the + platform PLL. +- clock-names: Required if a coreclk is present. Valid names are + "sysclk" and "coreclk". 2. Clock Provider @@ -73,6 +78,7 @@ second cell is the clock index for the specified type. 2 hwaccel index (n in CLKCGnHWACSR) 3 fman 0 for fm1, 1 for fm2 4 platform pll 0=pll, 1=pll/2, 2=pll/3, 3=pll/4 + 5 coreclk must be 0 3. Example diff --git a/Documentation/devicetree/bindings/clock/renesas,cpg-mssr.txt b/Documentation/devicetree/bindings/clock/renesas,cpg-mssr.txt index f4f944d81308..0cd894f987a3 100644 --- a/Documentation/devicetree/bindings/clock/renesas,cpg-mssr.txt +++ b/Documentation/devicetree/bindings/clock/renesas,cpg-mssr.txt @@ -15,6 +15,11 @@ Required Properties: - compatible: Must be one of: - "renesas,r8a7743-cpg-mssr" for the r8a7743 SoC (RZ/G1M) - "renesas,r8a7745-cpg-mssr" for the r8a7745 SoC (RZ/G1E) + - "renesas,r8a7790-cpg-mssr" for the r8a7790 SoC (R-Car H2) + - "renesas,r8a7791-cpg-mssr" for the r8a7791 SoC (R-Car M2-W) + - "renesas,r8a7792-cpg-mssr" for the r8a7792 SoC (R-Car V2H) + - "renesas,r8a7793-cpg-mssr" for the r8a7793 SoC (R-Car M2-N) + - "renesas,r8a7794-cpg-mssr" for the r8a7794 SoC (R-Car E2) - "renesas,r8a7795-cpg-mssr" for the r8a7795 SoC (R-Car H3) - "renesas,r8a7796-cpg-mssr" for the r8a7796 SoC (R-Car M3-W) @@ -24,9 +29,10 @@ Required Properties: - clocks: References to external parent clocks, one entry for each entry in clock-names - clock-names: List of external parent clock names. Valid names are: - - "extal" (r8a7743, r8a7745, r8a7795, r8a7796) + - "extal" (r8a7743, r8a7745, r8a7790, r8a7791, r8a7792, r8a7793, r8a7794, + r8a7795, r8a7796) - "extalr" (r8a7795, r8a7796) - - "usb_extal" (r8a7743, r8a7745) + - "usb_extal" (r8a7743, r8a7745, r8a7790, r8a7791, r8a7793, r8a7794) - #clock-cells: Must be 2 - For CPG core clocks, the two clock specifier cells must be "CPG_CORE" diff --git a/Documentation/devicetree/bindings/clock/rockchip,rk3128-cru.txt b/Documentation/devicetree/bindings/clock/rockchip,rk3128-cru.txt new file mode 100644 index 000000000000..455a9a00a623 --- /dev/null +++ b/Documentation/devicetree/bindings/clock/rockchip,rk3128-cru.txt @@ -0,0 +1,56 @@ +* Rockchip RK3128 Clock and Reset Unit + +The RK3128 clock controller generates and supplies clock to various +controllers within the SoC and also implements a reset controller for SoC +peripherals. + +Required Properties: + +- compatible: should be "rockchip,rk3128-cru" +- reg: physical base address of the controller and length of memory mapped + region. +- #clock-cells: should be 1. +- #reset-cells: should be 1. + +Optional Properties: + +- rockchip,grf: phandle to the syscon managing the "general register files" + If missing pll rates are not changeable, due to the missing pll lock status. + +Each clock is assigned an identifier and client nodes can use this identifier +to specify the clock which they consume. All available clocks are defined as +preprocessor macros in the dt-bindings/clock/rk3128-cru.h headers and can be +used in device tree sources. Similar macros exist for the reset sources in +these files. + +External clocks: + +There are several clocks that are generated outside the SoC. It is expected +that they are defined using standard clock bindings with following +clock-output-names: + - "xin24m" - crystal input - required, + - "ext_i2s" - external I2S clock - optional, + - "gmac_clkin" - external GMAC clock - optional + +Example: Clock controller node: + + cru: cru@20000000 { + compatible = "rockchip,rk3128-cru"; + reg = <0x20000000 0x1000>; + rockchip,grf = <&grf>; + + #clock-cells = <1>; + #reset-cells = <1>; + }; + +Example: UART controller node that consumes the clock generated by the clock + controller: + + uart2: serial@20068000 { + compatible = "rockchip,serial"; + reg = <0x20068000 0x100>; + interrupts = <GIC_SPI 22 IRQ_TYPE_LEVEL_HIGH>; + clock-frequency = <24000000>; + clocks = <&cru SCLK_UART2>, <&cru PCLK_UART2>; + clock-names = "sclk_uart", "pclk_uart"; + }; diff --git a/Documentation/devicetree/bindings/clock/sun8i-de2.txt b/Documentation/devicetree/bindings/clock/sun8i-de2.txt new file mode 100644 index 000000000000..631d27cd89d6 --- /dev/null +++ b/Documentation/devicetree/bindings/clock/sun8i-de2.txt @@ -0,0 +1,31 @@ +Allwinner Display Engine 2.0 Clock Control Binding +-------------------------------------------------- + +Required properties : +- compatible: must contain one of the following compatibles: + - "allwinner,sun8i-a83t-de2-clk" + - "allwinner,sun8i-v3s-de2-clk" + - "allwinner,sun50i-h5-de2-clk" + +- reg: Must contain the registers base address and length +- clocks: phandle to the clocks feeding the display engine subsystem. + Three are needed: + - "mod": the display engine module clock + - "bus": the bus clock for the whole display engine subsystem +- clock-names: Must contain the clock names described just above +- resets: phandle to the reset control for the display engine subsystem. +- #clock-cells : must contain 1 +- #reset-cells : must contain 1 + +Example: +de2_clocks: clock@1000000 { + compatible = "allwinner,sun8i-a83t-de2-clk"; + reg = <0x01000000 0x100000>; + clocks = <&ccu CLK_BUS_DE>, + <&ccu CLK_DE>; + clock-names = "bus", + "mod"; + resets = <&ccu RST_BUS_DE>; + #clock-cells = <1>; + #reset-cells = <1>; +}; diff --git a/Documentation/devicetree/bindings/clock/sunxi-ccu.txt b/Documentation/devicetree/bindings/clock/sunxi-ccu.txt index e9c5a1d9834a..df9fad58facd 100644 --- a/Documentation/devicetree/bindings/clock/sunxi-ccu.txt +++ b/Documentation/devicetree/bindings/clock/sunxi-ccu.txt @@ -6,6 +6,8 @@ Required properties : - "allwinner,sun6i-a31-ccu" - "allwinner,sun8i-a23-ccu" - "allwinner,sun8i-a33-ccu" + - "allwinner,sun8i-a83t-ccu" + - "allwinner,sun8i-a83t-r-ccu" - "allwinner,sun8i-h3-ccu" - "allwinner,sun8i-h3-r-ccu" - "allwinner,sun8i-v3s-ccu" @@ -18,11 +20,13 @@ Required properties : - clocks: phandle to the oscillators feeding the CCU. Two are needed: - "hosc": the high frequency oscillator (usually at 24MHz) - "losc": the low frequency oscillator (usually at 32kHz) + On the A83T, this is the internal 16MHz oscillator divided by 512 - clock-names: Must contain the clock names described just above - #clock-cells : must contain 1 - #reset-cells : must contain 1 -For the PRCM CCUs on H3/A64, one more clock is needed: +For the PRCM CCUs on A83T/H3/A64, two more clocks are needed: +- "pll-periph": the SoC's peripheral PLL from the main CCU - "iosc": the SoC's internal frequency oscillator Example for generic CCU: @@ -39,8 +43,8 @@ Example for PRCM CCU: r_ccu: clock@01f01400 { compatible = "allwinner,sun50i-a64-r-ccu"; reg = <0x01f01400 0x100>; - clocks = <&osc24M>, <&osc32k>, <&iosc>; - clock-names = "hosc", "losc", "iosc"; + clocks = <&osc24M>, <&osc32k>, <&iosc>, <&ccu CLK_PLL_PERIPH0>; + clock-names = "hosc", "losc", "iosc", "pll-periph"; #clock-cells = <1>; #reset-cells = <1>; }; diff --git a/Documentation/devicetree/bindings/clock/ti,sci-clk.txt b/Documentation/devicetree/bindings/clock/ti,sci-clk.txt new file mode 100644 index 000000000000..1e884c40ab50 --- /dev/null +++ b/Documentation/devicetree/bindings/clock/ti,sci-clk.txt @@ -0,0 +1,37 @@ +Texas Instruments TI-SCI Clocks +=============================== + +All clocks on Texas Instruments' SoCs that contain a System Controller, +are only controlled by this entity. Communication between a host processor +running an OS and the System Controller happens through a protocol known +as TI-SCI[1]. This clock implementation plugs into the common clock +framework and makes use of the TI-SCI protocol on clock API requests. + +[1] Documentation/devicetree/bindings/arm/keystone/ti,sci.txt + +Required properties: +------------------- +- compatible: Must be "ti,k2g-sci-clk" +- #clock-cells: Shall be 2. + In clock consumers, this cell represents the device ID and clock ID + exposed by the PM firmware. The assignments can be found in the header + files <dt-bindings/genpd/<soc>.h> (which covers the device IDs) and + <dt-bindings/clock/<soc>.h> (which covers the clock IDs), where <soc> + is the SoC involved, for example 'k2g'. + +Examples: +-------- + +pmmc: pmmc { + compatible = "ti,k2g-sci"; + + k2g_clks: clocks { + compatible = "ti,k2g-sci-clk"; + #clock-cells = <2>; + }; +}; + +uart0: serial@2530c00 { + compatible = "ns16550a"; + clocks = <&k2g_clks 0x2c 0>; +}; diff --git a/Documentation/devicetree/bindings/clock/ti-clkctrl.txt b/Documentation/devicetree/bindings/clock/ti-clkctrl.txt new file mode 100644 index 000000000000..48ee6991f2cc --- /dev/null +++ b/Documentation/devicetree/bindings/clock/ti-clkctrl.txt @@ -0,0 +1,56 @@ +Texas Instruments clkctrl clock binding + +Texas Instruments SoCs can have a clkctrl clock controller for each +interconnect target module. The clkctrl clock controller manages functional +and interface clocks for each module. Each clkctrl controller can also +gate one or more optional functional clocks for a module, and can have one +or more clock muxes. There is a clkctrl clock controller typically for each +interconnect target module on omap4 and later variants. + +The clock consumers can specify the index of the clkctrl clock using +the hardware offset from the clkctrl instance register space. The optional +clocks can be specified by clkctrl hardware offset and the index of the +optional clock. + +For more information, please see the Linux clock framework binding at +Documentation/devicetree/bindings/clock/clock-bindings.txt. + +Required properties : +- compatible : shall be "ti,clkctrl" +- #clock-cells : shall contain 2 with the first entry being the instance + offset from the clock domain base and the second being the + clock index + +Example: Clock controller node on omap 4430: + +&cm2 { + l4per: cm@1400 { + cm_l4per@0 { + cm_l4per_clkctrl: clk@20 { + compatible = "ti,clkctrl"; + reg = <0x20 0x1b0>; + #clock-cells = <2>; + }; + }; + }; +}; + +Example: Preprocessor helper macros in dt-bindings/clock/ti-clkctrl.h + +#define OMAP4_CLKCTRL_OFFSET 0x20 +#define OMAP4_CLKCTRL_INDEX(offset) ((offset) - OMAP4_CLKCTRL_OFFSET) +#define MODULEMODE_HWCTRL 1 +#define MODULEMODE_SWCTRL 2 + +#define OMAP4_GPTIMER10_CLKTRL OMAP4_CLKCTRL_INDEX(0x28) +#define OMAP4_GPTIMER11_CLKTRL OMAP4_CLKCTRL_INDEX(0x30) +#define OMAP4_GPTIMER2_CLKTRL OMAP4_CLKCTRL_INDEX(0x38) +... +#define OMAP4_GPIO2_CLKCTRL OMAP_CLKCTRL_INDEX(0x60) + +Example: Clock consumer node for GPIO2: + +&gpio2 { + clocks = <&cm_l4per_clkctrl OMAP4_GPIO2_CLKCTRL 0 + &cm_l4per_clkctrl OMAP4_GPIO2_CLKCTRL 8>; +}; diff --git a/Documentation/devicetree/bindings/common-properties.txt b/Documentation/devicetree/bindings/common-properties.txt index 3193979b1d05..697714f8d75c 100644 --- a/Documentation/devicetree/bindings/common-properties.txt +++ b/Documentation/devicetree/bindings/common-properties.txt @@ -1,6 +1,6 @@ Common properties -The ePAPR specification does not define any properties related to hardware +The Devicetree Specification does not define any properties related to hardware byteswapping, but endianness issues show up frequently in porting Linux to different machine types. This document attempts to provide a consistent way of handling byteswapping across drivers. diff --git a/Documentation/devicetree/bindings/cpufreq/ti-cpufreq.txt b/Documentation/devicetree/bindings/cpufreq/ti-cpufreq.txt index ba0e15ad5bd9..0c38e4b8fc51 100644 --- a/Documentation/devicetree/bindings/cpufreq/ti-cpufreq.txt +++ b/Documentation/devicetree/bindings/cpufreq/ti-cpufreq.txt @@ -63,64 +63,64 @@ cpu0_opp_table: opp-table { * because they can not be enabled simultaneously on a * single SoC. */ - opp50@300000000 { + opp50-300000000 { opp-hz = /bits/ 64 <300000000>; opp-microvolt = <950000 931000 969000>; opp-supported-hw = <0x06 0x0010>; opp-suspend; }; - opp100@275000000 { + opp100-275000000 { opp-hz = /bits/ 64 <275000000>; opp-microvolt = <1100000 1078000 1122000>; opp-supported-hw = <0x01 0x00FF>; opp-suspend; }; - opp100@300000000 { + opp100-300000000 { opp-hz = /bits/ 64 <300000000>; opp-microvolt = <1100000 1078000 1122000>; opp-supported-hw = <0x06 0x0020>; opp-suspend; }; - opp100@500000000 { + opp100-500000000 { opp-hz = /bits/ 64 <500000000>; opp-microvolt = <1100000 1078000 1122000>; opp-supported-hw = <0x01 0xFFFF>; }; - opp100@600000000 { + opp100-600000000 { opp-hz = /bits/ 64 <600000000>; opp-microvolt = <1100000 1078000 1122000>; opp-supported-hw = <0x06 0x0040>; }; - opp120@600000000 { + opp120-600000000 { opp-hz = /bits/ 64 <600000000>; opp-microvolt = <1200000 1176000 1224000>; opp-supported-hw = <0x01 0xFFFF>; }; - opp120@720000000 { + opp120-720000000 { opp-hz = /bits/ 64 <720000000>; opp-microvolt = <1200000 1176000 1224000>; opp-supported-hw = <0x06 0x0080>; }; - oppturbo@720000000 { + oppturbo-720000000 { opp-hz = /bits/ 64 <720000000>; opp-microvolt = <1260000 1234800 1285200>; opp-supported-hw = <0x01 0xFFFF>; }; - oppturbo@800000000 { + oppturbo-800000000 { opp-hz = /bits/ 64 <800000000>; opp-microvolt = <1260000 1234800 1285200>; opp-supported-hw = <0x06 0x0100>; }; - oppnitro@1000000000 { + oppnitro-1000000000 { opp-hz = /bits/ 64 <1000000000>; opp-microvolt = <1325000 1298500 1351500>; opp-supported-hw = <0x04 0x0200>; diff --git a/Documentation/devicetree/bindings/crypto/fsl-sec4.txt b/Documentation/devicetree/bindings/crypto/fsl-sec4.txt index 10a425f451fc..7aef0eae58d4 100644 --- a/Documentation/devicetree/bindings/crypto/fsl-sec4.txt +++ b/Documentation/devicetree/bindings/crypto/fsl-sec4.txt @@ -118,8 +118,8 @@ PROPERTIES Definition: A list of clock name strings in the same order as the clocks property. - Note: All other standard properties (see the ePAPR) are allowed - but are optional. + Note: All other standard properties (see the Devicetree Specification) + are allowed but are optional. EXAMPLE diff --git a/Documentation/devicetree/bindings/crypto/fsl-sec6.txt b/Documentation/devicetree/bindings/crypto/fsl-sec6.txt index baf8a3c1b469..73b0eb950bb3 100644 --- a/Documentation/devicetree/bindings/crypto/fsl-sec6.txt +++ b/Documentation/devicetree/bindings/crypto/fsl-sec6.txt @@ -55,8 +55,8 @@ PROPERTIES triplet that includes the child address, parent address, & length. - Note: All other standard properties (see the ePAPR) are allowed - but are optional. + Note: All other standard properties (see the Devicetree Specification) + are allowed but are optional. EXAMPLE crypto@a0000 { diff --git a/Documentation/devicetree/bindings/crypto/inside-secure-safexcel.txt b/Documentation/devicetree/bindings/crypto/inside-secure-safexcel.txt new file mode 100644 index 000000000000..f69773f4252b --- /dev/null +++ b/Documentation/devicetree/bindings/crypto/inside-secure-safexcel.txt @@ -0,0 +1,29 @@ +Inside Secure SafeXcel cryptographic engine + +Required properties: +- compatible: Should be "inside-secure,safexcel-eip197". +- reg: Base physical address of the engine and length of memory mapped region. +- interrupts: Interrupt numbers for the rings and engine. +- interrupt-names: Should be "ring0", "ring1", "ring2", "ring3", "eip", "mem". + +Optional properties: +- clocks: Reference to the crypto engine clock. +- dma-mask: The address mask limitation. Defaults to 64. + +Example: + + crypto: crypto@800000 { + compatible = "inside-secure,safexcel-eip197"; + reg = <0x800000 0x200000>; + interrupts = <GIC_SPI 34 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 54 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 55 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 56 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 57 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 58 IRQ_TYPE_LEVEL_HIGH>; + interrupt-names = "mem", "ring0", "ring1", "ring2", "ring3", + "eip"; + clocks = <&cpm_syscon0 1 26>; + dma-mask = <0xff 0xffffffff>; + status = "disabled"; + }; diff --git a/Documentation/devicetree/bindings/crypto/mediatek-crypto.txt b/Documentation/devicetree/bindings/crypto/mediatek-crypto.txt index c204725e5873..450da3661cad 100644 --- a/Documentation/devicetree/bindings/crypto/mediatek-crypto.txt +++ b/Documentation/devicetree/bindings/crypto/mediatek-crypto.txt @@ -6,8 +6,7 @@ Required properties: - interrupts: Should contain the five crypto engines interrupts in numeric order. These are global system and four descriptor rings. - clocks: the clock used by the core -- clock-names: the names of the clock listed in the clocks property. These are - "ethif", "cryp" +- clock-names: Must contain "cryp". - power-domains: Must contain a reference to the PM domain. @@ -20,8 +19,7 @@ Example: <GIC_SPI 84 IRQ_TYPE_LEVEL_LOW>, <GIC_SPI 91 IRQ_TYPE_LEVEL_LOW>, <GIC_SPI 97 IRQ_TYPE_LEVEL_LOW>; - clocks = <&topckgen CLK_TOP_ETHIF_SEL>, - <ðsys CLK_ETHSYS_CRYPTO>; - clock-names = "ethif","cryp"; + clocks = <ðsys CLK_ETHSYS_CRYPTO>; + clock-names = "cryp"; power-domains = <&scpsys MT2701_POWER_DOMAIN_ETH>; }; diff --git a/Documentation/devicetree/bindings/display/brcm,bcm-vc4.txt b/Documentation/devicetree/bindings/display/brcm,bcm-vc4.txt index ca02d3e4db91..284e2b14cfbe 100644 --- a/Documentation/devicetree/bindings/display/brcm,bcm-vc4.txt +++ b/Documentation/devicetree/bindings/display/brcm,bcm-vc4.txt @@ -5,7 +5,7 @@ with HDMI output and the HVS (Hardware Video Scaler) for compositing display planes. Required properties for VC4: -- compatible: Should be "brcm,bcm2835-vc4" +- compatible: Should be "brcm,bcm2835-vc4" or "brcm,cygnus-vc4" Required properties for Pixel Valve: - compatible: Should be one of "brcm,bcm2835-pixelvalve0", @@ -54,11 +54,14 @@ Required properties for VEC: See bindings/interrupt-controller/brcm,bcm2835-armctrl-ic.txt Required properties for V3D: -- compatible: Should be "brcm,bcm2835-v3d" +- compatible: Should be "brcm,bcm2835-v3d" or "brcm,cygnus-v3d" - reg: Physical base address and length of the V3D's registers - interrupts: The interrupt number See bindings/interrupt-controller/brcm,bcm2835-armctrl-ic.txt +Optional properties for V3D: +- clocks: The clock the unit runs on + Required properties for DSI: - compatible: Should be "brcm,bcm2835-dsi0" or "brcm,bcm2835-dsi1" - reg: Physical base address and length of the DSI block's registers diff --git a/Documentation/devicetree/bindings/display/bridge/adi,adv7511.txt b/Documentation/devicetree/bindings/display/bridge/adi,adv7511.txt index 00ea670b8c4d..06668bca7ffc 100644 --- a/Documentation/devicetree/bindings/display/bridge/adi,adv7511.txt +++ b/Documentation/devicetree/bindings/display/bridge/adi,adv7511.txt @@ -78,6 +78,7 @@ graph bindings specified in Documentation/devicetree/bindings/graph.txt. remote endpoint phandle should be a reference to a valid mipi_dsi_host device node. - Video port 1 for the HDMI output +- Audio port 2 for the HDMI audio input Example @@ -112,5 +113,12 @@ Example remote-endpoint = <&hdmi_connector_in>; }; }; + + port@2 { + reg = <2>; + codec_endpoint: endpoint { + remote-endpoint = <&i2s0_cpu_endpoint>; + }; + }; }; }; diff --git a/Documentation/devicetree/bindings/display/bridge/renesas,dw-hdmi.txt b/Documentation/devicetree/bindings/display/bridge/renesas,dw-hdmi.txt index f6b3f36d422b..81b68580e199 100644 --- a/Documentation/devicetree/bindings/display/bridge/renesas,dw-hdmi.txt +++ b/Documentation/devicetree/bindings/display/bridge/renesas,dw-hdmi.txt @@ -25,7 +25,8 @@ Required properties: - clock-names: Shall contain "iahb" and "isfr" as defined in dw_hdmi.txt. - ports: See dw_hdmi.txt. The DWC HDMI shall have one port numbered 0 corresponding to the video input of the controller and one port numbered 1 - corresponding to its HDMI output. Each port shall have a single endpoint. + corresponding to its HDMI output, and one port numbered 2 corresponding to + sound input of the controller. Each port shall have a single endpoint. Optional properties: @@ -59,6 +60,12 @@ Example: remote-endpoint = <&hdmi0_con>; }; }; + port@2 { + reg = <2>; + rcar_dw_hdmi0_sound_in: endpoint { + remote-endpoint = <&hdmi_sound_out>; + }; + }; }; }; diff --git a/Documentation/devicetree/bindings/display/exynos/exynos5433-decon.txt b/Documentation/devicetree/bindings/display/exynos/exynos5433-decon.txt index c9fd7b3807e7..549c538b38a5 100644 --- a/Documentation/devicetree/bindings/display/exynos/exynos5433-decon.txt +++ b/Documentation/devicetree/bindings/display/exynos/exynos5433-decon.txt @@ -8,12 +8,13 @@ Required properties: - compatible: value should be one of: "samsung,exynos5433-decon", "samsung,exynos5433-decon-tv"; - reg: physical base address and length of the DECON registers set. -- interrupts: should contain a list of all DECON IP block interrupts in the - order: VSYNC, LCD_SYSTEM. The interrupt specifier format - depends on the interrupt controller used. -- interrupt-names: should contain the interrupt names: "vsync", "lcd_sys" - in the same order as they were listed in the interrupts - property. +- interrupt-names: should contain the interrupt names depending on mode of work: + video mode: "vsync", + command mode: "lcd_sys", + command mode with software trigger: "lcd_sys", "te". +- interrupts or interrupts-extended: list of interrupt specifiers corresponding + to names privided in interrupt-names, as described in + interrupt-controller/interrupts.txt - clocks: must include clock specifiers corresponding to entries in the clock-names property. - clock-names: list of clock names sorted in the same order as the clocks diff --git a/Documentation/devicetree/bindings/display/panel/auo,p320hvn03.txt b/Documentation/devicetree/bindings/display/panel/auo,p320hvn03.txt new file mode 100644 index 000000000000..59bb6cd8aa75 --- /dev/null +++ b/Documentation/devicetree/bindings/display/panel/auo,p320hvn03.txt @@ -0,0 +1,8 @@ +AU Optronics Corporation 31.5" FHD (1920x1080) TFT LCD panel + +Required properties: +- compatible: should be "auo,p320hvn03" +- power-supply: as specified in the base binding + +This binding is compatible with the simple-panel binding, which is specified +in simple-panel.txt in this directory. diff --git a/Documentation/devicetree/bindings/display/panel/display-timing.txt b/Documentation/devicetree/bindings/display/panel/display-timing.txt index 81a75893d1b8..58fa3e48481d 100644 --- a/Documentation/devicetree/bindings/display/panel/display-timing.txt +++ b/Documentation/devicetree/bindings/display/panel/display-timing.txt @@ -57,11 +57,11 @@ can be specified. The parameters are defined as: +----------+-------------------------------------+----------+-------+ - | | โ | | | + | | ^ | | | | | |vback_porch | | | - | | โ | | | + | | v | | | +----------#######################################----------+-------+ - | # โ # | | + | # ^ # | | | # | # | | | hback # | # hfront | hsync | | porch # | hactive # porch | len | @@ -69,15 +69,15 @@ The parameters are defined as: | # | # | | | # |vactive # | | | # | # | | - | # โ # | | + | # v # | | +----------#######################################----------+-------+ - | | โ | | | + | | ^ | | | | | |vfront_porch | | | - | | โ | | | + | | v | | | +----------+-------------------------------------+----------+-------+ - | | โ | | | + | | ^ | | | | | |vsync_len | | | - | | โ | | | + | | v | | | +----------+-------------------------------------+----------+-------+ Example: diff --git a/Documentation/devicetree/bindings/display/panel/innolux,p079zca.txt b/Documentation/devicetree/bindings/display/panel/innolux,p079zca.txt new file mode 100644 index 000000000000..5c70a8380e58 --- /dev/null +++ b/Documentation/devicetree/bindings/display/panel/innolux,p079zca.txt @@ -0,0 +1,23 @@ +Innolux P079ZCA 7.85" 768x1024 TFT LCD panel + +Required properties: +- compatible: should be "innolux,p079zca" +- reg: DSI virtual channel of the peripheral +- power-supply: phandle of the regulator that provides the supply voltage +- enable-gpios: panel enable gpio + +Optional properties: +- backlight: phandle of the backlight device attached to the panel + +Example: + + &mipi_dsi { + panel { + compatible = "innolux,p079zca"; + reg = <0>; + power-supply = <...>; + backlight = <&backlight>; + enable-gpios = <&gpio1 13 GPIO_ACTIVE_HIGH>; + status = "okay"; + }; + }; diff --git a/Documentation/devicetree/bindings/display/panel/nec,nl12880b20-05.txt b/Documentation/devicetree/bindings/display/panel/nec,nl12880b20-05.txt new file mode 100644 index 000000000000..71cbc49ecfab --- /dev/null +++ b/Documentation/devicetree/bindings/display/panel/nec,nl12880b20-05.txt @@ -0,0 +1,8 @@ +NEC LCD Technologies, Ltd. 12.1" WXGA (1280x800) LVDS TFT LCD panel + +Required properties: +- compatible: should be "nec,nl12880bc20-05" +- power-supply: as specified in the base binding + +This binding is compatible with the simple-panel binding, which is specified +in simple-panel.txt in this directory. diff --git a/Documentation/devicetree/bindings/display/panel/nlt,nl192108ac18-02d.txt b/Documentation/devicetree/bindings/display/panel/nlt,nl192108ac18-02d.txt new file mode 100644 index 000000000000..1a639fd8778d --- /dev/null +++ b/Documentation/devicetree/bindings/display/panel/nlt,nl192108ac18-02d.txt @@ -0,0 +1,8 @@ +NLT Technologies, Ltd. 15.6" FHD (1920x1080) LVDS TFT LCD panel + +Required properties: +- compatible: should be "nlt,nl192108ac18-02d" +- power-supply: as specified in the base binding + +This binding is compatible with the simple-panel binding, which is specified +in simple-panel.txt in this directory. diff --git a/Documentation/devicetree/bindings/display/panel/samsung,s6e3ha2.txt b/Documentation/devicetree/bindings/display/panel/samsung,s6e3ha2.txt index 18854f4c8376..4acea25c244b 100644 --- a/Documentation/devicetree/bindings/display/panel/samsung,s6e3ha2.txt +++ b/Documentation/devicetree/bindings/display/panel/samsung,s6e3ha2.txt @@ -1,7 +1,10 @@ Samsung S6E3HA2 5.7" 1440x2560 AMOLED panel +Samsung S6E3HF2 5.65" 1600x2560 AMOLED panel Required properties: - - compatible: "samsung,s6e3ha2" + - compatible: should be one of: + "samsung,s6e3ha2", + "samsung,s6e3hf2". - reg: the virtual channel number of a DSI peripheral - vdd3-supply: I/O voltage supply - vci-supply: voltage supply for analog circuits diff --git a/Documentation/devicetree/bindings/display/st,stm32-ltdc.txt b/Documentation/devicetree/bindings/display/st,stm32-ltdc.txt new file mode 100644 index 000000000000..8e1476941c0f --- /dev/null +++ b/Documentation/devicetree/bindings/display/st,stm32-ltdc.txt @@ -0,0 +1,36 @@ +* STMicroelectronics STM32 lcd-tft display controller + +- ltdc: lcd-tft display controller host + must be a sub-node of st-display-subsystem + Required properties: + - compatible: "st,stm32-ltdc" + - reg: Physical base address of the IP registers and length of memory mapped region. + - clocks: A list of phandle + clock-specifier pairs, one for each + entry in 'clock-names'. + - clock-names: A list of clock names. For ltdc it should contain: + - "lcd" for the clock feeding the output pixel clock & IP clock. + - resets: reset to be used by the device (defined by use of RCC macro). + Required nodes: + - Video port for RGB output. + +Example: + +/ { + ... + soc { + ... + ltdc: display-controller@40016800 { + compatible = "st,stm32-ltdc"; + reg = <0x40016800 0x200>; + interrupts = <88>, <89>; + resets = <&rcc STM32F4_APB2_RESET(LTDC)>; + clocks = <&rcc 1 CLK_LCD>; + clock-names = "lcd"; + + port { + ltdc_out_rgb: endpoint { + }; + }; + }; + }; +}; diff --git a/Documentation/devicetree/bindings/display/sunxi/sun4i-drm.txt b/Documentation/devicetree/bindings/display/sunxi/sun4i-drm.txt index 57a8d0610062..b83e6018041d 100644 --- a/Documentation/devicetree/bindings/display/sunxi/sun4i-drm.txt +++ b/Documentation/devicetree/bindings/display/sunxi/sun4i-drm.txt @@ -4,6 +4,44 @@ Allwinner A10 Display Pipeline The Allwinner A10 Display pipeline is composed of several components that are going to be documented below: +For the input port of all components up to the TCON in the display +pipeline, if there are multiple components, the local endpoint IDs +must correspond to the index of the upstream block. For example, if +the remote endpoint is Frontend 1, then the local endpoint ID must +be 1. + +Conversely, for the output ports of the same group, the remote endpoint +ID must be the index of the local hardware block. If the local backend +is backend 1, then the remote endpoint ID must be 1. + +HDMI Encoder +------------ + +The HDMI Encoder supports the HDMI video and audio outputs, and does +CEC. It is one end of the pipeline. + +Required properties: + - compatible: value must be one of: + * allwinner,sun5i-a10s-hdmi + - reg: base address and size of memory-mapped region + - interrupts: interrupt associated to this IP + - clocks: phandles to the clocks feeding the HDMI encoder + * ahb: the HDMI interface clock + * mod: the HDMI module clock + * pll-0: the first video PLL + * pll-1: the second video PLL + - clock-names: the clock names mentioned above + - dmas: phandles to the DMA channels used by the HDMI encoder + * ddc-tx: The channel for DDC transmission + * ddc-rx: The channel for DDC reception + * audio-tx: The channel used for audio transmission + - dma-names: the channel names mentioned above + + - ports: A ports node with endpoint definitions as defined in + Documentation/devicetree/bindings/media/video-interfaces.txt. The + first port should be the input endpoint. The second should be the + output, usually to an HDMI connector. + TV Encoder ---------- @@ -31,6 +69,7 @@ Required properties: * allwinner,sun6i-a31-tcon * allwinner,sun6i-a31s-tcon * allwinner,sun8i-a33-tcon + * allwinner,sun8i-v3s-tcon - reg: base address and size of memory-mapped region - interrupts: interrupt associated to this IP - clocks: phandles to the clocks feeding the TCON. Three are needed: @@ -47,12 +86,15 @@ Required properties: Documentation/devicetree/bindings/media/video-interfaces.txt. The first port should be the input endpoint, the second one the output - The output should have two endpoints. The first is the block - connected to the TCON channel 0 (usually a panel or a bridge), the - second the block connected to the TCON channel 1 (usually the TV - encoder) + The output may have multiple endpoints. The TCON has two channels, + usually with the first channel being used for the panels interfaces + (RGB, LVDS, etc.), and the second being used for the outputs that + require another controller (TV Encoder, HDMI, etc.). The endpoints + will take an extra property, allwinner,tcon-channel, to specify the + channel the endpoint is associated to. If that property is not + present, the endpoint number will be used as the channel number. -On SoCs other than the A33, there is one more clock required: +On SoCs other than the A33 and V3s, there is one more clock required: - 'tcon-ch1': The clock driving the TCON channel 1 DRC @@ -138,6 +180,26 @@ Required properties: Documentation/devicetree/bindings/media/video-interfaces.txt. The first port should be the input endpoints, the second one the outputs +Display Engine 2.0 Mixer +------------------------ + +The DE2 mixer have many functionalities, currently only layer blending is +supported. + +Required properties: + - compatible: value must be one of: + * allwinner,sun8i-v3s-de2-mixer + - reg: base address and size of the memory-mapped region. + - clocks: phandles to the clocks feeding the mixer + * bus: the mixer interface clock + * mod: the mixer module clock + - clock-names: the clock names mentioned above + - resets: phandles to the reset controllers driving the mixer + +- ports: A ports node with endpoint definitions as defined in + Documentation/devicetree/bindings/media/video-interfaces.txt. The + first port should be the input endpoints, the second one the output + Display Engine Pipeline ----------------------- @@ -148,13 +210,15 @@ extra node. Required properties: - compatible: value must be one of: + * allwinner,sun5i-a10s-display-engine * allwinner,sun5i-a13-display-engine * allwinner,sun6i-a31-display-engine * allwinner,sun6i-a31s-display-engine * allwinner,sun8i-a33-display-engine + * allwinner,sun8i-v3s-display-engine - allwinner,pipelines: list of phandle to the display engine - frontends available. + frontends (DE 1.0) or mixers (DE 2.0) available. Example: @@ -173,6 +237,57 @@ panel: panel { }; }; +connector { + compatible = "hdmi-connector"; + type = "a"; + + port { + hdmi_con_in: endpoint { + remote-endpoint = <&hdmi_out_con>; + }; + }; +}; + +hdmi: hdmi@01c16000 { + compatible = "allwinner,sun5i-a10s-hdmi"; + reg = <0x01c16000 0x1000>; + interrupts = <58>; + clocks = <&ccu CLK_AHB_HDMI>, <&ccu CLK_HDMI>, + <&ccu CLK_PLL_VIDEO0_2X>, + <&ccu CLK_PLL_VIDEO1_2X>; + clock-names = "ahb", "mod", "pll-0", "pll-1"; + dmas = <&dma SUN4I_DMA_NORMAL 16>, + <&dma SUN4I_DMA_NORMAL 16>, + <&dma SUN4I_DMA_DEDICATED 24>; + dma-names = "ddc-tx", "ddc-rx", "audio-tx"; + status = "disabled"; + + ports { + #address-cells = <1>; + #size-cells = <0>; + + port@0 { + #address-cells = <1>; + #size-cells = <0>; + reg = <0>; + + hdmi_in_tcon0: endpoint { + remote-endpoint = <&tcon0_out_hdmi>; + }; + }; + + port@1 { + #address-cells = <1>; + #size-cells = <0>; + reg = <1>; + + hdmi_out_con: endpoint { + remote-endpoint = <&hdmi_con_in>; + }; + }; + }; +}; + tve0: tv-encoder@01c0a000 { compatible = "allwinner,sun4i-a10-tv-encoder"; reg = <0x01c0a000 0x1000>; diff --git a/Documentation/devicetree/bindings/display/zte,vou.txt b/Documentation/devicetree/bindings/display/zte,vou.txt index 9c356284232b..38476475fd60 100644 --- a/Documentation/devicetree/bindings/display/zte,vou.txt +++ b/Documentation/devicetree/bindings/display/zte,vou.txt @@ -58,6 +58,18 @@ Required properties: integer cells. The first cell is the offset of SYSCTRL register used to control TV Encoder DAC power, and the second cell is the bit mask. +* VGA output device + +Required properties: + - compatible: should be "zte,zx296718-vga" + - reg: Physical base address and length of the VGA device IO region + - interrupts : VGA interrupt number to CPU + - clocks: Phandle with clock-specifier pointing to VGA I2C clock. + - clock-names: Must be "i2c_wclk". + - zte,vga-power-control: the phandle to SYSCTRL block followed by two + integer cells. The first cell is the offset of SYSCTRL register used + to control VGA DAC power, and the second cell is the bit mask. + Example: vou: vou@1440000 { @@ -81,6 +93,15 @@ vou: vou@1440000 { "main_wclk", "aux_wclk"; }; + vga: vga@8000 { + compatible = "zte,zx296718-vga"; + reg = <0x8000 0x1000>; + interrupts = <GIC_SPI 86 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&topcrm VGA_I2C_WCLK>; + clock-names = "i2c_wclk"; + zte,vga-power-control = <&sysctrl 0x170 0xe0>; + }; + hdmi: hdmi@c000 { compatible = "zte,zx296718-hdmi"; reg = <0xc000 0x4000>; diff --git a/Documentation/devicetree/bindings/dma/arm-pl08x.txt b/Documentation/devicetree/bindings/dma/arm-pl08x.txt index 8a0097a029d3..0ba81f79266f 100644 --- a/Documentation/devicetree/bindings/dma/arm-pl08x.txt +++ b/Documentation/devicetree/bindings/dma/arm-pl08x.txt @@ -3,6 +3,11 @@ Required properties: - compatible: "arm,pl080", "arm,primecell"; "arm,pl081", "arm,primecell"; + "faraday,ftdmac020", "arm,primecell" +- arm,primecell-periphid: on the FTDMAC020 the primecell ID is not hard-coded + in the hardware and must be specified here as <0x0003b080>. This number + follows the PrimeCell standard numbering using the JEP106 vendor code 0x38 + for Faraday Technology. - reg: Address range of the PL08x registers - interrupt: The PL08x interrupt number - clocks: The clock running the IP core clock @@ -20,8 +25,8 @@ Optional properties: - dma-requests: contains the total number of DMA requests supported by the DMAC - memcpy-burst-size: the size of the bursts for memcpy: 1, 4, 8, 16, 32 64, 128 or 256 bytes are legal values -- memcpy-bus-width: the bus width used for memcpy: 8, 16 or 32 are legal - values +- memcpy-bus-width: the bus width used for memcpy in bits: 8, 16 or 32 are legal + values, the Faraday FTDMAC020 can also accept 64 bits Clients Required properties: diff --git a/Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt b/Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt new file mode 100644 index 000000000000..092913a28457 --- /dev/null +++ b/Documentation/devicetree/bindings/dma/brcm,iproc-sba.txt @@ -0,0 +1,29 @@ +* Broadcom SBA RAID engine + +Required properties: +- compatible: Should be one of the following + "brcm,iproc-sba" + "brcm,iproc-sba-v2" + The "brcm,iproc-sba" has support for only 6 PQ coefficients + The "brcm,iproc-sba-v2" has support for only 30 PQ coefficients +- mboxes: List of phandle and mailbox channel specifiers + +Example: + +raid_mbox: mbox@67400000 { + ... + #mbox-cells = <3>; + ... +}; + +raid0 { + compatible = "brcm,iproc-sba-v2"; + mboxes = <&raid_mbox 0 0x1 0xffff>, + <&raid_mbox 1 0x1 0xffff>, + <&raid_mbox 2 0x1 0xffff>, + <&raid_mbox 3 0x1 0xffff>, + <&raid_mbox 4 0x1 0xffff>, + <&raid_mbox 5 0x1 0xffff>, + <&raid_mbox 6 0x1 0xffff>, + <&raid_mbox 7 0x1 0xffff>; +}; diff --git a/Documentation/devicetree/bindings/dma/renesas,rcar-dmac.txt b/Documentation/devicetree/bindings/dma/renesas,rcar-dmac.txt index 3316a9c2e638..79a204d50234 100644 --- a/Documentation/devicetree/bindings/dma/renesas,rcar-dmac.txt +++ b/Documentation/devicetree/bindings/dma/renesas,rcar-dmac.txt @@ -30,8 +30,9 @@ Required Properties: - interrupts: interrupt specifiers for the DMAC, one for each entry in interrupt-names. -- interrupt-names: one entry per channel, named "ch%u", where %u is the - channel number ranging from zero to the number of channels minus one. +- interrupt-names: one entry for the error interrupt, named "error", plus one + entry per channel, named "ch%u", where %u is the channel number ranging from + zero to the number of channels minus one. - clock-names: "fck" for the functional clock - clocks: a list of phandle + clock-specifier pairs, one for each entry diff --git a/Documentation/devicetree/bindings/dma/shdma.txt b/Documentation/devicetree/bindings/dma/shdma.txt index 2a3f3b8946b9..a91920a49433 100644 --- a/Documentation/devicetree/bindings/dma/shdma.txt +++ b/Documentation/devicetree/bindings/dma/shdma.txt @@ -1,6 +1,6 @@ * SHDMA Device Tree bindings -Sh-/r-mobile and r-car systems often have multiple identical DMA controller +Sh-/r-mobile and R-Car systems often have multiple identical DMA controller instances, capable of serving any of a common set of DMA slave devices, using the same configuration. To describe this topology we require all compatible SHDMA DT nodes to be placed under a DMA multiplexer node. All such compatible diff --git a/Documentation/devicetree/bindings/fsi/fsi-master-gpio.txt b/Documentation/devicetree/bindings/fsi/fsi-master-gpio.txt new file mode 100644 index 000000000000..a767259dedad --- /dev/null +++ b/Documentation/devicetree/bindings/fsi/fsi-master-gpio.txt @@ -0,0 +1,24 @@ +Device-tree bindings for gpio-based FSI master driver +----------------------------------------------------- + +Required properties: + - compatible = "fsi-master-gpio"; + - clock-gpios = <gpio-descriptor>; : GPIO for FSI clock + - data-gpios = <gpio-descriptor>; : GPIO for FSI data signal + +Optional properties: + - enable-gpios = <gpio-descriptor>; : GPIO for enable signal + - trans-gpios = <gpio-descriptor>; : GPIO for voltage translator enable + - mux-gpios = <gpio-descriptor>; : GPIO for pin multiplexing with other + functions (eg, external FSI masters) + +Examples: + + fsi-master { + compatible = "fsi-master-gpio", "fsi-master"; + clock-gpios = <&gpio 0>; + data-gpios = <&gpio 1>; + enable-gpios = <&gpio 2>; + trans-gpios = <&gpio 3>; + mux-gpios = <&gpio 4>; + } diff --git a/Documentation/devicetree/bindings/gpio/gpio-mvebu.txt b/Documentation/devicetree/bindings/gpio/gpio-mvebu.txt index 42c3bb2d53e8..38ca2201e8ae 100644 --- a/Documentation/devicetree/bindings/gpio/gpio-mvebu.txt +++ b/Documentation/devicetree/bindings/gpio/gpio-mvebu.txt @@ -2,17 +2,27 @@ Required properties: -- compatible : Should be "marvell,orion-gpio", "marvell,mv78200-gpio" - or "marvell,armadaxp-gpio". "marvell,orion-gpio" should be used for - Orion, Kirkwood, Dove, Discovery (except MV78200) and Armada - 370. "marvell,mv78200-gpio" should be used for the Discovery - MV78200. "marvel,armadaxp-gpio" should be used for all Armada XP - SoCs (MV78230, MV78260, MV78460). +- compatible : Should be "marvell,orion-gpio", "marvell,mv78200-gpio", + "marvell,armadaxp-gpio" or "marvell,armada-8k-gpio". + + "marvell,orion-gpio" should be used for Orion, Kirkwood, Dove, + Discovery (except MV78200) and Armada 370. "marvell,mv78200-gpio" + should be used for the Discovery MV78200. + + "marvel,armadaxp-gpio" should be used for all Armada XP SoCs + (MV78230, MV78260, MV78460). + + "marvell,armada-8k-gpio" should be used for the Armada 7K and 8K + SoCs (either from AP or CP), see + Documentation/devicetree/bindings/arm/marvell/cp110-system-controller0.txt + and + Documentation/devicetree/bindings/arm/marvell/ap806-system-controller.txt + for specific details about the offset property. - reg: Address and length of the register set for the device. Only one entry is expected, except for the "marvell,armadaxp-gpio" variant for which two entries are expected: one for the general registers, - one for the per-cpu registers. + one for the per-cpu registers. Not used for marvell,armada-8k-gpio. - interrupts: The list of interrupts that are used for all the pins managed by this GPIO bank. There can be more than one interrupt @@ -41,9 +51,9 @@ Required properties: Optional properties: In order to use the GPIO lines in PWM mode, some additional optional -properties are required. Only Armada 370 and XP support these properties. +properties are required. -- compatible: Must contain "marvell,armada-370-xp-gpio" +- compatible: Must contain "marvell,armada-370-gpio" - reg: an additional register set is needed, for the GPIO Blink Counter on/off registers. @@ -71,7 +81,7 @@ Example: }; gpio1: gpio@18140 { - compatible = "marvell,armada-370-xp-gpio"; + compatible = "marvell,armada-370-gpio"; reg = <0x18140 0x40>, <0x181c8 0x08>; reg-names = "gpio", "pwm"; ngpios = <17>; diff --git a/Documentation/devicetree/bindings/gpio/gpio.txt b/Documentation/devicetree/bindings/gpio/gpio.txt index 84ede036f73d..802402f6cc5d 100644 --- a/Documentation/devicetree/bindings/gpio/gpio.txt +++ b/Documentation/devicetree/bindings/gpio/gpio.txt @@ -74,11 +74,14 @@ GPIO pin number, and GPIO flags as accepted by the "qe_pio_e" gpio-controller. Optional standard bitfield specifiers for the last cell: - Bit 0: 0 means active high, 1 means active low -- Bit 1: 1 means single-ended wiring, see: +- Bit 1: 0 mean push-pull wiring, see: + https://en.wikipedia.org/wiki/Push-pull_output + 1 means single-ended wiring, see: https://en.wikipedia.org/wiki/Single-ended_triode - When used with active-low, this means open drain/collector, see: +- Bit 2: 0 means open-source, 1 means open drain, see: https://en.wikipedia.org/wiki/Open_collector - When used with active-high, this means open source/emitter +- Bit 3: 0 means the output should be maintained during sleep/low-power mode + 1 means the output state can be lost during sleep/low-power mode 1.1) GPIO specifier best practices ---------------------------------- @@ -282,8 +285,8 @@ Example 1: }; Here, a single GPIO controller has GPIOs 0..9 routed to pin controller -pinctrl1's pins 20..29, and GPIOs 10..19 routed to pin controller pinctrl2's -pins 50..59. +pinctrl1's pins 20..29, and GPIOs 10..29 routed to pin controller pinctrl2's +pins 50..69. Example 2: diff --git a/Documentation/devicetree/bindings/gpio/gpio_atmel.txt b/Documentation/devicetree/bindings/gpio/gpio_atmel.txt index 85f8c0d084fa..29416f9c3220 100644 --- a/Documentation/devicetree/bindings/gpio/gpio_atmel.txt +++ b/Documentation/devicetree/bindings/gpio/gpio_atmel.txt @@ -5,9 +5,13 @@ Required properties: - reg: Should contain GPIO controller registers location and length - interrupts: Should be the port interrupt shared by all the pins. - #gpio-cells: Should be two. The first cell is the pin number and - the second cell is used to specify optional parameters (currently - unused). + the second cell is used to specify optional parameters to declare if the GPIO + is active high or low. See gpio.txt. - gpio-controller: Marks the device node as a GPIO controller. +- interrupt-controller: Marks the device node as an interrupt controller. +- #interrupt-cells: Should be two. The first cell is the pin number and the + second cell is used to specify irq type flags, see the two cell description + in interrupt-controller/interrupts.txt for details. optional properties: - #gpio-lines: Number of gpio if absent 32. @@ -21,5 +25,7 @@ Example: #gpio-cells = <2>; gpio-controller; #gpio-lines = <19>; + interrupt-controller; + #interrupt-cells = <2>; }; diff --git a/Documentation/devicetree/bindings/gpio/ingenic,gpio.txt b/Documentation/devicetree/bindings/gpio/ingenic,gpio.txt new file mode 100644 index 000000000000..7988aeb725f4 --- /dev/null +++ b/Documentation/devicetree/bindings/gpio/ingenic,gpio.txt @@ -0,0 +1,46 @@ +Ingenic jz47xx GPIO controller + +That the Ingenic GPIO driver node must be a sub-node of the Ingenic pinctrl +driver node. + +Required properties: +-------------------- + + - compatible: Must contain one of: + - "ingenic,jz4740-gpio" + - "ingenic,jz4770-gpio" + - "ingenic,jz4780-gpio" + - reg: The GPIO bank number. + - interrupt-controller: Marks the device node as an interrupt controller. + - interrupts: Interrupt specifier for the controllers interrupt. + - #interrupt-cells: Should be 2. Refer to + ../interrupt-controller/interrupts.txt for more details. + - gpio-controller: Marks the device node as a GPIO controller. + - #gpio-cells: Should be 2. The first cell is the GPIO number and the second + cell specifies GPIO flags, as defined in <dt-bindings/gpio/gpio.h>. Only the + GPIO_ACTIVE_HIGH and GPIO_ACTIVE_LOW flags are supported. + - gpio-ranges: Range of pins managed by the GPIO controller. Refer to + 'gpio.txt' in this directory for more details. + +Example: +-------- + +&pinctrl { + #address-cells = <1>; + #size-cells = <0>; + + gpa: gpio@0 { + compatible = "ingenic,jz4740-gpio"; + reg = <0>; + + gpio-controller; + gpio-ranges = <&pinctrl 0 0 32>; + #gpio-cells = <2>; + + interrupt-controller; + #interrupt-cells = <2>; + + interrupt-parent = <&intc>; + interrupts = <28>; + }; +}; diff --git a/Documentation/devicetree/bindings/gpio/renesas,gpio-rcar.txt b/Documentation/devicetree/bindings/gpio/renesas,gpio-rcar.txt index 7c1ab3b3254f..6826a371fb69 100644 --- a/Documentation/devicetree/bindings/gpio/renesas,gpio-rcar.txt +++ b/Documentation/devicetree/bindings/gpio/renesas,gpio-rcar.txt @@ -3,6 +3,7 @@ Required Properties: - compatible: should contain one of the following. + - "renesas,gpio-r8a7743": for R8A7743 (RZ/G1M) compatible GPIO controller. - "renesas,gpio-r8a7778": for R8A7778 (R-Mobile M1) compatible GPIO controller. - "renesas,gpio-r8a7779": for R8A7779 (R-Car H1) compatible GPIO controller. - "renesas,gpio-r8a7790": for R8A7790 (R-Car H2) compatible GPIO controller. diff --git a/Documentation/devicetree/bindings/gpu/arm,mali-midgard.txt b/Documentation/devicetree/bindings/gpu/arm,mali-midgard.txt new file mode 100644 index 000000000000..d3b6e1a4713a --- /dev/null +++ b/Documentation/devicetree/bindings/gpu/arm,mali-midgard.txt @@ -0,0 +1,86 @@ +ARM Mali Midgard GPU +==================== + +Required properties: + +- compatible : + * Must contain one of the following: + + "arm,mali-t604" + + "arm,mali-t624" + + "arm,mali-t628" + + "arm,mali-t720" + + "arm,mali-t760" + + "arm,mali-t820" + + "arm,mali-t830" + + "arm,mali-t860" + + "arm,mali-t880" + * which must be preceded by one of the following vendor specifics: + + "amlogic,meson-gxm-mali" + + "rockchip,rk3288-mali" + +- reg : Physical base address of the device and length of the register area. + +- interrupts : Contains the three IRQ lines required by Mali Midgard devices. + +- interrupt-names : Contains the names of IRQ resources in the order they were + provided in the interrupts property. Must contain: "job", "mmu", "gpu". + + +Optional properties: + +- clocks : Phandle to clock for the Mali Midgard device. + +- mali-supply : Phandle to regulator for the Mali device. Refer to + Documentation/devicetree/bindings/regulator/regulator.txt for details. + +- operating-points-v2 : Refer to Documentation/devicetree/bindings/power/opp.txt + for details. + + +Example for a Mali-T760: + +gpu@ffa30000 { + compatible = "rockchip,rk3288-mali", "arm,mali-t760", "arm,mali-midgard"; + reg = <0xffa30000 0x10000>; + interrupts = <GIC_SPI 6 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 7 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 8 IRQ_TYPE_LEVEL_HIGH>; + interrupt-names = "job", "mmu", "gpu"; + clocks = <&cru ACLK_GPU>; + mali-supply = <&vdd_gpu>; + operating-points-v2 = <&gpu_opp_table>; + power-domains = <&power RK3288_PD_GPU>; +}; + +gpu_opp_table: opp_table0 { + compatible = "operating-points-v2"; + + opp@533000000 { + opp-hz = /bits/ 64 <533000000>; + opp-microvolt = <1250000>; + }; + opp@450000000 { + opp-hz = /bits/ 64 <450000000>; + opp-microvolt = <1150000>; + }; + opp@400000000 { + opp-hz = /bits/ 64 <400000000>; + opp-microvolt = <1125000>; + }; + opp@350000000 { + opp-hz = /bits/ 64 <350000000>; + opp-microvolt = <1075000>; + }; + opp@266000000 { + opp-hz = /bits/ 64 <266000000>; + opp-microvolt = <1025000>; + }; + opp@160000000 { + opp-hz = /bits/ 64 <160000000>; + opp-microvolt = <925000>; + }; + opp@100000000 { + opp-hz = /bits/ 64 <100000000>; + opp-microvolt = <912500>; + }; +}; diff --git a/Documentation/devicetree/bindings/graph.txt b/Documentation/devicetree/bindings/graph.txt index fcb1c6a4787b..0415e2c53ba0 100644 --- a/Documentation/devicetree/bindings/graph.txt +++ b/Documentation/devicetree/bindings/graph.txt @@ -34,7 +34,7 @@ remote device, an 'endpoint' child node must be provided for each link. If more than one port is present in a device node or there is more than one endpoint at a port, or a port node needs to be associated with a selected hardware interface, a common scheme using '#address-cells', '#size-cells' -and 'reg' properties is used number the nodes. +and 'reg' properties is used to number the nodes. device { ... @@ -89,9 +89,9 @@ Links between endpoints Each endpoint should contain a 'remote-endpoint' phandle property that points to the corresponding endpoint in the port of the remote device. In turn, the -remote endpoint should contain a 'remote-endpoint' property. If it has one, -it must not point to another than the local endpoint. Two endpoints with their -'remote-endpoint' phandles pointing at each other form a link between the +remote endpoint should contain a 'remote-endpoint' property. If it has one, it +must not point to anything other than the local endpoint. Two endpoints with +their 'remote-endpoint' phandles pointing at each other form a link between the containing ports. device-1 { @@ -110,13 +110,12 @@ device-2 { }; }; - Required properties ------------------- If there is more than one 'port' or more than one 'endpoint' node or 'reg' -property is present in port and/or endpoint nodes the following properties -are required in a relevant parent node: +property present in the port and/or endpoint nodes then the following +properties are required in a relevant parent node: - #address-cells : number of cells required to define port/endpoint identifier, should be 1. diff --git a/Documentation/devicetree/bindings/hwlock/sprd-hwspinlock.txt b/Documentation/devicetree/bindings/hwlock/sprd-hwspinlock.txt new file mode 100644 index 000000000000..581db9d941ba --- /dev/null +++ b/Documentation/devicetree/bindings/hwlock/sprd-hwspinlock.txt @@ -0,0 +1,23 @@ +SPRD Hardware Spinlock Device Binding +------------------------------------- + +Required properties : +- compatible : should be "sprd,hwspinlock-r3p0". +- reg : the register address of hwspinlock. +- #hwlock-cells : hwlock users only use the hwlock id to represent a specific + hwlock, so the number of cells should be <1> here. +- clock-names : Must contain "enable". +- clocks : Must contain a phandle entry for the clock in clock-names, see the + common clock bindings. + +Please look at the generic hwlock binding for usage information for consumers, +"Documentation/devicetree/bindings/hwlock/hwlock.txt" + +Example of hwlock provider: + hwspinlock@40500000 { + compatible = "sprd,hwspinlock-r3p0"; + reg = <0 0x40500000 0 0x1000>; + #hwlock-cells = <1>; + clock-names = "enable"; + clocks = <&clk_aon_apb_gates0 22>; + }; diff --git a/Documentation/devicetree/bindings/i2c/i2c-aspeed.txt b/Documentation/devicetree/bindings/i2c/i2c-aspeed.txt new file mode 100644 index 000000000000..bd6480b19535 --- /dev/null +++ b/Documentation/devicetree/bindings/i2c/i2c-aspeed.txt @@ -0,0 +1,48 @@ +Device tree configuration for the I2C busses on the AST24XX and AST25XX SoCs. + +Required Properties: +- #address-cells : should be 1 +- #size-cells : should be 0 +- reg : address offset and range of bus +- compatible : should be "aspeed,ast2400-i2c-bus" + or "aspeed,ast2500-i2c-bus" +- clocks : root clock of bus, should reference the APB + clock +- interrupts : interrupt number +- interrupt-parent : interrupt controller for bus, should reference a + aspeed,ast2400-i2c-ic or aspeed,ast2500-i2c-ic + interrupt controller + +Optional Properties: +- bus-frequency : frequency of the bus clock in Hz defaults to 100 kHz when not + specified +- multi-master : states that there is another master active on this bus. + +Example: + +i2c { + compatible = "simple-bus"; + #address-cells = <1>; + #size-cells = <1>; + ranges = <0 0x1e78a000 0x1000>; + + i2c_ic: interrupt-controller@0 { + #interrupt-cells = <1>; + compatible = "aspeed,ast2400-i2c-ic"; + reg = <0x0 0x40>; + interrupts = <12>; + interrupt-controller; + }; + + i2c0: i2c-bus@40 { + #address-cells = <1>; + #size-cells = <0>; + #interrupt-cells = <1>; + reg = <0x40 0x40>; + compatible = "aspeed,ast2400-i2c-bus"; + clocks = <&clk_apb>; + bus-frequency = <100000>; + interrupts = <0>; + interrupt-parent = <&i2c_ic>; + }; +}; diff --git a/Documentation/devicetree/bindings/i2c/i2c-designware.txt b/Documentation/devicetree/bindings/i2c/i2c-designware.txt index fee26dc3e858..fbb0a6d8b964 100644 --- a/Documentation/devicetree/bindings/i2c/i2c-designware.txt +++ b/Documentation/devicetree/bindings/i2c/i2c-designware.txt @@ -20,7 +20,7 @@ Optional properties : - i2c-sda-falling-time-ns : should contain the SDA falling time in nanoseconds. This value which is by default 300ns is used to compute the tHIGH period. -Example : +Examples : i2c@f0000 { #address-cells = <1>; @@ -43,3 +43,17 @@ Example : i2c-sda-falling-time-ns = <300>; i2c-scl-falling-time-ns = <300>; }; + + i2c@1120000 { + #address-cells = <1>; + #size-cells = <0>; + reg = <0x2000 0x100>; + clock-frequency = <400000>; + clocks = <&i2cclk>; + interrupts = <0>; + + eeprom@64 { + compatible = "linux,slave-24c02"; + reg = <0x40000064>; + }; + }; diff --git a/Documentation/devicetree/bindings/i2c/i2c-mt6577.txt b/Documentation/devicetree/bindings/i2c/i2c-mtk.txt index 0ce6fa3242f0..bd5a7befd951 100644 --- a/Documentation/devicetree/bindings/i2c/i2c-mt6577.txt +++ b/Documentation/devicetree/bindings/i2c/i2c-mtk.txt @@ -4,11 +4,11 @@ The Mediatek's I2C controller is used to interface with I2C devices. Required properties: - compatible: value should be either of the following. - (a) "mediatek,mt6577-i2c", for i2c compatible with mt6577 i2c. - (b) "mediatek,mt6589-i2c", for i2c compatible with mt6589 i2c. - (c) "mediatek,mt8127-i2c", for i2c compatible with mt8127 i2c. - (d) "mediatek,mt8135-i2c", for i2c compatible with mt8135 i2c. - (e) "mediatek,mt8173-i2c", for i2c compatible with mt8173 i2c. + "mediatek,mt2701-i2c", "mediatek,mt6577-i2c": for Mediatek mt2701 + "mediatek,mt6577-i2c": for i2c compatible with mt6577. + "mediatek,mt6589-i2c": for i2c compatible with mt6589. + "mediatek,mt7623-i2c", "mediatek,mt6577-i2c": for i2c compatible with mt7623. + "mediatek,mt8173-i2c": for i2c compatible with mt8173. - reg: physical base address of the controller and dma base, length of memory mapped region. - interrupts: interrupt number to the cpu. diff --git a/Documentation/devicetree/bindings/i2c/i2c-mux-gpmux.txt b/Documentation/devicetree/bindings/i2c/i2c-mux-gpmux.txt new file mode 100644 index 000000000000..2907dab56298 --- /dev/null +++ b/Documentation/devicetree/bindings/i2c/i2c-mux-gpmux.txt @@ -0,0 +1,99 @@ +General Purpose I2C Bus Mux + +This binding describes an I2C bus multiplexer that uses a mux controller +from the mux subsystem to route the I2C signals. + + .-----. .-----. + | dev | | dev | + .------------. '-----' '-----' + | SoC | | | + | | .--------+--------' + | .------. | .------+ child bus A, on MUX value set to 0 + | | I2C |-|--| Mux | + | '------' | '--+---+ child bus B, on MUX value set to 1 + | .------. | | '----------+--------+--------. + | | MUX- | | | | | | + | | Ctrl |-|-----+ .-----. .-----. .-----. + | '------' | | dev | | dev | | dev | + '------------' '-----' '-----' '-----' + +Required properties: +- compatible: i2c-mux +- i2c-parent: The phandle of the I2C bus that this multiplexer's master-side + port is connected to. +- mux-controls: The phandle of the mux controller to use for operating the + mux. +* Standard I2C mux properties. See i2c-mux.txt in this directory. +* I2C child bus nodes. See i2c-mux.txt in this directory. The sub-bus number + is also the mux-controller state described in ../mux/mux-controller.txt + +Optional properties: +- mux-locked: If present, explicitly allow unrelated I2C transactions on the + parent I2C adapter at these times: + + during setup of the multiplexer + + between setup of the multiplexer and the child bus I2C transaction + + between the child bus I2C transaction and releasing of the multiplexer + + during releasing of the multiplexer + However, I2C transactions to devices behind all I2C multiplexers connected + to the same parent adapter that this multiplexer is connected to are blocked + for the full duration of the complete multiplexed I2C transaction (i.e. + including the times covered by the above list). + If mux-locked is not present, the multiplexer is assumed to be parent-locked. + This means that no unrelated I2C transactions are allowed on the parent I2C + adapter for the complete multiplexed I2C transaction. + The properties of mux-locked and parent-locked multiplexers are discussed + in more detail in Documentation/i2c/i2c-topology. + +For each i2c child node, an I2C child bus will be created. They will +be numbered based on their order in the device tree. + +Whenever an access is made to a device on a child bus, the value set +in the relevant node's reg property will be set as the state in the +mux controller. + +Example: + mux: mux-controller { + compatible = "gpio-mux"; + #mux-control-cells = <0>; + + mux-gpios = <&pioA 0 GPIO_ACTIVE_HIGH>, + <&pioA 1 GPIO_ACTIVE_HIGH>; + }; + + i2c-mux { + compatible = "i2c-mux"; + mux-locked; + i2c-parent = <&i2c1>; + + mux-controls = <&mux>; + + #address-cells = <1>; + #size-cells = <0>; + + i2c@1 { + reg = <1>; + #address-cells = <1>; + #size-cells = <0>; + + ssd1307: oled@3c { + compatible = "solomon,ssd1307fb-i2c"; + reg = <0x3c>; + pwms = <&pwm 4 3000>; + reset-gpios = <&gpio2 7 1>; + reset-active-low; + }; + }; + + i2c@3 { + reg = <3>; + #address-cells = <1>; + #size-cells = <0>; + + pca9555: pca9555@20 { + compatible = "nxp,pca9555"; + gpio-controller; + #gpio-cells = <2>; + reg = <0x20>; + }; + }; + }; diff --git a/Documentation/devicetree/bindings/i2c/i2c-pca-platform.txt b/Documentation/devicetree/bindings/i2c/i2c-pca-platform.txt new file mode 100644 index 000000000000..f1f3876bb8e8 --- /dev/null +++ b/Documentation/devicetree/bindings/i2c/i2c-pca-platform.txt @@ -0,0 +1,29 @@ +* NXP PCA PCA9564/PCA9665 I2C controller + +The PCA9564/PCA9665 serves as an interface between most standard +parallel-bus microcontrollers/microprocessors and the serial I2C-bus +and allows the parallel bus system to communicate bi-directionally +with the I2C-bus. + +Required properties : + + - reg : Offset and length of the register set for the device + - compatible : one of "nxp,pca9564" or "nxp,pca9665" + +Optional properties + - interrupts : the interrupt number + - interrupt-parent : the phandle for the interrupt controller. + If an interrupt is not specified polling will be used. + - reset-gpios : gpio specifier for gpio connected to RESET_N pin. As the line + is active low, it should be marked GPIO_ACTIVE_LOW. + - clock-frequency : I2C bus frequency. + +Example: + i2c0: i2c@80000 { + compatible = "nxp,pca9564"; + #address-cells = <1>; + #size-cells = <0>; + reg = <0x80000 0x4>; + reset-gpios = <&gpio1 0 GPIO_ACTIVE_LOW>; + clock-frequency = <100000>; + }; diff --git a/Documentation/devicetree/bindings/i2c/i2c-zx2967.txt b/Documentation/devicetree/bindings/i2c/i2c-zx2967.txt new file mode 100644 index 000000000000..cb806d1ae4c9 --- /dev/null +++ b/Documentation/devicetree/bindings/i2c/i2c-zx2967.txt @@ -0,0 +1,22 @@ +ZTE zx2967 I2C controller + +Required properties: + - compatible: must be "zte,zx296718-i2c" + - reg: physical address and length of the device registers + - interrupts: a single interrupt specifier + - clocks: clock for the device + - #address-cells: should be <1> + - #size-cells: should be <0> + - clock-frequency: the desired I2C bus clock frequency. + +Examples: + + i2c@112000 { + compatible = "zte,zx296718-i2c"; + reg = <0x00112000 0x1000>; + interrupts = <GIC_SPI 112 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&osc24m>; + #address-cells = <1> + #size-cells = <0>; + clock-frequency = <1600000>; + }; diff --git a/Documentation/devicetree/bindings/iio/adc/amlogic,meson-saradc.txt b/Documentation/devicetree/bindings/iio/adc/amlogic,meson-saradc.txt index 047189192aec..f413e82c8b83 100644 --- a/Documentation/devicetree/bindings/iio/adc/amlogic,meson-saradc.txt +++ b/Documentation/devicetree/bindings/iio/adc/amlogic,meson-saradc.txt @@ -2,6 +2,8 @@ Required properties: - compatible: depending on the SoC this should be one of: + - "amlogic,meson8-saradc" for Meson8 + - "amlogic,meson8b-saradc" for Meson8b - "amlogic,meson-gxbb-saradc" for GXBB - "amlogic,meson-gxl-saradc" for GXL - "amlogic,meson-gxm-saradc" for GXM diff --git a/Documentation/devicetree/bindings/iio/adc/renesas,gyroadc.txt b/Documentation/devicetree/bindings/iio/adc/renesas,gyroadc.txt index f5b0adae6010..df5b9f2ad8d8 100644 --- a/Documentation/devicetree/bindings/iio/adc/renesas,gyroadc.txt +++ b/Documentation/devicetree/bindings/iio/adc/renesas,gyroadc.txt @@ -1,4 +1,4 @@ -* Renesas RCar GyroADC device driver +* Renesas R-Car GyroADC device driver The GyroADC block is a reduced SPI block with up to 8 chipselect lines, which supports the SPI protocol of a selected few SPI ADCs. The SPI ADCs @@ -16,8 +16,7 @@ Required properties: - clocks: References to all the clocks specified in the clock-names property as specified in Documentation/devicetree/bindings/clock/clock-bindings.txt. -- clock-names: Shall contain "fck" and "if". The "fck" is the GyroADC block - clock, the "if" is the interface clock. +- clock-names: Shall contain "fck". The "fck" is the GyroADC block clock. - power-domains: Must contain a reference to the PM domain, if available. - #address-cells: Should be <1> (setting for the subnodes) for all ADCs except for "fujitsu,mb88101a". Should be <0> (setting for @@ -75,8 +74,8 @@ Example: adc@e6e54000 { compatible = "renesas,r8a7791-gyroadc", "renesas,rcar-gyroadc"; reg = <0 0xe6e54000 0 64>; - clocks = <&mstp9_clks R8A7791_CLK_GYROADC>, <&clk_65m>; - clock-names = "fck", "if"; + clocks = <&mstp9_clks R8A7791_CLK_GYROADC>; + clock-names = "fck"; power-domains = <&sysc R8A7791_PD_ALWAYS_ON>; pinctrl-0 = <&adc_pins>; diff --git a/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt b/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt index e35f9f1b3200..8310073f14e1 100644 --- a/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt +++ b/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt @@ -21,11 +21,19 @@ own configurable sequence and trigger: Contents of a stm32 adc root node: ----------------------------------- Required properties: -- compatible: Should be "st,stm32f4-adc-core". +- compatible: Should be one of: + "st,stm32f4-adc-core" + "st,stm32h7-adc-core" - reg: Offset and length of the ADC block register set. - interrupts: Must contain the interrupt for ADC block. -- clocks: Clock for the analog circuitry (common to all ADCs). -- clock-names: Must be "adc". +- clocks: Core can use up to two clocks, depending on part used: + - "adc" clock: for the analog circuitry, common to all ADCs. + It's required on stm32f4. + It's optional on stm32h7. + - "bus" clock: for registers access, common to all ADCs. + It's not present on stm32f4. + It's required on stm32h7. +- clock-names: Must be "adc" and/or "bus" depending on part used. - interrupt-controller: Identifies the controller node as interrupt-parent - vref-supply: Phandle to the vref input analog reference voltage. - #interrupt-cells = <1>; @@ -42,14 +50,18 @@ An ADC block node should contain at least one subnode, representing an ADC instance available on the machine. Required properties: -- compatible: Should be "st,stm32f4-adc". +- compatible: Should be one of: + "st,stm32f4-adc" + "st,stm32h7-adc" - reg: Offset of ADC instance in ADC block (e.g. may be 0x0, 0x100, 0x200). -- clocks: Input clock private to this ADC instance. +- clocks: Input clock private to this ADC instance. It's required only on + stm32f4, that has per instance clock input for registers access. - interrupt-parent: Phandle to the parent interrupt controller. - interrupts: IRQ Line for the ADC (e.g. may be 0 for adc@0, 1 for adc@100 or 2 for adc@200). - st,adc-channels: List of single-ended channels muxed for this ADC. - It can have up to 16 channels, numbered from 0 to 15 (resp. for in0..in15). + It can have up to 16 channels on stm32f4 or 20 channels on stm32h7, numbered + from 0 to 15 or 19 (resp. for in0..in15 or in0..in19). - #io-channel-cells = <1>: See the IIO bindings section "IIO consumers" in Documentation/devicetree/bindings/iio/iio-bindings.txt @@ -58,7 +70,9 @@ Optional properties: See ../../dma/dma.txt for details. - dma-names: Must be "rx" when dmas property is being used. - assigned-resolution-bits: Resolution (bits) to use for conversions. Must - match device available resolutions (e.g. can be 6, 8, 10 or 12 on stm32f4). + match device available resolutions: + * can be 6, 8, 10 or 12 on stm32f4 + * can be 8, 10, 12, 14 or 16 on stm32h7 Default is maximum resolution if unset. Example: diff --git a/Documentation/devicetree/bindings/iio/adc/ti-adc084s021.txt b/Documentation/devicetree/bindings/iio/adc/ti-adc084s021.txt new file mode 100644 index 000000000000..4259e50620bc --- /dev/null +++ b/Documentation/devicetree/bindings/iio/adc/ti-adc084s021.txt @@ -0,0 +1,19 @@ +* Texas Instruments' ADC084S021 + +Required properties: + - compatible : Must be "ti,adc084s021" + - reg : SPI chip select number for the device + - vref-supply : The regulator supply for ADC reference voltage + - spi-cpol : Per spi-bus bindings + - spi-cpha : Per spi-bus bindings + - spi-max-frequency : Per spi-bus bindings + +Example: +adc@0 { + compatible = "ti,adc084s021"; + reg = <0>; + vref-supply = <&adc_vref>; + spi-cpol; + spi-cpha; + spi-max-frequency = <16000000>; +}; diff --git a/Documentation/devicetree/bindings/iio/adc/ti-adc108s102.txt b/Documentation/devicetree/bindings/iio/adc/ti-adc108s102.txt new file mode 100644 index 000000000000..bbbbb4a9f58f --- /dev/null +++ b/Documentation/devicetree/bindings/iio/adc/ti-adc108s102.txt @@ -0,0 +1,18 @@ +* Texas Instruments' ADC108S102 and ADC128S102 ADC chip + +Required properties: + - compatible: Should be "ti,adc108s102" + - reg: spi chip select number for the device + - vref-supply: The regulator supply for ADC reference voltage + +Recommended properties: + - spi-max-frequency: Definition as per + Documentation/devicetree/bindings/spi/spi-bus.txt + +Example: +adc@0 { + compatible = "ti,adc108s102"; + reg = <0>; + vref-supply = <&vdd_supply>; + spi-max-frequency = <1000000>; +}; diff --git a/Documentation/devicetree/bindings/iio/imu/st_lsm6dsx.txt b/Documentation/devicetree/bindings/iio/imu/st_lsm6dsx.txt index 8305fb05ffda..6f28ff55f3ec 100644 --- a/Documentation/devicetree/bindings/iio/imu/st_lsm6dsx.txt +++ b/Documentation/devicetree/bindings/iio/imu/st_lsm6dsx.txt @@ -13,7 +13,8 @@ Optional properties: "data ready" (valid values: 1 or 2). - interrupt-parent: should be the phandle for the interrupt controller - interrupts: interrupt mapping for IRQ. It should be configured with - flags IRQ_TYPE_LEVEL_HIGH or IRQ_TYPE_EDGE_RISING. + flags IRQ_TYPE_LEVEL_HIGH, IRQ_TYPE_EDGE_RISING, IRQ_TYPE_LEVEL_LOW or + IRQ_TYPE_EDGE_FALLING. Refer to interrupt-controller/interrupts.txt for generic interrupt client node bindings. diff --git a/Documentation/devicetree/bindings/iio/multiplexer/io-channel-mux.txt b/Documentation/devicetree/bindings/iio/multiplexer/io-channel-mux.txt new file mode 100644 index 000000000000..c82794002595 --- /dev/null +++ b/Documentation/devicetree/bindings/iio/multiplexer/io-channel-mux.txt @@ -0,0 +1,39 @@ +I/O channel multiplexer bindings + +If a multiplexer is used to select which hardware signal is fed to +e.g. an ADC channel, these bindings describe that situation. + +Required properties: +- compatible : "io-channel-mux" +- io-channels : Channel node of the parent channel that has multiplexed + input. +- io-channel-names : Should be "parent". +- #address-cells = <1>; +- #size-cells = <0>; +- mux-controls : Mux controller node to use for operating the mux +- channels : List of strings, labeling the mux controller states. + +For each non-empty string in the channels property, an io-channel will +be created. The number of this io-channel is the same as the index into +the list of strings in the channels property, and also matches the mux +controller state. The mux controller state is described in +../mux/mux-controller.txt + +Example: + mux: mux-controller { + compatible = "mux-gpio"; + #mux-control-cells = <0>; + + mux-gpios = <&pioA 0 GPIO_ACTIVE_HIGH>, + <&pioA 1 GPIO_ACTIVE_HIGH>; + }; + + adc-mux { + compatible = "io-channel-mux"; + io-channels = <&adc 0>; + io-channel-names = "parent"; + + mux-controls = <&mux>; + + channels = "sync", "in", "system-regulator"; + }; diff --git a/Documentation/devicetree/bindings/iio/proximity/as3935.txt b/Documentation/devicetree/bindings/iio/proximity/as3935.txt index ae23dd8da736..38d74314b7ab 100644 --- a/Documentation/devicetree/bindings/iio/proximity/as3935.txt +++ b/Documentation/devicetree/bindings/iio/proximity/as3935.txt @@ -3,6 +3,7 @@ Austrian Microsystems AS3935 Franklin lightning sensor device driver Required properties: - compatible: must be "ams,as3935" - reg: SPI chip select number for the device + - spi-max-frequency: specifies maximum SPI clock frequency - spi-cpha: SPI Mode 1. Refer to spi/spi-bus.txt for generic SPI slave node bindings. - interrupt-parent : should be the phandle for the interrupt controller @@ -21,6 +22,7 @@ Example: as3935@0 { compatible = "ams,as3935"; reg = <0>; + spi-max-frequency = <400000>; spi-cpha; interrupt-parent = <&gpio1>; interrupts = <16 1>; diff --git a/Documentation/devicetree/bindings/input/dlink,dir685-touchkeys.txt b/Documentation/devicetree/bindings/input/dlink,dir685-touchkeys.txt new file mode 100644 index 000000000000..10dec1c57abf --- /dev/null +++ b/Documentation/devicetree/bindings/input/dlink,dir685-touchkeys.txt @@ -0,0 +1,21 @@ +* D-Link DIR-685 Touchkeys + +This is a I2C one-off touchkey controller based on the Cypress Semiconductor +CY8C214 MCU with some firmware in its internal 8KB flash. The circuit +board inside the router is named E119921. + +The touchkey device node should be placed inside an I2C bus node. + +Required properties: +- compatible: must be "dlink,dir685-touchkeys" +- reg: the I2C address of the touchkeys +- interrupts: reference to the interrupt number + +Example: + +touchkeys@26 { + compatible = "dlink,dir685-touchkeys"; + reg = <0x26>; + interrupt-parent = <&gpio0>; + interrupts = <17 IRQ_TYPE_EDGE_FALLING>; +}; diff --git a/Documentation/devicetree/bindings/input/touchscreen/st,stmfts.txt b/Documentation/devicetree/bindings/input/touchscreen/st,stmfts.txt new file mode 100644 index 000000000000..9683595cd0f5 --- /dev/null +++ b/Documentation/devicetree/bindings/input/touchscreen/st,stmfts.txt @@ -0,0 +1,43 @@ +* ST-Microelectronics FingerTip touchscreen controller + +The ST-Microelectronics FingerTip device provides a basic touchscreen +functionality. Along with it the user can enable the touchkey which can work as +a basic HOME and BACK key for phones. + +The driver supports also hovering as an absolute single touch event with x, y, z +coordinates. + +Required properties: +- compatible : must be "st,stmfts" +- reg : I2C slave address, (e.g. 0x49) +- interrupt-parent : the phandle to the interrupt controller which provides + the interrupt +- interrupts : interrupt specification +- avdd-supply : analogic power supply +- vdd-supply : power supply +- touchscreen-size-x : see touchscreen.txt +- touchscreen-size-y : see touchscreen.txt + +Optional properties: +- touch-key-connected : specifies whether the touchkey feature is connected +- ledvdd-supply : power supply to the touch key leds + +Example: + +i2c@00000000 { + + /* ... */ + + touchscreen@49 { + compatible = "st,stmfts"; + reg = <0x49>; + interrupt-parent = <&gpa1>; + interrupts = <1 IRQ_TYPE_NONE>; + touchscreen-size-x = <1599>; + touchscreen-size-y = <2559>; + touch-key-connected; + avdd-supply = <&ldo30_reg>; + vdd-supply = <&ldo31_reg>; + ledvdd-supply = <&ldo33_reg>; + }; +}; diff --git a/Documentation/devicetree/bindings/interrupt-controller/allwinner,sunxi-nmi.txt b/Documentation/devicetree/bindings/interrupt-controller/allwinner,sunxi-nmi.txt index 81cd3692405e..4ae553eb333d 100644 --- a/Documentation/devicetree/bindings/interrupt-controller/allwinner,sunxi-nmi.txt +++ b/Documentation/devicetree/bindings/interrupt-controller/allwinner,sunxi-nmi.txt @@ -3,8 +3,11 @@ Allwinner Sunxi NMI Controller Required properties: -- compatible : should be "allwinner,sun7i-a20-sc-nmi" or - "allwinner,sun6i-a31-sc-nmi" or "allwinner,sun9i-a80-nmi" +- compatible : should be one of the following: + - "allwinner,sun7i-a20-sc-nmi" + - "allwinner,sun6i-a31-sc-nmi" (deprecated) + - "allwinner,sun6i-a31-r-intc" + - "allwinner,sun9i-a80-nmi" - reg : Specifies base physical address and size of the registers. - interrupt-controller : Identifies the node as an interrupt controller - #interrupt-cells : Specifies the number of cells needed to encode an diff --git a/Documentation/devicetree/bindings/interrupt-controller/aspeed,ast2400-i2c-ic.txt b/Documentation/devicetree/bindings/interrupt-controller/aspeed,ast2400-i2c-ic.txt new file mode 100644 index 000000000000..033cc82e5684 --- /dev/null +++ b/Documentation/devicetree/bindings/interrupt-controller/aspeed,ast2400-i2c-ic.txt @@ -0,0 +1,25 @@ +Device tree configuration for the I2C Interrupt Controller on the AST24XX and +AST25XX SoCs. + +Required Properties: +- #address-cells : should be 1 +- #size-cells : should be 1 +- #interrupt-cells : should be 1 +- compatible : should be "aspeed,ast2400-i2c-ic" + or "aspeed,ast2500-i2c-ic" +- reg : address start and range of controller +- interrupts : interrupt number +- interrupt-controller : denotes that the controller receives and fires + new interrupts for child busses + +Example: + +i2c_ic: interrupt-controller@0 { + #address-cells = <1>; + #size-cells = <1>; + #interrupt-cells = <1>; + compatible = "aspeed,ast2400-i2c-ic"; + reg = <0x0 0x40>; + interrupts = <12>; + interrupt-controller; +}; diff --git a/Documentation/devicetree/bindings/interrupt-controller/aspeed,ast2400-vic.txt b/Documentation/devicetree/bindings/interrupt-controller/aspeed,ast2400-vic.txt index 6c6e85324b9d..e3fea0758d25 100644 --- a/Documentation/devicetree/bindings/interrupt-controller/aspeed,ast2400-vic.txt +++ b/Documentation/devicetree/bindings/interrupt-controller/aspeed,ast2400-vic.txt @@ -1,12 +1,13 @@ Aspeed Vectored Interrupt Controller -These bindings are for the Aspeed AST2400 interrupt controller register layout. -The SoC has an legacy register layout, but this driver does not support that -mode of operation. +These bindings are for the Aspeed interrupt controller. The AST2400 and +AST2500 SoC families include a legacy register layout before a re-designed +layout, but the bindings do not prescribe the use of one or the other. Required properties: -- compatible : should be "aspeed,ast2400-vic". +- compatible : "aspeed,ast2400-vic" + "aspeed,ast2500-vic" - interrupt-controller : Identifies the node as an interrupt controller - #interrupt-cells : Specifies the number of cells needed to encode an diff --git a/Documentation/devicetree/bindings/interrupt-controller/marvell,gicp.txt b/Documentation/devicetree/bindings/interrupt-controller/marvell,gicp.txt new file mode 100644 index 000000000000..64a00ceb7da4 --- /dev/null +++ b/Documentation/devicetree/bindings/interrupt-controller/marvell,gicp.txt @@ -0,0 +1,27 @@ +Marvell GICP Controller +----------------------- + +GICP is a Marvell extension of the GIC that allows to trigger GIC SPI +interrupts by doing a memory transaction. It is used by the ICU +located in the Marvell CP110 to turn wired interrupts inside the CP +into GIC SPI interrupts. + +Required properties: + +- compatible: Must be "marvell,ap806-gicp" + +- reg: Must be the address and size of the GICP SPI registers + +- marvell,spi-ranges: tuples of GIC SPI interrupts ranges available + for this GICP + +- msi-controller: indicates that this is an MSI controller + +Example: + +gicp_spi: gicp-spi@3f0040 { + compatible = "marvell,ap806-gicp"; + reg = <0x3f0040 0x10>; + marvell,spi-ranges = <64 64>, <288 64>; + msi-controller; +}; diff --git a/Documentation/devicetree/bindings/interrupt-controller/marvell,icu.txt b/Documentation/devicetree/bindings/interrupt-controller/marvell,icu.txt new file mode 100644 index 000000000000..aa8bf2ec8905 --- /dev/null +++ b/Documentation/devicetree/bindings/interrupt-controller/marvell,icu.txt @@ -0,0 +1,51 @@ +Marvell ICU Interrupt Controller +-------------------------------- + +The Marvell ICU (Interrupt Consolidation Unit) controller is +responsible for collecting all wired-interrupt sources in the CP and +communicating them to the GIC in the AP, the unit translates interrupt +requests on input wires to MSG memory mapped transactions to the GIC. + +Required properties: + +- compatible: Should be "marvell,cp110-icu" + +- reg: Should contain ICU registers location and length. + +- #interrupt-cells: Specifies the number of cells needed to encode an + interrupt source. The value shall be 3. + + The 1st cell is the group type of the ICU interrupt. Possible group + types are: + + ICU_GRP_NSR (0x0) : Shared peripheral interrupt, non-secure + ICU_GRP_SR (0x1) : Shared peripheral interrupt, secure + ICU_GRP_SEI (0x4) : System error interrupt + ICU_GRP_REI (0x5) : RAM error interrupt + + The 2nd cell is the index of the interrupt in the ICU unit. + + The 3rd cell is the type of the interrupt. See arm,gic.txt for + details. + +- interrupt-controller: Identifies the node as an interrupt + controller. + +- msi-parent: Should point to the GICP controller, the GIC extension + that allows to trigger interrupts using MSG memory mapped + transactions. + +Example: + +icu: interrupt-controller@1e0000 { + compatible = "marvell,cp110-icu"; + reg = <0x1e0000 0x10>; + #interrupt-cells = <3>; + interrupt-controller; + msi-parent = <&gicp>; +}; + +usb3h0: usb3@500000 { + interrupt-parent = <&icu>; + interrupts = <ICU_GRP_NSR 106 IRQ_TYPE_LEVEL_HIGH>; +}; diff --git a/Documentation/devicetree/bindings/interrupt-controller/mediatek,sysirq.txt b/Documentation/devicetree/bindings/interrupt-controller/mediatek,sysirq.txt index a89c03bb1a81..11cc87aeb276 100644 --- a/Documentation/devicetree/bindings/interrupt-controller/mediatek,sysirq.txt +++ b/Documentation/devicetree/bindings/interrupt-controller/mediatek,sysirq.txt @@ -1,21 +1,23 @@ -+Mediatek 65xx/67xx/81xx sysirq ++Mediatek MT65xx/MT67xx/MT81xx sysirq Mediatek SOCs sysirq support controllable irq inverter for each GIC SPI interrupt. Required properties: -- compatible: should be one of: - "mediatek,mt8173-sysirq" - "mediatek,mt8135-sysirq" - "mediatek,mt8127-sysirq" - "mediatek,mt6795-sysirq" - "mediatek,mt6755-sysirq" - "mediatek,mt6592-sysirq" - "mediatek,mt6589-sysirq" - "mediatek,mt6582-sysirq" - "mediatek,mt6580-sysirq" - "mediatek,mt6577-sysirq" - "mediatek,mt2701-sysirq" +- compatible: should be + "mediatek,mt8173-sysirq", "mediatek,mt6577-sysirq": for MT8173 + "mediatek,mt8135-sysirq", "mediatek,mt6577-sysirq": for MT8135 + "mediatek,mt8127-sysirq", "mediatek,mt6577-sysirq": for MT8127 + "mediatek,mt7622-sysirq", "mediatek,mt6577-sysirq": for MT7622 + "mediatek,mt6795-sysirq", "mediatek,mt6577-sysirq": for MT6795 + "mediatek,mt6797-sysirq", "mediatek,mt6577-sysirq": for MT6797 + "mediatek,mt6755-sysirq", "mediatek,mt6577-sysirq": for MT6755 + "mediatek,mt6592-sysirq", "mediatek,mt6577-sysirq": for MT6592 + "mediatek,mt6589-sysirq", "mediatek,mt6577-sysirq": for MT6589 + "mediatek,mt6582-sysirq", "mediatek,mt6577-sysirq": for MT6582 + "mediatek,mt6580-sysirq", "mediatek,mt6577-sysirq": for MT6580 + "mediatek,mt6577-sysirq": for MT6577 + "mediatek,mt2701-sysirq", "mediatek,mt6577-sysirq": for MT2701 - interrupt-controller : Identifies the node as an interrupt controller - #interrupt-cells : Use the same format as specified by GIC in arm,gic.txt. - interrupt-parent: phandle of irq parent for sysirq. The parent must diff --git a/Documentation/devicetree/bindings/interrupt-controller/open-pic.txt b/Documentation/devicetree/bindings/interrupt-controller/open-pic.txt index 909a902dff85..ccbbfdc53c72 100644 --- a/Documentation/devicetree/bindings/interrupt-controller/open-pic.txt +++ b/Documentation/devicetree/bindings/interrupt-controller/open-pic.txt @@ -92,7 +92,6 @@ Example 2: * References -[1] Power.org (TM) Standard for Embedded Power Architecture (TM) Platform - Requirements (ePAPR), Version 1.0, July 2008. - (http://www.power.org/resources/downloads/Power_ePAPR_APPROVED_v1.0.pdf) +[1] Devicetree Specification + (https://www.devicetree.org/specifications/) diff --git a/Documentation/devicetree/bindings/iommu/arm,smmu-v3.txt b/Documentation/devicetree/bindings/iommu/arm,smmu-v3.txt index be57550e14e4..c9abbf3e4f68 100644 --- a/Documentation/devicetree/bindings/iommu/arm,smmu-v3.txt +++ b/Documentation/devicetree/bindings/iommu/arm,smmu-v3.txt @@ -26,6 +26,12 @@ the PCIe specification. * "priq" - PRI Queue not empty * "cmdq-sync" - CMD_SYNC complete * "gerror" - Global Error activated + * "combined" - The combined interrupt is optional, + and should only be provided if the + hardware supports just a single, + combined interrupt line. + If provided, then the combined interrupt + will be used in preference to any others. - #iommu-cells : See the generic IOMMU binding described in devicetree/bindings/pci/pci-iommu.txt @@ -49,6 +55,12 @@ the PCIe specification. - hisilicon,broken-prefetch-cmd : Avoid sending CMD_PREFETCH_* commands to the SMMU. +- cavium,cn9900-broken-page1-regspace + : Replaces all page 1 offsets used for EVTQ_PROD/CONS, + PRIQ_PROD/CONS register access with page 0 offsets. + Set for Cavium ThunderX2 silicon that doesn't support + SMMU page1 register space. + ** Example smmu@2b400000 { diff --git a/Documentation/devicetree/bindings/leds/common.txt b/Documentation/devicetree/bindings/leds/common.txt index 24b656014089..1d4afe9644b6 100644 --- a/Documentation/devicetree/bindings/leds/common.txt +++ b/Documentation/devicetree/bindings/leds/common.txt @@ -1,4 +1,4 @@ -Common leds properties. +* Common leds properties. LED and flash LED devices provide the same basic functionality as current regulators, but extended with LED and flash LED specific features like @@ -49,6 +49,22 @@ Optional properties for child nodes: - panic-indicator : This property specifies that the LED should be used, if at all possible, as a panic indicator. +- trigger-sources : List of devices which should be used as a source triggering + this LED activity. Some LEDs can be related to a specific + device and should somehow indicate its state. E.g. USB 2.0 + LED may react to device(s) in a USB 2.0 port(s). + Another common example is switch or router with multiple + Ethernet ports each of them having its own LED assigned + (assuming they are not hardwired). In such cases this + property should contain phandle(s) of related source + device(s). + In many cases LED can be related to more than one device + (e.g. one USB LED vs. multiple USB ports). Each source + should be represented by a node in the device tree and be + referenced by a phandle and a set of phandle arguments. A + length of arguments should be specified by the + #trigger-source-cells property in the source node. + Required properties for flash LED child nodes: - flash-max-microamp : Maximum flash LED supply current in microamperes. - flash-max-timeout-us : Maximum timeout in microseconds after which the flash @@ -59,7 +75,17 @@ property can be omitted. For controllers that have no configurable timeout the flash-max-timeout-us property can be omitted. -Examples: +* Trigger source providers + +Each trigger source should be represented by a device tree node. It may be e.g. +a USB port or an Ethernet device. + +Required properties for trigger source: +- #trigger-source-cells : Number of cells in a source trigger. Typically 0 for + nodes of simple trigger sources (e.g. a specific USB + port). + +* Examples gpio-leds { compatible = "gpio-leds"; @@ -69,6 +95,11 @@ gpio-leds { linux,default-trigger = "heartbeat"; gpios = <&gpio0 0 GPIO_ACTIVE_HIGH>; }; + + usb { + gpios = <&gpio0 1 GPIO_ACTIVE_HIGH>; + trigger-sources = <&ohci_port1>, <&ehci_port1>; + }; }; max77693-led { diff --git a/Documentation/devicetree/bindings/leds/pca963x.txt b/Documentation/devicetree/bindings/leds/pca963x.txt index dfbdb123a9bf..4eee41482041 100644 --- a/Documentation/devicetree/bindings/leds/pca963x.txt +++ b/Documentation/devicetree/bindings/leds/pca963x.txt @@ -10,6 +10,7 @@ Optional properties: - nxp,period-scale : In some configurations, the chip blinks faster than expected. This parameter provides a scaling ratio (fixed point, decimal divided by 1000) to compensate, e.g. 1300=1.3x and 750=0.75x. +- nxp,inverted-out: invert the polarity of the generated PWM Each led is represented as a sub-node of the nxp,pca963x device. diff --git a/Documentation/devicetree/bindings/mailbox/qcom,apcs-kpss-global.txt b/Documentation/devicetree/bindings/mailbox/qcom,apcs-kpss-global.txt new file mode 100644 index 000000000000..fb961c310f44 --- /dev/null +++ b/Documentation/devicetree/bindings/mailbox/qcom,apcs-kpss-global.txt @@ -0,0 +1,46 @@ +Binding for the Qualcomm APCS global block +========================================== + +This binding describes the APCS "global" block found in various Qualcomm +platforms. + +- compatible: + Usage: required + Value type: <string> + Definition: must be one of: + "qcom,msm8916-apcs-kpss-global", + "qcom,msm8996-apcs-hmss-global" + +- reg: + Usage: required + Value type: <prop-encoded-array> + Definition: must specify the base address and size of the global block + +- #mbox-cells: + Usage: required + Value type: <u32> + Definition: as described in mailbox.txt, must be 1 + + += EXAMPLE +The following example describes the APCS HMSS found in MSM8996 and part of the +GLINK RPM referencing the "rpm_hlos" doorbell therein. + + apcs_glb: mailbox@9820000 { + compatible = "qcom,msm8996-apcs-hmss-global"; + reg = <0x9820000 0x1000>; + + #mbox-cells = <1>; + }; + + rpm-glink { + compatible = "qcom,glink-rpm"; + + interrupts = <GIC_SPI 168 IRQ_TYPE_EDGE_RISING>; + + qcom,rpm-msg-ram = <&rpm_msg_ram>; + + mboxes = <&apcs_glb 0>; + mbox-names = "rpm_hlos"; + }; + diff --git a/Documentation/devicetree/bindings/mfd/arizona.txt b/Documentation/devicetree/bindings/mfd/arizona.txt index 8f2e2822238d..b37bdde5cfda 100644 --- a/Documentation/devicetree/bindings/mfd/arizona.txt +++ b/Documentation/devicetree/bindings/mfd/arizona.txt @@ -30,7 +30,8 @@ Required properties: - gpio-controller : Indicates this device is a GPIO controller. - #gpio-cells : Must be 2. The first cell is the pin number and the - second cell is used to specify optional parameters (currently unused). + second cell is used to specify optional parameters, see ../gpio/gpio.txt + for details. - AVDD-supply, DBVDD1-supply, CPVDD-supply : Power supplies for the device, as covered in Documentation/devicetree/bindings/regulator/regulator.txt diff --git a/Documentation/devicetree/bindings/mfd/hi6421.txt b/Documentation/devicetree/bindings/mfd/hi6421.txt index 0d5a4466a494..22da96d344a7 100644 --- a/Documentation/devicetree/bindings/mfd/hi6421.txt +++ b/Documentation/devicetree/bindings/mfd/hi6421.txt @@ -1,7 +1,9 @@ * HI6421 Multi-Functional Device (MFD), by HiSilicon Ltd. Required parent device properties: -- compatible : contains "hisilicon,hi6421-pmic"; +- compatible : One of the following chip-specific strings: + "hisilicon,hi6421-pmic"; + "hisilicon,hi6421v530-pmic"; - reg : register range space of hi6421; Supported Hi6421 sub-devices include: diff --git a/Documentation/devicetree/bindings/mfd/lp87565.txt b/Documentation/devicetree/bindings/mfd/lp87565.txt new file mode 100644 index 000000000000..a48df7c08ab0 --- /dev/null +++ b/Documentation/devicetree/bindings/mfd/lp87565.txt @@ -0,0 +1,43 @@ +TI LP87565 PMIC MFD driver + +Required properties: + - compatible: "ti,lp87565", "ti,lp87565-q1" + - reg: I2C slave address. + - gpio-controller: Marks the device node as a GPIO Controller. + - #gpio-cells: Should be two. The first cell is the pin number and + the second cell is used to specify flags. + See ../gpio/gpio.txt for more information. + - xxx-in-supply: Phandle to parent supply node of each regulator + populated under regulators node. xxx should match + the supply_name populated in driver. +Example: + +lp87565_pmic: pmic@60 { + compatible = "ti,lp87565-q1"; + reg = <0x60>; + gpio-controller; + #gpio-cells = <2>; + + buck10-in-supply = <&vsys_3v3>; + buck23-in-supply = <&vsys_3v3>; + + regulators: regulators { + buck10_reg: buck10 { + /* VDD_MPU */ + regulator-name = "buck10"; + regulator-min-microvolt = <850000>; + regulator-max-microvolt = <1250000>; + regulator-always-on; + regulator-boot-on; + }; + + buck23_reg: buck23 { + /* VDD_GPU */ + regulator-name = "buck23"; + regulator-min-microvolt = <850000>; + regulator-max-microvolt = <1250000>; + regulator-boot-on; + regulator-always-on; + }; + }; +}; diff --git a/Documentation/devicetree/bindings/mfd/stm32-timers.txt b/Documentation/devicetree/bindings/mfd/stm32-timers.txt index bbd083f5600a..1db6e0057a63 100644 --- a/Documentation/devicetree/bindings/mfd/stm32-timers.txt +++ b/Documentation/devicetree/bindings/mfd/stm32-timers.txt @@ -31,7 +31,7 @@ Example: compatible = "st,stm32-timers"; reg = <0x40010000 0x400>; clocks = <&rcc 0 160>; - clock-names = "clk_int"; + clock-names = "int"; pwm { compatible = "st,stm32-pwm"; diff --git a/Documentation/devicetree/bindings/mfd/tps65910.txt b/Documentation/devicetree/bindings/mfd/tps65910.txt index 38833e63a59f..8af1202b381d 100644 --- a/Documentation/devicetree/bindings/mfd/tps65910.txt +++ b/Documentation/devicetree/bindings/mfd/tps65910.txt @@ -61,6 +61,10 @@ Optional properties: There should be 9 entries here, one for each gpio. - ti,system-power-controller: Telling whether or not this pmic is controlling the system power. +- ti,sleep-enable: Enable SLEEP state. +- ti,sleep-keep-therm: Keep thermal monitoring on in sleep state. +- ti,sleep-keep-ck32k: Keep the 32KHz clock output on in sleep state. +- ti,sleep-keep-hsclk: Keep high speed internal clock on in sleep state. Regulator Optional properties: - ti,regulator-ext-sleep-control: enable external sleep diff --git a/Documentation/devicetree/bindings/misc/allwinner,syscon.txt b/Documentation/devicetree/bindings/misc/allwinner,syscon.txt new file mode 100644 index 000000000000..31494a24fe69 --- /dev/null +++ b/Documentation/devicetree/bindings/misc/allwinner,syscon.txt @@ -0,0 +1,20 @@ +* Allwinner sun8i system controller + +This file describes the bindings for the system controller present in +Allwinner SoC H3, A83T and A64. +The principal function of this syscon is to control EMAC PHY choice and +config. + +Required properties for the system controller: +- reg: address and length of the register for the device. +- compatible: should be "syscon" and one of the following string: + "allwinner,sun8i-h3-system-controller" + "allwinner,sun8i-v3s-system-controller" + "allwinner,sun50i-a64-system-controller" + "allwinner,sun8i-a83t-system-controller" + +Example: +syscon: syscon@1c00000 { + compatible = "allwinner,sun8i-h3-system-controller", "syscon"; + reg = <0x01c00000 0x1000>; +}; diff --git a/Documentation/devicetree/bindings/mmc/fsl-esdhc.txt b/Documentation/devicetree/bindings/mmc/fsl-esdhc.txt index dedfb02c744a..a2cf5e1c87d8 100644 --- a/Documentation/devicetree/bindings/mmc/fsl-esdhc.txt +++ b/Documentation/devicetree/bindings/mmc/fsl-esdhc.txt @@ -7,6 +7,20 @@ This file documents differences between the core properties described by mmc.txt and the properties used by the sdhci-esdhc driver. Required properties: + - compatible : should be "fsl,esdhc", or "fsl,<chip>-esdhc". + Possible compatibles for PowerPC: + "fsl,mpc8536-esdhc" + "fsl,mpc8378-esdhc" + "fsl,p2020-esdhc" + "fsl,p4080-esdhc" + "fsl,t1040-esdhc" + "fsl,t4240-esdhc" + Possible compatibles for ARM: + "fsl,ls1012a-esdhc" + "fsl,ls1088a-esdhc" + "fsl,ls1043a-esdhc" + "fsl,ls1046a-esdhc" + "fsl,ls2080a-esdhc" - interrupt-parent : interrupt source phandle. - clock-frequency : specifies eSDHC base clock frequency. diff --git a/Documentation/devicetree/bindings/mmc/k3-dw-mshc.txt b/Documentation/devicetree/bindings/mmc/k3-dw-mshc.txt index df370585cbcc..8af1afcb86dc 100644 --- a/Documentation/devicetree/bindings/mmc/k3-dw-mshc.txt +++ b/Documentation/devicetree/bindings/mmc/k3-dw-mshc.txt @@ -12,6 +12,7 @@ extensions to the Synopsys Designware Mobile Storage Host Controller. Required Properties: * compatible: should be one of the following. + - "hisilicon,hi3660-dw-mshc": for controllers with hi3660 specific extensions. - "hisilicon,hi4511-dw-mshc": for controllers with hi4511 specific extensions. - "hisilicon,hi6220-dw-mshc": for controllers with hi6220 specific extensions. diff --git a/Documentation/devicetree/bindings/mmc/rockchip-dw-mshc.txt b/Documentation/devicetree/bindings/mmc/rockchip-dw-mshc.txt index 520d61dad6dd..49ed3ad2524a 100644 --- a/Documentation/devicetree/bindings/mmc/rockchip-dw-mshc.txt +++ b/Documentation/devicetree/bindings/mmc/rockchip-dw-mshc.txt @@ -15,6 +15,7 @@ Required Properties: - "rockchip,rk3288-dw-mshc": for Rockchip RK3288 - "rockchip,rv1108-dw-mshc", "rockchip,rk3288-dw-mshc": for Rockchip RV1108 - "rockchip,rk3036-dw-mshc", "rockchip,rk3288-dw-mshc": for Rockchip RK3036 + - "rockchip,rk3328-dw-mshc", "rockchip,rk3288-dw-mshc": for Rockchip RK3328 - "rockchip,rk3368-dw-mshc", "rockchip,rk3288-dw-mshc": for Rockchip RK3368 - "rockchip,rk3399-dw-mshc", "rockchip,rk3288-dw-mshc": for Rockchip RK3399 @@ -31,6 +32,10 @@ Optional Properties: probing, low speeds or in case where all phases work at tuning time. If not specified 0 deg will be used. +* rockchip,desired-num-phases: The desired number of times that the host + execute tuning when needed. If not specified, the host will do tuning + for 360 times, namely tuning for each degree. + Example: rkdwmmc0@12200000 { diff --git a/Documentation/devicetree/bindings/mmc/ti-omap-hsmmc.txt b/Documentation/devicetree/bindings/mmc/ti-omap-hsmmc.txt index 74166a0d460d..0e026c151c1c 100644 --- a/Documentation/devicetree/bindings/mmc/ti-omap-hsmmc.txt +++ b/Documentation/devicetree/bindings/mmc/ti-omap-hsmmc.txt @@ -18,7 +18,7 @@ Required properties: Optional properties: ti,dual-volt: boolean, supports dual voltage cards <supply-name>-supply: phandle to the regulator device tree node -"supply-name" examples are "vmmc", "vmmc_aux" etc +"supply-name" examples are "vmmc", "vmmc_aux"(deprecated)/"vqmmc" etc ti,non-removable: non-removable slot (like eMMC) ti,needs-special-reset: Requires a special softreset sequence ti,needs-special-hs-handling: HSMMC IP needs special setting for handling High Speed diff --git a/Documentation/devicetree/bindings/mtd/atmel-nand.txt b/Documentation/devicetree/bindings/mtd/atmel-nand.txt index f6bee57e453a..9bb66e476672 100644 --- a/Documentation/devicetree/bindings/mtd/atmel-nand.txt +++ b/Documentation/devicetree/bindings/mtd/atmel-nand.txt @@ -59,8 +59,22 @@ Required properties: - reg: should contain 2 register ranges. The first one is pointing to the PMECC block, and the second one to the PMECC_ERRLOC block. +* SAMA5 NFC I/O bindings: + +SAMA5 SoCs embed an advanced NAND controller logic to automate READ/WRITE page +operations. This interface to this logic is placed in a separate I/O range and +should thus have its own DT node. + +- compatible: should be "atmel,sama5d3-nfc-io", "syscon". +- reg: should contain the I/O range used to interact with the NFC logic. + Example: + nfc_io: nfc-io@70000000 { + compatible = "atmel,sama5d3-nfc-io", "syscon"; + reg = <0x70000000 0x8000000>; + }; + pmecc: ecc-engine@ffffc070 { compatible = "atmel,at91sam9g45-pmecc"; reg = <0xffffc070 0x490>, diff --git a/Documentation/devicetree/bindings/mtd/denali-nand.txt b/Documentation/devicetree/bindings/mtd/denali-nand.txt index e593bbeb2115..504291d2e5c2 100644 --- a/Documentation/devicetree/bindings/mtd/denali-nand.txt +++ b/Documentation/devicetree/bindings/mtd/denali-nand.txt @@ -3,10 +3,23 @@ Required properties: - compatible : should be one of the following: "altr,socfpga-denali-nand" - for Altera SOCFPGA + "socionext,uniphier-denali-nand-v5a" - for Socionext UniPhier (v5a) + "socionext,uniphier-denali-nand-v5b" - for Socionext UniPhier (v5b) - reg : should contain registers location and length for data and reg. - reg-names: Should contain the reg names "nand_data" and "denali_reg" - interrupts : The interrupt number. +Optional properties: + - nand-ecc-step-size: see nand.txt for details. If present, the value must be + 512 for "altr,socfpga-denali-nand" + 1024 for "socionext,uniphier-denali-nand-v5a" + 1024 for "socionext,uniphier-denali-nand-v5b" + - nand-ecc-strength: see nand.txt for details. Valid values are: + 8, 15 for "altr,socfpga-denali-nand" + 8, 16, 24 for "socionext,uniphier-denali-nand-v5a" + 8, 16 for "socionext,uniphier-denali-nand-v5b" + - nand-ecc-maximize: see nand.txt for details + The device tree may optionally contain sub-nodes describing partitions of the address space. See partition.txt for more detail. diff --git a/Documentation/devicetree/bindings/mtd/elm.txt b/Documentation/devicetree/bindings/mtd/elm.txt index 8c1528c421d4..59ddc61c1076 100644 --- a/Documentation/devicetree/bindings/mtd/elm.txt +++ b/Documentation/devicetree/bindings/mtd/elm.txt @@ -1,7 +1,7 @@ Error location module Required properties: -- compatible: Must be "ti,am33xx-elm" +- compatible: Must be "ti,am3352-elm" - reg: physical base address and size of the registers map. - interrupts: Interrupt number for the elm. diff --git a/Documentation/devicetree/bindings/mtd/gpmc-nand.txt b/Documentation/devicetree/bindings/mtd/gpmc-nand.txt index 174f68c26c1b..dd559045593d 100644 --- a/Documentation/devicetree/bindings/mtd/gpmc-nand.txt +++ b/Documentation/devicetree/bindings/mtd/gpmc-nand.txt @@ -5,7 +5,7 @@ the GPMC controller with a name of "nand". All timing relevant properties as well as generic gpmc child properties are explained in a separate documents - please refer to -Documentation/devicetree/bindings/bus/ti-gpmc.txt +Documentation/devicetree/bindings/memory-controllers/omap-gpmc.txt For NAND specific properties such as ECC modes or bus width, please refer to Documentation/devicetree/bindings/mtd/nand.txt diff --git a/Documentation/devicetree/bindings/mtd/gpmc-nor.txt b/Documentation/devicetree/bindings/mtd/gpmc-nor.txt index 4828c17bb784..131d3a74d0bd 100644 --- a/Documentation/devicetree/bindings/mtd/gpmc-nor.txt +++ b/Documentation/devicetree/bindings/mtd/gpmc-nor.txt @@ -5,7 +5,7 @@ child nodes of the GPMC controller with a name of "nor". All timing relevant properties as well as generic GPMC child properties are explained in a separate documents. Please refer to -Documentation/devicetree/bindings/bus/ti-gpmc.txt +Documentation/devicetree/bindings/memory-controllers/omap-gpmc.txt Required properties: - bank-width: Width of NOR flash in bytes. GPMC supports 8-bit and @@ -28,7 +28,7 @@ Required properties: Optional properties: - gpmc,XXX Additional GPMC timings and settings parameters. See - Documentation/devicetree/bindings/bus/ti-gpmc.txt + Documentation/devicetree/bindings/memory-controllers/omap-gpmc.txt Optional properties for partition table parsing: - #address-cells: should be set to 1 diff --git a/Documentation/devicetree/bindings/mtd/gpmc-onenand.txt b/Documentation/devicetree/bindings/mtd/gpmc-onenand.txt index 5d8fa527c496..b6e8bfd024f4 100644 --- a/Documentation/devicetree/bindings/mtd/gpmc-onenand.txt +++ b/Documentation/devicetree/bindings/mtd/gpmc-onenand.txt @@ -5,7 +5,7 @@ the GPMC controller with a name of "onenand". All timing relevant properties as well as generic gpmc child properties are explained in a separate documents - please refer to -Documentation/devicetree/bindings/bus/ti-gpmc.txt +Documentation/devicetree/bindings/memory-controllers/omap-gpmc.txt Required properties: diff --git a/Documentation/devicetree/bindings/mtd/gpmi-nand.txt b/Documentation/devicetree/bindings/mtd/gpmi-nand.txt index d02acaff3c35..b289ef3c1b7e 100644 --- a/Documentation/devicetree/bindings/mtd/gpmi-nand.txt +++ b/Documentation/devicetree/bindings/mtd/gpmi-nand.txt @@ -4,7 +4,12 @@ The GPMI nand controller provides an interface to control the NAND flash chips. Required properties: - - compatible : should be "fsl,<chip>-gpmi-nand" + - compatible : should be "fsl,<chip>-gpmi-nand", chip can be: + * imx23 + * imx28 + * imx6q + * imx6sx + * imx7d - reg : should contain registers location and length for gpmi and bch. - reg-names: Should contain the reg names "gpmi-nand" and "bch" - interrupts : BCH interrupt number. @@ -13,6 +18,13 @@ Required properties: and GPMI DMA channel ID. Refer to dma.txt and fsl-mxs-dma.txt for details. - dma-names: Must be "rx-tx". + - clocks : clocks phandle and clock specifier corresponding to each clock + specified in clock-names. + - clock-names : The "gpmi_io" clock is always required. Which clocks are + exactly required depends on chip: + * imx23/imx28 : "gpmi_io" + * imx6q/sx : "gpmi_io", "gpmi_apb", "gpmi_bch", "gpmi_bch_apb", "per1_bch" + * imx7d : "gpmi_io", "gpmi_bch_apb" Optional properties: - nand-on-flash-bbt: boolean to enable on flash bbt option if not diff --git a/Documentation/devicetree/bindings/mtd/microchip,mchp23k256.txt b/Documentation/devicetree/bindings/mtd/microchip,mchp23k256.txt new file mode 100644 index 000000000000..7328eb92a03c --- /dev/null +++ b/Documentation/devicetree/bindings/mtd/microchip,mchp23k256.txt @@ -0,0 +1,18 @@ +* MTD SPI driver for Microchip 23K256 (and similar) serial SRAM + +Required properties: +- #address-cells, #size-cells : Must be present if the device has sub-nodes + representing partitions. +- compatible : Must be one of "microchip,mchp23k256" or "microchip,mchp23lcv1024" +- reg : Chip-Select number +- spi-max-frequency : Maximum frequency of the SPI bus the chip can operate at + +Example: + + spi-sram@0 { + #address-cells = <1>; + #size-cells = <1>; + compatible = "microchip,mchp23k256"; + reg = <0>; + spi-max-frequency = <20000000>; + }; diff --git a/Documentation/devicetree/bindings/mtd/mtk-nand.txt b/Documentation/devicetree/bindings/mtd/mtk-nand.txt index 069c192ed5c2..dbf9e054c11c 100644 --- a/Documentation/devicetree/bindings/mtd/mtk-nand.txt +++ b/Documentation/devicetree/bindings/mtd/mtk-nand.txt @@ -12,7 +12,8 @@ tree nodes. The first part of NFC is NAND Controller Interface (NFI) HW. Required NFI properties: -- compatible: Should be "mediatek,mtxxxx-nfc". +- compatible: Should be one of "mediatek,mt2701-nfc", + "mediatek,mt2712-nfc". - reg: Base physical address and size of NFI. - interrupts: Interrupts of NFI. - clocks: NFI required clocks. @@ -141,7 +142,7 @@ Example: ============== Required BCH properties: -- compatible: Should be "mediatek,mtxxxx-ecc". +- compatible: Should be one of "mediatek,mt2701-ecc", "mediatek,mt2712-ecc". - reg: Base physical address and size of ECC. - interrupts: Interrupts of ECC. - clocks: ECC required clocks. diff --git a/Documentation/devicetree/bindings/mtd/nand.txt b/Documentation/devicetree/bindings/mtd/nand.txt index b05601600083..133f3813719c 100644 --- a/Documentation/devicetree/bindings/mtd/nand.txt +++ b/Documentation/devicetree/bindings/mtd/nand.txt @@ -21,7 +21,7 @@ Optional NAND chip properties: - nand-ecc-mode : String, operation mode of the NAND ecc mode. Supported values are: "none", "soft", "hw", "hw_syndrome", - "hw_oob_first". + "hw_oob_first", "on-die". Deprecated values: "soft_bch": use "soft" and nand-ecc-algo instead - nand-ecc-algo: string, algorithm of NAND ECC. diff --git a/Documentation/devicetree/bindings/mtd/partition.txt b/Documentation/devicetree/bindings/mtd/partition.txt index 81a224da63be..36f3b769a626 100644 --- a/Documentation/devicetree/bindings/mtd/partition.txt +++ b/Documentation/devicetree/bindings/mtd/partition.txt @@ -1,29 +1,49 @@ -Representing flash partitions in devicetree +Flash partitions in device tree +=============================== -Partitions can be represented by sub-nodes of an mtd device. This can be used +Flash devices can be partitioned into one or more functional ranges (e.g. "boot +code", "nvram", "kernel"). + +Different devices may be partitioned in a different ways. Some may use a fixed +flash layout set at production time. Some may use on-flash table that describes +the geometry and naming/purpose of each functional region. It is also possible +to see these methods mixed. + +To assist system software in locating partitions, we allow describing which +method is used for a given flash device. To describe the method there should be +a subnode of the flash device that is named 'partitions'. It must have a +'compatible' property, which is used to identify the method to use. + +We currently only document a binding for fixed layouts. + + +Fixed Partitions +================ + +Partitions can be represented by sub-nodes of a flash device. This can be used on platforms which have strong conventions about which portions of a flash are used for what purposes, but which don't use an on-flash partition table such as RedBoot. -The partition table should be a subnode of the mtd node and should be named +The partition table should be a subnode of the flash node and should be named 'partitions'. This node should have the following property: - compatible : (required) must be "fixed-partitions" Partitions are then defined in subnodes of the partitions node. -For backwards compatibility partitions as direct subnodes of the mtd device are +For backwards compatibility partitions as direct subnodes of the flash device are supported. This use is discouraged. NOTE: also for backwards compatibility, direct subnodes that have a compatible string are not considered partitions, as they may be used for other bindings. #address-cells & #size-cells must both be present in the partitions subnode of the -mtd device. There are two valid values for both: +flash device. There are two valid values for both: <1>: for partitions that require a single 32-bit cell to represent their size/address (aka the value is below 4 GiB) <2>: for partitions that require two 32-bit cells to represent their size/address (aka the value is 4 GiB or greater). Required properties: -- reg : The partition's offset and size within the mtd bank. +- reg : The partition's offset and size within the flash Optional properties: - label : The label / name for this partition. If omitted, the label is taken diff --git a/Documentation/devicetree/bindings/mux/adi,adg792a.txt b/Documentation/devicetree/bindings/mux/adi,adg792a.txt new file mode 100644 index 000000000000..96b787a69f50 --- /dev/null +++ b/Documentation/devicetree/bindings/mux/adi,adg792a.txt @@ -0,0 +1,75 @@ +Bindings for Analog Devices ADG792A/G Triple 4:1 Multiplexers + +Required properties: +- compatible : "adi,adg792a" or "adi,adg792g" +- #mux-control-cells : <0> if parallel (the three muxes are bound together + with a single mux controller controlling all three muxes), or <1> if + not (one mux controller for each mux). +* Standard mux-controller bindings as described in mux-controller.txt + +Optional properties for ADG792G: +- gpio-controller : if present, #gpio-cells below is required. +- #gpio-cells : should be <2> + - First cell is the GPO line number, i.e. 0 or 1 + - Second cell is used to specify active high (0) + or active low (1) + +Optional properties: +- idle-state : if present, array of states that the mux controllers will have + when idle. The special state MUX_IDLE_AS_IS is the default and + MUX_IDLE_DISCONNECT is also supported. + +States 0 through 3 correspond to signals A through D in the datasheet. + +Example: + + /* + * Three independent mux controllers (of which one is used). + * Mux 0 is disconnected when idle, mux 1 idles in the previously + * selected state and mux 2 idles with signal B. + */ + &i2c0 { + mux: mux-controller@50 { + compatible = "adi,adg792a"; + reg = <0x50>; + #mux-control-cells = <1>; + + idle-state = <MUX_IDLE_DISCONNECT MUX_IDLE_AS_IS 1>; + }; + }; + + adc-mux { + compatible = "io-channel-mux"; + io-channels = <&adc 0>; + io-channel-names = "parent"; + + mux-controls = <&mux 2>; + + channels = "sync-1", "", "out"; + }; + + + /* + * Three parallel muxes with one mux controller, useful e.g. if + * the adc is differential, thus needing two signals to be muxed + * simultaneously for correct operation. + */ + &i2c0 { + pmux: mux-controller@50 { + compatible = "adi,adg792a"; + reg = <0x50>; + #mux-control-cells = <0>; + + idle-state = <1>; + }; + }; + + diff-adc-mux { + compatible = "io-channel-mux"; + io-channels = <&adc 0>; + io-channel-names = "parent"; + + mux-controls = <&pmux>; + + channels = "sync-1", "", "out"; + }; diff --git a/Documentation/devicetree/bindings/mux/gpio-mux.txt b/Documentation/devicetree/bindings/mux/gpio-mux.txt new file mode 100644 index 000000000000..b8f746344d80 --- /dev/null +++ b/Documentation/devicetree/bindings/mux/gpio-mux.txt @@ -0,0 +1,69 @@ +GPIO-based multiplexer controller bindings + +Define what GPIO pins are used to control a multiplexer. Or several +multiplexers, if the same pins control more than one multiplexer. + +Required properties: +- compatible : "gpio-mux" +- mux-gpios : list of gpios used to control the multiplexer, least + significant bit first. +- #mux-control-cells : <0> +* Standard mux-controller bindings as decribed in mux-controller.txt + +Optional properties: +- idle-state : if present, the state the mux will have when idle. The + special state MUX_IDLE_AS_IS is the default. + +The multiplexer state is defined as the number represented by the +multiplexer GPIO pins, where the first pin is the least significant +bit. An active pin is a binary 1, an inactive pin is a binary 0. + +Example: + + mux: mux-controller { + compatible = "gpio-mux"; + #mux-control-cells = <0>; + + mux-gpios = <&pioA 0 GPIO_ACTIVE_HIGH>, + <&pioA 1 GPIO_ACTIVE_HIGH>; + }; + + adc-mux { + compatible = "io-channel-mux"; + io-channels = <&adc 0>; + io-channel-names = "parent"; + + mux-controls = <&mux>; + + channels = "sync-1", "in", "out", "sync-2"; + }; + + i2c-mux { + compatible = "i2c-mux"; + i2c-parent = <&i2c1>; + + mux-controls = <&mux>; + + #address-cells = <1>; + #size-cells = <0>; + + i2c@0 { + reg = <0>; + #address-cells = <1>; + #size-cells = <0>; + + ssd1307: oled@3c { + /* ... */ + }; + }; + + i2c@3 { + reg = <3>; + #address-cells = <1>; + #size-cells = <0>; + + pca9555: pca9555@20 { + /* ... */ + }; + }; + }; diff --git a/Documentation/devicetree/bindings/mux/mmio-mux.txt b/Documentation/devicetree/bindings/mux/mmio-mux.txt new file mode 100644 index 000000000000..a9bfb4d8b6ac --- /dev/null +++ b/Documentation/devicetree/bindings/mux/mmio-mux.txt @@ -0,0 +1,60 @@ +MMIO register bitfield-based multiplexer controller bindings + +Define register bitfields to be used to control multiplexers. The parent +device tree node must be a syscon node to provide register access. + +Required properties: +- compatible : "mmio-mux" +- #mux-control-cells : <1> +- mux-reg-masks : an array of register offset and pre-shifted bitfield mask + pairs, each describing a single mux control. +* Standard mux-controller bindings as decribed in mux-controller.txt + +Optional properties: +- idle-states : if present, the state the muxes will have when idle. The + special state MUX_IDLE_AS_IS is the default. + +The multiplexer state of each multiplexer is defined as the value of the +bitfield described by the corresponding register offset and bitfield mask pair +in the mux-reg-masks array, accessed through the parent syscon. + +Example: + + syscon { + compatible = "syscon"; + + mux: mux-controller { + compatible = "mmio-mux"; + #mux-control-cells = <1>; + + mux-reg-masks = <0x3 0x30>, /* 0: reg 0x3, bits 5:4 */ + <0x3 0x40>, /* 1: reg 0x3, bit 6 */ + idle-states = <MUX_IDLE_AS_IS>, <0>; + }; + }; + + video-mux { + compatible = "video-mux"; + mux-controls = <&mux 0>; + + ports { + /* inputs 0..3 */ + port@0 { + reg = <0>; + }; + port@1 { + reg = <1>; + }; + port@2 { + reg = <2>; + }; + port@3 { + reg = <3>; + }; + + /* output */ + port@4 { + reg = <4>; + }; + }; + }; diff --git a/Documentation/devicetree/bindings/mux/mux-controller.txt b/Documentation/devicetree/bindings/mux/mux-controller.txt new file mode 100644 index 000000000000..4f47e4bd2fa0 --- /dev/null +++ b/Documentation/devicetree/bindings/mux/mux-controller.txt @@ -0,0 +1,157 @@ +Common multiplexer controller bindings +====================================== + +A multiplexer (or mux) controller will have one, or several, consumer devices +that uses the mux controller. Thus, a mux controller can possibly control +several parallel multiplexers. Presumably there will be at least one +multiplexer needed by each consumer, but a single mux controller can of course +control several multiplexers for a single consumer. + +A mux controller provides a number of states to its consumers, and the state +space is a simple zero-based enumeration. I.e. 0-1 for a 2-way multiplexer, +0-7 for an 8-way multiplexer, etc. + + +Consumers +--------- + +Mux controller consumers should specify a list of mux controllers that they +want to use with a property containing a 'mux-ctrl-list': + + mux-ctrl-list ::= <single-mux-ctrl> [mux-ctrl-list] + single-mux-ctrl ::= <mux-ctrl-phandle> [mux-ctrl-specifier] + mux-ctrl-phandle : phandle to mux controller node + mux-ctrl-specifier : array of #mux-control-cells specifying the + given mux controller (controller specific) + +Mux controller properties should be named "mux-controls". The exact meaning of +each mux controller property must be documented in the device tree binding for +each consumer. An optional property "mux-control-names" may contain a list of +strings to label each of the mux controllers listed in the "mux-controls" +property. + +Drivers for devices that use more than a single mux controller can use the +"mux-control-names" property to map the name of the requested mux controller +to an index into the list given by the "mux-controls" property. + +mux-ctrl-specifier typically encodes the chip-relative mux controller number. +If the mux controller chip only provides a single mux controller, the +mux-ctrl-specifier can typically be left out. + +Example: + + /* One consumer of a 2-way mux controller (one GPIO-line) */ + mux: mux-controller { + compatible = "gpio-mux"; + #mux-control-cells = <0>; + + mux-gpios = <&pioA 0 GPIO_ACTIVE_HIGH>; + }; + + adc-mux { + compatible = "io-channel-mux"; + io-channels = <&adc 0>; + io-channel-names = "parent"; + + mux-controls = <&mux>; + mux-control-names = "adc"; + + channels = "sync", "in"; + }; + +Note that in the example above, specifying the "mux-control-names" is redundant +because there is only one mux controller in the list. However, if the driver +for the consumer node in fact asks for a named mux controller, that name is of +course still required. + + /* + * Two consumers (one for an ADC line and one for an i2c bus) of + * parallel 4-way multiplexers controlled by the same two GPIO-lines. + */ + mux: mux-controller { + compatible = "gpio-mux"; + #mux-control-cells = <0>; + + mux-gpios = <&pioA 0 GPIO_ACTIVE_HIGH>, + <&pioA 1 GPIO_ACTIVE_HIGH>; + }; + + adc-mux { + compatible = "io-channel-mux"; + io-channels = <&adc 0>; + io-channel-names = "parent"; + + mux-controls = <&mux>; + + channels = "sync-1", "in", "out", "sync-2"; + }; + + i2c-mux { + compatible = "i2c-mux"; + i2c-parent = <&i2c1>; + + mux-controls = <&mux>; + + #address-cells = <1>; + #size-cells = <0>; + + i2c@0 { + reg = <0>; + #address-cells = <1>; + #size-cells = <0>; + + ssd1307: oled@3c { + /* ... */ + }; + }; + + i2c@3 { + reg = <3>; + #address-cells = <1>; + #size-cells = <0>; + + pca9555: pca9555@20 { + /* ... */ + }; + }; + }; + + +Mux controller nodes +-------------------- + +Mux controller nodes must specify the number of cells used for the +specifier using the '#mux-control-cells' property. + +Optionally, mux controller nodes can also specify the state the mux should +have when it is idle. The idle-state property is used for this. If the +idle-state is not present, the mux controller is typically left as is when +it is idle. For multiplexer chips that expose several mux controllers, the +idle-state property is an array with one idle state for each mux controller. + +The special value (-1) may be used to indicate that the mux should be left +as is when it is idle. This is the default, but can still be useful for +mux controller chips with more than one mux controller, particularly when +there is a need to "step past" a mux controller and set some other idle +state for a mux controller with a higher index. + +Some mux controllers have the ability to disconnect the input/output of the +multiplexer. Using this disconnected high-impedance state as the idle state +is indicated with idle state (-2). + +These constants are available in + + #include <dt-bindings/mux/mux.h> + +as MUX_IDLE_AS_IS (-1) and MUX_IDLE_DISCONNECT (-2). + +An example mux controller node look like this (the adg972a chip is a triple +4-way multiplexer): + + mux: mux-controller@50 { + compatible = "adi,adg792a"; + reg = <0x50>; + #mux-control-cells = <1>; + + idle-state = <MUX_IDLE_DISCONNECT MUX_IDLE_AS_IS 2>; + }; diff --git a/Documentation/devicetree/bindings/net/cortina.txt b/Documentation/devicetree/bindings/net/cortina.txt new file mode 100644 index 000000000000..40d0bd984113 --- /dev/null +++ b/Documentation/devicetree/bindings/net/cortina.txt @@ -0,0 +1,21 @@ +Cortina Phy Driver Device Tree Bindings +--------------------------------------- + +CORTINA is a registered trademark of Cortina Systems, Inc. + +The driver supports the Cortina Electronic Dispersion Compensation (EDC) +devices, equipped with clock and data recovery (CDR) circuits. These +devices make use of registers that are not compatible with Clause 45 or +Clause 22, therefore they need to be described using the +"ethernet-phy-id" compatible. + +Since the driver only implements polling mode support, interrupts info +can be skipped. + +Example (CS4340 phy): + mdio { + cs4340_phy@10 { + compatible = "ethernet-phy-id13e5.1002"; + reg = <0x10>; + }; + }; diff --git a/Documentation/devicetree/bindings/net/dsa/b53.txt b/Documentation/devicetree/bindings/net/dsa/b53.txt index d6c6e41648d4..8acf51a4dfa8 100644 --- a/Documentation/devicetree/bindings/net/dsa/b53.txt +++ b/Documentation/devicetree/bindings/net/dsa/b53.txt @@ -13,6 +13,9 @@ Required properties: "brcm,bcm5397" "brcm,bcm5398" + For the BCM11360 SoC, must be: + "brcm,bcm11360-srab" and the mandatory "brcm,cygnus-srab" string + For the BCM5310x SoCs with an integrated switch, must be one of: "brcm,bcm53010-srab" "brcm,bcm53011-srab" @@ -34,7 +37,7 @@ Required properties: "brcm,bcm6328-switch" "brcm,bcm6368-switch" and the mandatory "brcm,bcm63xx-switch" -See Documentation/devicetree/bindings/dsa/dsa.txt for a list of additional +See Documentation/devicetree/bindings/net/dsa/dsa.txt for a list of additional required and optional properties. Examples: diff --git a/Documentation/devicetree/bindings/net/dsa/ksz.txt b/Documentation/devicetree/bindings/net/dsa/ksz.txt new file mode 100644 index 000000000000..0ab8b39d0b30 --- /dev/null +++ b/Documentation/devicetree/bindings/net/dsa/ksz.txt @@ -0,0 +1,72 @@ +Microchip KSZ Series Ethernet switches +================================== + +Required properties: + +- compatible: For external switch chips, compatible string must be exactly one + of: "microchip,ksz9477" + +See Documentation/devicetree/bindings/dsa/dsa.txt for a list of additional +required and optional properties. + +Examples: + +Ethernet switch connected via SPI to the host, CPU port wired to eth0: + + eth0: ethernet@10001000 { + fixed-link { + speed = <1000>; + full-duplex; + }; + }; + + spi1: spi@f8008000 { + pinctrl-0 = <&pinctrl_spi_ksz>; + cs-gpios = <&pioC 25 0>; + id = <1>; + status = "okay"; + + ksz9477: ksz9477@0 { + compatible = "microchip,ksz9477"; + reg = <0>; + + spi-max-frequency = <44000000>; + spi-cpha; + spi-cpol; + + status = "okay"; + ports { + #address-cells = <1>; + #size-cells = <0>; + port@0 { + reg = <0>; + label = "lan1"; + }; + port@1 { + reg = <1>; + label = "lan2"; + }; + port@2 { + reg = <2>; + label = "lan3"; + }; + port@3 { + reg = <3>; + label = "lan4"; + }; + port@4 { + reg = <4>; + label = "lan5"; + }; + port@5 { + reg = <5>; + label = "cpu"; + ethernet = <ð0>; + fixed-link { + speed = <1000>; + full-duplex; + }; + }; + }; + }; + }; diff --git a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt new file mode 100644 index 000000000000..725f3b187886 --- /dev/null +++ b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt @@ -0,0 +1,84 @@ +* Allwinner sun8i GMAC ethernet controller + +This device is a platform glue layer for stmmac. +Please see stmmac.txt for the other unchanged properties. + +Required properties: +- compatible: should be one of the following string: + "allwinner,sun8i-a83t-emac" + "allwinner,sun8i-h3-emac" + "allwinner,sun8i-v3s-emac" + "allwinner,sun50i-a64-emac" +- reg: address and length of the register for the device. +- interrupts: interrupt for the device +- interrupt-names: should be "macirq" +- clocks: A phandle to the reference clock for this device +- clock-names: should be "stmmaceth" +- resets: A phandle to the reset control for this device +- reset-names: should be "stmmaceth" +- phy-mode: See ethernet.txt +- phy-handle: See ethernet.txt +- #address-cells: shall be 1 +- #size-cells: shall be 0 +- syscon: A phandle to the syscon of the SoC with one of the following + compatible string: + - allwinner,sun8i-h3-system-controller + - allwinner,sun8i-v3s-system-controller + - allwinner,sun50i-a64-system-controller + - allwinner,sun8i-a83t-system-controller + +Optional properties: +- allwinner,tx-delay-ps: TX clock delay chain value in ps. Range value is 0-700. Default is 0) +- allwinner,rx-delay-ps: RX clock delay chain value in ps. Range value is 0-3100. Default is 0) +Both delay properties need to be a multiple of 100. They control the delay for +external PHY. + +Optional properties for the following compatibles: + - "allwinner,sun8i-h3-emac", + - "allwinner,sun8i-v3s-emac": +- allwinner,leds-active-low: EPHY LEDs are active low + +Required child node of emac: +- mdio bus node: should be named mdio + +Required properties of the mdio node: +- #address-cells: shall be 1 +- #size-cells: shall be 0 + +The device node referenced by "phy" or "phy-handle" should be a child node +of the mdio node. See phy.txt for the generic PHY bindings. + +Required properties of the phy node with the following compatibles: + - "allwinner,sun8i-h3-emac", + - "allwinner,sun8i-v3s-emac": +- clocks: a phandle to the reference clock for the EPHY +- resets: a phandle to the reset control for the EPHY + +Example: + +emac: ethernet@1c0b000 { + compatible = "allwinner,sun8i-h3-emac"; + syscon = <&syscon>; + reg = <0x01c0b000 0x104>; + interrupts = <GIC_SPI 82 IRQ_TYPE_LEVEL_HIGH>; + interrupt-names = "macirq"; + resets = <&ccu RST_BUS_EMAC>; + reset-names = "stmmaceth"; + clocks = <&ccu CLK_BUS_EMAC>; + clock-names = "stmmaceth"; + #address-cells = <1>; + #size-cells = <0>; + + phy-handle = <&int_mii_phy>; + phy-mode = "mii"; + allwinner,leds-active-low; + mdio: mdio { + #address-cells = <1>; + #size-cells = <0>; + int_mii_phy: ethernet-phy@1 { + reg = <1>; + clocks = <&ccu CLK_BUS_EPHY>; + resets = <&ccu RST_BUS_EPHY>; + }; + }; +}; diff --git a/Documentation/devicetree/bindings/net/ethernet.txt b/Documentation/devicetree/bindings/net/ethernet.txt index 3a6916909d90..7da86f22a13b 100644 --- a/Documentation/devicetree/bindings/net/ethernet.txt +++ b/Documentation/devicetree/bindings/net/ethernet.txt @@ -8,9 +8,11 @@ The following properties are common to the Ethernet controllers: property; - max-speed: number, specifies maximum speed in Mbit/s supported by the device; - max-frame-size: number, maximum transfer unit (IEEE defined MTU), rather than - the maximum frame size (there's contradiction in ePAPR). + the maximum frame size (there's contradiction in the Devicetree + Specification). - phy-mode: string, operation mode of the PHY interface. This is now a de-facto standard property; supported values are: + * "internal" * "mii" * "gmii" * "sgmii" @@ -32,9 +34,13 @@ The following properties are common to the Ethernet controllers: * "2000base-x", * "2500base-x", * "rxaui" -- phy-connection-type: the same as "phy-mode" property but described in ePAPR; + * "xaui" + * "10gbase-kr" (10GBASE-KR, XFI, SFI) +- phy-connection-type: the same as "phy-mode" property but described in the + Devicetree Specification; - phy-handle: phandle, specifies a reference to a node representing a PHY - device; this property is described in ePAPR and so preferred; + device; this property is described in the Devicetree Specification and so + preferred; - phy: the same as "phy-handle" property, not recommended for new bindings. - phy-device: the same as "phy-handle" property, not recommended for new bindings. diff --git a/Documentation/devicetree/bindings/powerpc/fsl/fman.txt b/Documentation/devicetree/bindings/net/fsl-fman.txt index df873d1f3b7c..df873d1f3b7c 100644 --- a/Documentation/devicetree/bindings/powerpc/fsl/fman.txt +++ b/Documentation/devicetree/bindings/net/fsl-fman.txt diff --git a/Documentation/devicetree/bindings/net/gpmc-eth.txt b/Documentation/devicetree/bindings/net/gpmc-eth.txt index ace4a64b3695..f7da3d73ca1b 100644 --- a/Documentation/devicetree/bindings/net/gpmc-eth.txt +++ b/Documentation/devicetree/bindings/net/gpmc-eth.txt @@ -9,7 +9,7 @@ the GPMC controller with an "ethernet" name. All timing relevant properties as well as generic GPMC child properties are explained in a separate documents. Please refer to -Documentation/devicetree/bindings/bus/ti-gpmc.txt +Documentation/devicetree/bindings/memory-controllers/omap-gpmc.txt For the properties relevant to the ethernet controller connected to the GPMC refer to the binding documentation of the device. For example, the documentation @@ -43,7 +43,7 @@ Required properties: Optional properties: - gpmc,XXX Additional GPMC timings and settings parameters. See - Documentation/devicetree/bindings/bus/ti-gpmc.txt + Documentation/devicetree/bindings/memory-controllers/omap-gpmc.txt Example: diff --git a/Documentation/devicetree/bindings/net/macb.txt b/Documentation/devicetree/bindings/net/macb.txt index 1506e948610c..27966ae741e0 100644 --- a/Documentation/devicetree/bindings/net/macb.txt +++ b/Documentation/devicetree/bindings/net/macb.txt @@ -22,6 +22,7 @@ Required properties: Required elements: 'pclk', 'hclk' Optional elements: 'tx_clk' Optional elements: 'rx_clk' applies to cdns,zynqmp-gem + Optional elements: 'tsu_clk' - clocks: Phandles to input clocks. Optional properties for PHY child node: diff --git a/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt b/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt index ccdabdcc8618..42cd81090a2c 100644 --- a/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt +++ b/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt @@ -1,12 +1,14 @@ * Marvell MDIO Ethernet Controller interface The Ethernet controllers of the Marvel Kirkwood, Dove, Orion5x, -MV78xx0, Armada 370 and Armada XP have an identical unit that provides -an interface with the MDIO bus. This driver handles this MDIO -interface. +MV78xx0, Armada 370, Armada XP, Armada 7k and Armada 8k have an +identical unit that provides an interface with the MDIO bus. +Additionally, Armada 7k and Armada 8k has a second unit which +provides an interface with the xMDIO bus. This driver handles +these interfaces. Required properties: -- compatible: "marvell,orion-mdio" +- compatible: "marvell,orion-mdio" or "marvell,xmdio" - reg: address and length of the MDIO registers. When an interrupt is not present, the length is the size of the SMI register (4 bytes) otherwise it must be 0x84 bytes to cover the interrupt control diff --git a/Documentation/devicetree/bindings/net/nfc/trf7970a.txt b/Documentation/devicetree/bindings/net/nfc/trf7970a.txt index c627bbb3009e..60c833d62181 100644 --- a/Documentation/devicetree/bindings/net/nfc/trf7970a.txt +++ b/Documentation/devicetree/bindings/net/nfc/trf7970a.txt @@ -13,14 +13,10 @@ Optional SoC Specific Properties: - pinctrl-names: Contains only one value - "default". - pintctrl-0: Specifies the pin control groups used for this controller. - autosuspend-delay: Specify autosuspend delay in milliseconds. -- vin-voltage-override: Specify voltage of VIN pin in microvolts. - irq-status-read-quirk: Specify that the trf7970a being used has the "IRQ Status Read" erratum. - en2-rf-quirk: Specify that the trf7970a being used has the "EN2 RF" erratum. -- t5t-rmb-extra-byte-quirk: Specify that the trf7970a has the erratum - where an extra byte is returned by Read Multiple Block commands issued - to Type 5 tags. - vdd-io-supply: Regulator specifying voltage for vdd-io - clock-frequency: Set to specify that the input frequency to the trf7970a is 13560000Hz or 27120000Hz @@ -37,15 +33,13 @@ Example (for ARM-based BeagleBone with TRF7970A on SPI1): spi-max-frequency = <2000000>; interrupt-parent = <&gpio2>; interrupts = <14 0>; - ti,enable-gpios = <&gpio2 2 GPIO_ACTIVE_LOW>, - <&gpio2 5 GPIO_ACTIVE_LOW>; + ti,enable-gpios = <&gpio2 2 GPIO_ACTIVE_HIGH>, + <&gpio2 5 GPIO_ACTIVE_HIGH>; vin-supply = <&ldo3_reg>; - vin-voltage-override = <5000000>; vdd-io-supply = <&ldo2_reg>; autosuspend-delay = <30000>; irq-status-read-quirk; en2-rf-quirk; - t5t-rmb-extra-byte-quirk; clock-frequency = <27120000>; status = "okay"; }; diff --git a/Documentation/devicetree/bindings/net/qca,qca7000.txt b/Documentation/devicetree/bindings/net/qca,qca7000.txt new file mode 100644 index 000000000000..6d9efb2eb9a5 --- /dev/null +++ b/Documentation/devicetree/bindings/net/qca,qca7000.txt @@ -0,0 +1,88 @@ +* Qualcomm QCA7000 + +The QCA7000 is a serial-to-powerline bridge with a host interface which could +be configured either as SPI or UART slave. This configuration is done by +the QCA7000 firmware. + +(a) Ethernet over SPI + +In order to use the QCA7000 as SPI device it must be defined as a child of a +SPI master in the device tree. + +Required properties: +- compatible : Should be "qca,qca7000" +- reg : Should specify the SPI chip select +- interrupts : The first cell should specify the index of the source + interrupt and the second cell should specify the trigger + type as rising edge +- spi-cpha : Must be set +- spi-cpol : Must be set + +Optional properties: +- interrupt-parent : Specify the pHandle of the source interrupt +- spi-max-frequency : Maximum frequency of the SPI bus the chip can operate at. + Numbers smaller than 1000000 or greater than 16000000 + are invalid. Missing the property will set the SPI + frequency to 8000000 Hertz. +- local-mac-address : see ./ethernet.txt +- qca,legacy-mode : Set the SPI data transfer of the QCA7000 to legacy mode. + In this mode the SPI master must toggle the chip select + between each data word. In burst mode these gaps aren't + necessary, which is faster. This setting depends on how + the QCA7000 is setup via GPIO pin strapping. If the + property is missing the driver defaults to burst mode. + +SPI Example: + +/* Freescale i.MX28 SPI master*/ +ssp2: spi@80014000 { + #address-cells = <1>; + #size-cells = <0>; + compatible = "fsl,imx28-spi"; + pinctrl-names = "default"; + pinctrl-0 = <&spi2_pins_a>; + status = "okay"; + + qca7000: ethernet@0 { + compatible = "qca,qca7000"; + reg = <0x0>; + interrupt-parent = <&gpio3>; /* GPIO Bank 3 */ + interrupts = <25 0x1>; /* Index: 25, rising edge */ + spi-cpha; /* SPI mode: CPHA=1 */ + spi-cpol; /* SPI mode: CPOL=1 */ + spi-max-frequency = <8000000>; /* freq: 8 MHz */ + local-mac-address = [ A0 B0 C0 D0 E0 F0 ]; + }; +}; + +(b) Ethernet over UART + +In order to use the QCA7000 as UART slave it must be defined as a child of a +UART master in the device tree. It is possible to preconfigure the UART +settings of the QCA7000 firmware, but it's not possible to change them during +runtime. + +Required properties: +- compatible : Should be "qca,qca7000" + +Optional properties: +- local-mac-address : see ./ethernet.txt +- current-speed : current baud rate of QCA7000 which defaults to 115200 + if absent, see also ../serial/slave-device.txt + +UART Example: + +/* Freescale i.MX28 UART */ +auart0: serial@8006a000 { + compatible = "fsl,imx28-auart", "fsl,imx23-auart"; + reg = <0x8006a000 0x2000>; + pinctrl-names = "default"; + pinctrl-0 = <&auart0_2pins_a>; + status = "okay"; + + qca7000: ethernet { + compatible = "qca,qca7000"; + local-mac-address = [ A0 B0 C0 D0 E0 F0 ]; + current-speed = <38400>; + }; +}; diff --git a/Documentation/devicetree/bindings/net/qca-qca7000-spi.txt b/Documentation/devicetree/bindings/net/qca-qca7000-spi.txt deleted file mode 100644 index c74989c0d8ac..000000000000 --- a/Documentation/devicetree/bindings/net/qca-qca7000-spi.txt +++ /dev/null @@ -1,47 +0,0 @@ -* Qualcomm QCA7000 (Ethernet over SPI protocol) - -Note: The QCA7000 is useable as a SPI device. In this case it must be defined -as a child of a SPI master in the device tree. - -Required properties: -- compatible : Should be "qca,qca7000" -- reg : Should specify the SPI chip select -- interrupts : The first cell should specify the index of the source interrupt - and the second cell should specify the trigger type as rising edge -- spi-cpha : Must be set -- spi-cpol: Must be set - -Optional properties: -- interrupt-parent : Specify the pHandle of the source interrupt -- spi-max-frequency : Maximum frequency of the SPI bus the chip can operate at. - Numbers smaller than 1000000 or greater than 16000000 are invalid. Missing - the property will set the SPI frequency to 8000000 Hertz. -- local-mac-address: 6 bytes, MAC address -- qca,legacy-mode : Set the SPI data transfer of the QCA7000 to legacy mode. - In this mode the SPI master must toggle the chip select between each data - word. In burst mode these gaps aren't necessary, which is faster. - This setting depends on how the QCA7000 is setup via GPIO pin strapping. - If the property is missing the driver defaults to burst mode. - -Example: - -/* Freescale i.MX28 SPI master*/ -ssp2: spi@80014000 { - #address-cells = <1>; - #size-cells = <0>; - compatible = "fsl,imx28-spi"; - pinctrl-names = "default"; - pinctrl-0 = <&spi2_pins_a>; - status = "okay"; - - qca7000: ethernet@0 { - compatible = "qca,qca7000"; - reg = <0x0>; - interrupt-parent = <&gpio3>; /* GPIO Bank 3 */ - interrupts = <25 0x1>; /* Index: 25, rising edge */ - spi-cpha; /* SPI mode: CPHA=1 */ - spi-cpol; /* SPI mode: CPOL=1 */ - spi-max-frequency = <8000000>; /* freq: 8 MHz */ - local-mac-address = [ A0 B0 C0 D0 E0 F0 ]; - }; -}; diff --git a/Documentation/devicetree/bindings/net/smsc911x.txt b/Documentation/devicetree/bindings/net/smsc911x.txt index 16c3a9501f5d..acfafc8e143c 100644 --- a/Documentation/devicetree/bindings/net/smsc911x.txt +++ b/Documentation/devicetree/bindings/net/smsc911x.txt @@ -27,6 +27,7 @@ Optional properties: of the device. On many systems this is wired high so the device goes out of reset at power-on, but if it is under program control, this optional GPIO can wake up in response to it. +- vdd33a-supply, vddvario-supply : 3.3V analog and IO logic power supplies Examples: diff --git a/Documentation/devicetree/bindings/net/ti,dp83867.txt b/Documentation/devicetree/bindings/net/ti,dp83867.txt index afe9630a5e7d..02c4353b5cf2 100644 --- a/Documentation/devicetree/bindings/net/ti,dp83867.txt +++ b/Documentation/devicetree/bindings/net/ti,dp83867.txt @@ -18,6 +18,13 @@ Optional property: - ti,max-output-impedance - MAC Interface Impedance control to set the programmable output impedance to maximum value (70 ohms). + - ti,dp83867-rxctrl-strap-quirk - This denotes the fact that the + board has RX_DV/RX_CTRL pin strapped in + mode 1 or 2. To ensure PHY operation, + there are specific actions that + software needs to take when this pin is + strapped in these modes. See data manual + for details. Note: ti,min-output-impedance and ti,max-output-impedance are mutually exclusive. When both properties are present ti,max-output-impedance diff --git a/Documentation/devicetree/bindings/net/ti,wilink-st.txt b/Documentation/devicetree/bindings/net/ti,wilink-st.txt index cbad73a84ac4..1649c1f66b07 100644 --- a/Documentation/devicetree/bindings/net/ti,wilink-st.txt +++ b/Documentation/devicetree/bindings/net/ti,wilink-st.txt @@ -14,6 +14,12 @@ Required properties: - compatible: should be one of the following: "ti,wl1271-st" "ti,wl1273-st" + "ti,wl1281-st" + "ti,wl1283-st" + "ti,wl1285-st" + "ti,wl1801-st" + "ti,wl1805-st" + "ti,wl1807-st" "ti,wl1831-st" "ti,wl1835-st" "ti,wl1837-st" @@ -22,6 +28,10 @@ Optional properties: - enable-gpios : GPIO signal controlling enabling of BT. Active high. - vio-supply : Vio input supply (1.8V) - vbat-supply : Vbat input supply (2.9-4.8V) + - clocks : Must contain an entry, for each entry in clock-names. + See ../clocks/clock-bindings.txt for details. + - clock-names : Must include the following entry: + "ext_clock" (External clock provided to the TI combo chip). Example: @@ -31,5 +41,7 @@ Example: bluetooth { compatible = "ti,wl1835-st"; enable-gpios = <&gpio1 7 GPIO_ACTIVE_HIGH>; + clocks = <&clk32k_wl18xx>; + clock-names = "ext_clock"; }; }; diff --git a/Documentation/devicetree/bindings/net/wireless/brcm,bcm43xx-fmac.txt b/Documentation/devicetree/bindings/net/wireless/brcm,bcm43xx-fmac.txt index 5dbf169cd81c..590f622188de 100644 --- a/Documentation/devicetree/bindings/net/wireless/brcm,bcm43xx-fmac.txt +++ b/Documentation/devicetree/bindings/net/wireless/brcm,bcm43xx-fmac.txt @@ -31,7 +31,7 @@ mmc3: mmc@01c12000 { non-removable; status = "okay"; - brcmf: bcrmf@1 { + brcmf: wifi@1 { reg = <1>; compatible = "brcm,bcm4329-fmac"; interrupt-parent = <&pio>; diff --git a/Documentation/devicetree/bindings/net/wireless/ti,wlcore.txt b/Documentation/devicetree/bindings/net/wireless/ti,wlcore.txt index 2a3d90de18ee..7b2cbb14113e 100644 --- a/Documentation/devicetree/bindings/net/wireless/ti,wlcore.txt +++ b/Documentation/devicetree/bindings/net/wireless/ti,wlcore.txt @@ -10,6 +10,7 @@ Required properties: * "ti,wl1273" * "ti,wl1281" * "ti,wl1283" + * "ti,wl1285" * "ti,wl1801" * "ti,wl1805" * "ti,wl1807" diff --git a/Documentation/devicetree/bindings/nvmem/rockchip-efuse.txt b/Documentation/devicetree/bindings/nvmem/rockchip-efuse.txt index 94aeeeabadd5..194926f77194 100644 --- a/Documentation/devicetree/bindings/nvmem/rockchip-efuse.txt +++ b/Documentation/devicetree/bindings/nvmem/rockchip-efuse.txt @@ -4,6 +4,7 @@ Required properties: - compatible: Should be one of the following. - "rockchip,rk3066a-efuse" - for RK3066a SoCs. - "rockchip,rk3188-efuse" - for RK3188 SoCs. + - "rockchip,rk322x-efuse" - for RK322x SoCs. - "rockchip,rk3288-efuse" - for RK3288 SoCs. - "rockchip,rk3399-efuse" - for RK3399 SoCs. - reg: Should contain the registers location and exact eFuse size diff --git a/Documentation/devicetree/bindings/opp/opp.txt b/Documentation/devicetree/bindings/opp/opp.txt index 63725498bd20..e36d261b9ba6 100644 --- a/Documentation/devicetree/bindings/opp/opp.txt +++ b/Documentation/devicetree/bindings/opp/opp.txt @@ -186,20 +186,20 @@ Example 1: Single cluster Dual-core ARM cortex A9, switch DVFS states together. compatible = "operating-points-v2"; opp-shared; - opp@1000000000 { + opp-1000000000 { opp-hz = /bits/ 64 <1000000000>; opp-microvolt = <975000 970000 985000>; opp-microamp = <70000>; clock-latency-ns = <300000>; opp-suspend; }; - opp@1100000000 { + opp-1100000000 { opp-hz = /bits/ 64 <1100000000>; opp-microvolt = <1000000 980000 1010000>; opp-microamp = <80000>; clock-latency-ns = <310000>; }; - opp@1200000000 { + opp-1200000000 { opp-hz = /bits/ 64 <1200000000>; opp-microvolt = <1025000>; clock-latency-ns = <290000>; @@ -265,20 +265,20 @@ independently. * independently. */ - opp@1000000000 { + opp-1000000000 { opp-hz = /bits/ 64 <1000000000>; opp-microvolt = <975000 970000 985000>; opp-microamp = <70000>; clock-latency-ns = <300000>; opp-suspend; }; - opp@1100000000 { + opp-1100000000 { opp-hz = /bits/ 64 <1100000000>; opp-microvolt = <1000000 980000 1010000>; opp-microamp = <80000>; clock-latency-ns = <310000>; }; - opp@1200000000 { + opp-1200000000 { opp-hz = /bits/ 64 <1200000000>; opp-microvolt = <1025000>; opp-microamp = <90000; @@ -341,20 +341,20 @@ DVFS state together. compatible = "operating-points-v2"; opp-shared; - opp@1000000000 { + opp-1000000000 { opp-hz = /bits/ 64 <1000000000>; opp-microvolt = <975000 970000 985000>; opp-microamp = <70000>; clock-latency-ns = <300000>; opp-suspend; }; - opp@1100000000 { + opp-1100000000 { opp-hz = /bits/ 64 <1100000000>; opp-microvolt = <1000000 980000 1010000>; opp-microamp = <80000>; clock-latency-ns = <310000>; }; - opp@1200000000 { + opp-1200000000 { opp-hz = /bits/ 64 <1200000000>; opp-microvolt = <1025000>; opp-microamp = <90000>; @@ -367,20 +367,20 @@ DVFS state together. compatible = "operating-points-v2"; opp-shared; - opp@1300000000 { + opp-1300000000 { opp-hz = /bits/ 64 <1300000000>; opp-microvolt = <1050000 1045000 1055000>; opp-microamp = <95000>; clock-latency-ns = <400000>; opp-suspend; }; - opp@1400000000 { + opp-1400000000 { opp-hz = /bits/ 64 <1400000000>; opp-microvolt = <1075000>; opp-microamp = <100000>; clock-latency-ns = <400000>; }; - opp@1500000000 { + opp-1500000000 { opp-hz = /bits/ 64 <1500000000>; opp-microvolt = <1100000 1010000 1110000>; opp-microamp = <95000>; @@ -409,7 +409,7 @@ Example 4: Handling multiple regulators compatible = "operating-points-v2"; opp-shared; - opp@1000000000 { + opp-1000000000 { opp-hz = /bits/ 64 <1000000000>; opp-microvolt = <970000>, /* Supply 0 */ <960000>, /* Supply 1 */ @@ -422,7 +422,7 @@ Example 4: Handling multiple regulators /* OR */ - opp@1000000000 { + opp-1000000000 { opp-hz = /bits/ 64 <1000000000>; opp-microvolt = <975000 970000 985000>, /* Supply 0 */ <965000 960000 975000>, /* Supply 1 */ @@ -435,7 +435,7 @@ Example 4: Handling multiple regulators /* OR */ - opp@1000000000 { + opp-1000000000 { opp-hz = /bits/ 64 <1000000000>; opp-microvolt = <975000 970000 985000>, /* Supply 0 */ <965000 960000 975000>, /* Supply 1 */ @@ -467,7 +467,7 @@ Example 5: opp-supported-hw status = "okay"; opp-shared; - opp@600000000 { + opp-600000000 { /* * Supports all substrate and process versions for 0xF * cuts, i.e. only first four cuts. @@ -478,7 +478,7 @@ Example 5: opp-supported-hw ... }; - opp@800000000 { + opp-800000000 { /* * Supports: * - cuts: only one, 6th cut (represented by 6th bit). @@ -510,7 +510,7 @@ Example 6: opp-microvolt-<name>, opp-microamp-<name>: compatible = "operating-points-v2"; opp-shared; - opp@1000000000 { + opp-1000000000 { opp-hz = /bits/ 64 <1000000000>; opp-microvolt-slow = <915000 900000 925000>; opp-microvolt-fast = <975000 970000 985000>; @@ -518,7 +518,7 @@ Example 6: opp-microvolt-<name>, opp-microamp-<name>: opp-microamp-fast = <71000>; }; - opp@1200000000 { + opp-1200000000 { opp-hz = /bits/ 64 <1200000000>; opp-microvolt-slow = <915000 900000 925000>, /* Supply vcc0 */ <925000 910000 935000>; /* Supply vcc1 */ diff --git a/Documentation/devicetree/bindings/pci/faraday,ftpci100.txt b/Documentation/devicetree/bindings/pci/faraday,ftpci100.txt index 35d4a979bb7b..89a84f8aa621 100644 --- a/Documentation/devicetree/bindings/pci/faraday,ftpci100.txt +++ b/Documentation/devicetree/bindings/pci/faraday,ftpci100.txt @@ -30,6 +30,13 @@ Mandatory properties: 128MB, 256MB, 512MB, 1GB or 2GB in size. The memory should be marked as pre-fetchable. +Optional properties: +- clocks: when present, this should contain the peripheral clock (PCLK) and the + PCI clock (PCICLK). If these are not present, they are assumed to be + hard-wired enabled and always on. The PCI clock will be 33 or 66 MHz. +- clock-names: when present, this should contain "PCLK" for the peripheral + clock and "PCICLK" for the PCI-side clock. + Mandatory subnodes: - For "faraday,ftpci100" a node representing the interrupt-controller inside the host bridge is mandatory. It has the following mandatory properties: diff --git a/Documentation/devicetree/bindings/pci/fsl,imx6q-pcie.txt b/Documentation/devicetree/bindings/pci/fsl,imx6q-pcie.txt index e3d5680875b1..cf92d3ba5a26 100644 --- a/Documentation/devicetree/bindings/pci/fsl,imx6q-pcie.txt +++ b/Documentation/devicetree/bindings/pci/fsl,imx6q-pcie.txt @@ -33,6 +33,10 @@ Optional properties: - reset-gpio-active-high: If present then the reset sequence using the GPIO specified in the "reset-gpio" property is reversed (H=reset state, L=operation state). +- vpcie-supply: Should specify the regulator in charge of PCIe port power. + The regulator will be enabled when initializing the PCIe host and + disabled either as part of the init process or when shutting down the + host. Additional required properties for imx6sx-pcie: - clock names: Must include the following additional entries: diff --git a/Documentation/devicetree/bindings/pci/kirin-pcie.txt b/Documentation/devicetree/bindings/pci/kirin-pcie.txt new file mode 100644 index 000000000000..68ffa0fbcd73 --- /dev/null +++ b/Documentation/devicetree/bindings/pci/kirin-pcie.txt @@ -0,0 +1,50 @@ +HiSilicon Kirin SoCs PCIe host DT description + +Kirin PCIe host controller is based on Designware PCI core. +It shares common functions with PCIe Designware core driver +and inherits common properties defined in +Documentation/devicetree/bindings/pci/designware-pci.txt. + +Additional properties are described here: + +Required properties +- compatible: + "hisilicon,kirin960-pcie" for PCIe of Kirin960 SoC +- reg: Should contain rc_dbi, apb, phy, config registers location and length. +- reg-names: Must include the following entries: + "dbi": controller configuration registers; + "apb": apb Ctrl register defined by Kirin; + "phy": apb PHY register defined by Kirin; + "config": PCIe configuration space registers. +- reset-gpios: The gpio to generate PCIe perst assert and deassert signal. + +Optional properties: + +Example based on kirin960: + + pcie@f4000000 { + compatible = "hisilicon,kirin-pcie"; + reg = <0x0 0xf4000000 0x0 0x1000>, <0x0 0xff3fe000 0x0 0x1000>, + <0x0 0xf3f20000 0x0 0x40000>, <0x0 0xF4000000 0 0x2000>; + reg-names = "dbi","apb","phy", "config"; + bus-range = <0x0 0x1>; + #address-cells = <3>; + #size-cells = <2>; + device_type = "pci"; + ranges = <0x02000000 0x0 0x00000000 0x0 0xf5000000 0x0 0x2000000>; + num-lanes = <1>; + #interrupt-cells = <1>; + interrupt-map-mask = <0xf800 0 0 7>; + interrupt-map = <0x0 0 0 1 &gic 0 0 0 282 4>, + <0x0 0 0 2 &gic 0 0 0 283 4>, + <0x0 0 0 3 &gic 0 0 0 284 4>, + <0x0 0 0 4 &gic 0 0 0 285 4>; + clocks = <&crg_ctrl HI3660_PCIEPHY_REF>, + <&crg_ctrl HI3660_CLK_GATE_PCIEAUX>, + <&crg_ctrl HI3660_PCLK_GATE_PCIE_PHY>, + <&crg_ctrl HI3660_PCLK_GATE_PCIE_SYS>, + <&crg_ctrl HI3660_ACLK_GATE_PCIE>; + clock-names = "pcie_phy_ref", "pcie_aux", + "pcie_apb_phy", "pcie_apb_sys", "pcie_aclk"; + reset-gpios = <&gpio11 1 0 >; + }; diff --git a/Documentation/devicetree/bindings/pci/mediatek,mt7623-pcie.txt b/Documentation/devicetree/bindings/pci/mediatek,mt7623-pcie.txt new file mode 100644 index 000000000000..fe80dda9bf73 --- /dev/null +++ b/Documentation/devicetree/bindings/pci/mediatek,mt7623-pcie.txt @@ -0,0 +1,130 @@ +MediaTek Gen2 PCIe controller which is available on MT7623 series SoCs + +PCIe subsys supports single root complex (RC) with 3 Root Ports. Each root +ports supports a Gen2 1-lane Link and has PIPE interface to PHY. + +Required properties: +- compatible: Should contain "mediatek,mt7623-pcie". +- device_type: Must be "pci" +- reg: Base addresses and lengths of the PCIe controller. +- #address-cells: Address representation for root ports (must be 3) +- #size-cells: Size representation for root ports (must be 2) +- #interrupt-cells: Size representation for interrupts (must be 1) +- interrupt-map-mask and interrupt-map: Standard PCI IRQ mapping properties + Please refer to the standard PCI bus binding document for a more detailed + explanation. +- clocks: Must contain an entry for each entry in clock-names. + See ../clocks/clock-bindings.txt for details. +- clock-names: Must include the following entries: + - free_ck :for reference clock of PCIe subsys + - sys_ck0 :for clock of Port0 + - sys_ck1 :for clock of Port1 + - sys_ck2 :for clock of Port2 +- resets: Must contain an entry for each entry in reset-names. + See ../reset/reset.txt for details. +- reset-names: Must include the following entries: + - pcie-rst0 :port0 reset + - pcie-rst1 :port1 reset + - pcie-rst2 :port2 reset +- phys: List of PHY specifiers (used by generic PHY framework). +- phy-names : Must be "pcie-phy0", "pcie-phy1", "pcie-phyN".. based on the + number of PHYs as specified in *phys* property. +- power-domains: A phandle and power domain specifier pair to the power domain + which is responsible for collapsing and restoring power to the peripheral. +- bus-range: Range of bus numbers associated with this controller. +- ranges: Ranges for the PCI memory and I/O regions. + +In addition, the device tree node must have sub-nodes describing each +PCIe port interface, having the following mandatory properties: + +Required properties: +- device_type: Must be "pci" +- reg: Only the first four bytes are used to refer to the correct bus number + and device number. +- #address-cells: Must be 3 +- #size-cells: Must be 2 +- #interrupt-cells: Must be 1 +- interrupt-map-mask and interrupt-map: Standard PCI IRQ mapping properties + Please refer to the standard PCI bus binding document for a more detailed + explanation. +- ranges: Sub-ranges distributed from the PCIe controller node. An empty + property is sufficient. +- num-lanes: Number of lanes to use for this port. + +Examples: + + hifsys: syscon@1a000000 { + compatible = "mediatek,mt7623-hifsys", + "mediatek,mt2701-hifsys", + "syscon"; + reg = <0 0x1a000000 0 0x1000>; + #clock-cells = <1>; + #reset-cells = <1>; + }; + + pcie: pcie-controller@1a140000 { + compatible = "mediatek,mt7623-pcie"; + device_type = "pci"; + reg = <0 0x1a140000 0 0x1000>, /* PCIe shared registers */ + <0 0x1a142000 0 0x1000>, /* Port0 registers */ + <0 0x1a143000 0 0x1000>, /* Port1 registers */ + <0 0x1a144000 0 0x1000>; /* Port2 registers */ + #address-cells = <3>; + #size-cells = <2>; + #interrupt-cells = <1>; + interrupt-map-mask = <0xf800 0 0 0>; + interrupt-map = <0x0000 0 0 0 &sysirq GIC_SPI 193 IRQ_TYPE_LEVEL_LOW>, + <0x0800 0 0 0 &sysirq GIC_SPI 194 IRQ_TYPE_LEVEL_LOW>, + <0x1000 0 0 0 &sysirq GIC_SPI 195 IRQ_TYPE_LEVEL_LOW>; + clocks = <&topckgen CLK_TOP_ETHIF_SEL>, + <&hifsys CLK_HIFSYS_PCIE0>, + <&hifsys CLK_HIFSYS_PCIE1>, + <&hifsys CLK_HIFSYS_PCIE2>; + clock-names = "free_ck", "sys_ck0", "sys_ck1", "sys_ck2"; + resets = <&hifsys MT2701_HIFSYS_PCIE0_RST>, + <&hifsys MT2701_HIFSYS_PCIE1_RST>, + <&hifsys MT2701_HIFSYS_PCIE2_RST>; + reset-names = "pcie-rst0", "pcie-rst1", "pcie-rst2"; + phys = <&pcie0_phy>, <&pcie1_phy>, <&pcie2_phy>; + phy-names = "pcie-phy0", "pcie-phy1", "pcie-phy2"; + power-domains = <&scpsys MT2701_POWER_DOMAIN_HIF>; + bus-range = <0x00 0xff>; + ranges = <0x81000000 0 0x1a160000 0 0x1a160000 0 0x00010000 /* I/O space */ + 0x83000000 0 0x60000000 0 0x60000000 0 0x10000000>; /* memory space */ + + pcie@0,0 { + device_type = "pci"; + reg = <0x0000 0 0 0 0>; + #address-cells = <3>; + #size-cells = <2>; + #interrupt-cells = <1>; + interrupt-map-mask = <0 0 0 0>; + interrupt-map = <0 0 0 0 &sysirq GIC_SPI 193 IRQ_TYPE_LEVEL_LOW>; + ranges; + num-lanes = <1>; + }; + + pcie@1,0 { + device_type = "pci"; + reg = <0x0800 0 0 0 0>; + #address-cells = <3>; + #size-cells = <2>; + #interrupt-cells = <1>; + interrupt-map-mask = <0 0 0 0>; + interrupt-map = <0 0 0 0 &sysirq GIC_SPI 194 IRQ_TYPE_LEVEL_LOW>; + ranges; + num-lanes = <1>; + }; + + pcie@2,0 { + device_type = "pci"; + reg = <0x1000 0 0 0 0>; + #address-cells = <3>; + #size-cells = <2>; + #interrupt-cells = <1>; + interrupt-map-mask = <0 0 0 0>; + interrupt-map = <0 0 0 0 &sysirq GIC_SPI 195 IRQ_TYPE_LEVEL_LOW>; + ranges; + num-lanes = <1>; + }; + }; diff --git a/Documentation/devicetree/bindings/pci/qcom,pcie.txt b/Documentation/devicetree/bindings/pci/qcom,pcie.txt index e15f9b19901f..9d418b71774f 100644 --- a/Documentation/devicetree/bindings/pci/qcom,pcie.txt +++ b/Documentation/devicetree/bindings/pci/qcom,pcie.txt @@ -8,6 +8,7 @@ - "qcom,pcie-apq8064" for apq8064 - "qcom,pcie-apq8084" for apq8084 - "qcom,pcie-msm8996" for msm8996 or apq8096 + - "qcom,pcie-ipq4019" for ipq4019 - reg: Usage: required @@ -87,7 +88,7 @@ - "core" Clocks the pcie hw block - "phy" Clocks the pcie PHY block - clock-names: - Usage: required for apq8084 + Usage: required for apq8084/ipq4019 Value type: <stringlist> Definition: Should contain the following entries - "aux" Auxiliary (AUX) clock @@ -126,6 +127,23 @@ Definition: Should contain the following entries - "core" Core reset +- reset-names: + Usage: required for ipq/apq8064 + Value type: <stringlist> + Definition: Should contain the following entries + - "axi_m" AXI master reset + - "axi_s" AXI slave reset + - "pipe" PIPE reset + - "axi_m_vmid" VMID reset + - "axi_s_xpu" XPU reset + - "parf" PARF reset + - "phy" PHY reset + - "axi_m_sticky" AXI sticky reset + - "pipe_sticky" PIPE sticky reset + - "pwr" PWR reset + - "ahb" AHB reset + - "phy_ahb" PHY AHB reset + - power-domains: Usage: required for apq8084 and msm8996/apq8096 Value type: <prop-encoded-array> diff --git a/Documentation/devicetree/bindings/pci/rcar-pci.txt b/Documentation/devicetree/bindings/pci/rcar-pci.txt index 34712d6fd253..bd27428dda61 100644 --- a/Documentation/devicetree/bindings/pci/rcar-pci.txt +++ b/Documentation/devicetree/bindings/pci/rcar-pci.txt @@ -1,4 +1,4 @@ -* Renesas RCar PCIe interface +* Renesas R-Car PCIe interface Required properties: compatible: "renesas,pcie-r8a7779" for the R8A7779 SoC; diff --git a/Documentation/devicetree/bindings/pci/tango-pcie.txt b/Documentation/devicetree/bindings/pci/tango-pcie.txt new file mode 100644 index 000000000000..244683836a79 --- /dev/null +++ b/Documentation/devicetree/bindings/pci/tango-pcie.txt @@ -0,0 +1,29 @@ +Sigma Designs Tango PCIe controller + +Required properties: + +- compatible: "sigma,smp8759-pcie" +- reg: address/size of PCI configuration space, address/size of register area +- bus-range: defined by size of PCI configuration space +- device_type: "pci" +- #size-cells: <2> +- #address-cells: <3> +- msi-controller +- ranges: translation from system to bus addresses +- interrupts: spec for misc interrupts, spec for MSI + +Example: + + pcie@2e000 { + compatible = "sigma,smp8759-pcie"; + reg = <0x50000000 0x400000>, <0x2e000 0x100>; + bus-range = <0 3>; + device_type = "pci"; + #size-cells = <2>; + #address-cells = <3>; + msi-controller; + ranges = <0x02000000 0x0 0x00400000 0x50400000 0x0 0x3c00000>; + interrupts = + <54 IRQ_TYPE_LEVEL_HIGH>, /* misc interrupts */ + <55 IRQ_TYPE_LEVEL_HIGH>; /* MSI */ + }; diff --git a/Documentation/devicetree/bindings/phy/bcm-ns-usb3-phy.txt b/Documentation/devicetree/bindings/phy/bcm-ns-usb3-phy.txt index 09aeba94538d..32f057260351 100644 --- a/Documentation/devicetree/bindings/phy/bcm-ns-usb3-phy.txt +++ b/Documentation/devicetree/bindings/phy/bcm-ns-usb3-phy.txt @@ -3,9 +3,10 @@ Driver for Broadcom Northstar USB 3.0 PHY Required properties: - compatible: one of: "brcm,ns-ax-usb3-phy", "brcm,ns-bx-usb3-phy". -- reg: register mappings for DMP (Device Management Plugin) and ChipCommon B - MMI. -- reg-names: "dmp" and "ccb-mii" +- reg: address of MDIO bus device +- usb3-dmp-syscon: phandle to syscon with DMP (Device Management Plugin) + registers +- #phy-cells: must be 0 Initialization of USB 3.0 PHY depends on Northstar version. There are currently three known series: Ax, Bx and Cx. @@ -15,9 +16,19 @@ Known B1: BCM4707 rev 6 Known C0: BCM47094 rev 0 Example: - usb3-phy { - compatible = "brcm,ns-ax-usb3-phy"; - reg = <0x18105000 0x1000>, <0x18003000 0x1000>; - reg-names = "dmp", "ccb-mii"; - #phy-cells = <0>; + mdio: mdio@0 { + reg = <0x0>; + #size-cells = <1>; + #address-cells = <0>; + + usb3-phy@10 { + compatible = "brcm,ns-ax-usb3-phy"; + reg = <0x10>; + usb3-dmp-syscon = <&usb3_dmp>; + #phy-cells = <0>; + }; + }; + + usb3_dmp: syscon@18105000 { + reg = <0x18105000 0x1000>; }; diff --git a/Documentation/devicetree/bindings/phy/brcm,ns2-drd-phy.txt b/Documentation/devicetree/bindings/phy/brcm,ns2-drd-phy.txt new file mode 100644 index 000000000000..04f063aa7883 --- /dev/null +++ b/Documentation/devicetree/bindings/phy/brcm,ns2-drd-phy.txt @@ -0,0 +1,30 @@ +BROADCOM NORTHSTAR2 USB2 (DUAL ROLE DEVICE) PHY + +Required properties: + - compatible: brcm,ns2-drd-phy + - reg: offset and length of the NS2 PHY related registers. + - reg-names + The below registers must be provided. + icfg - for DRD ICFG configurations + rst-ctrl - for DRD IDM reset + crmu-ctrl - for CRMU core vdd, PHY and PHY PLL reset + usb2-strap - for port over current polarity reversal + - #phy-cells: Must be 0. No args required. + - vbus-gpios: vbus gpio binding + - id-gpios: id gpio binding + +Refer to phy/phy-bindings.txt for the generic PHY binding properties + +Example: + usbdrd_phy: phy@66000960 { + #phy-cells = <0>; + compatible = "brcm,ns2-drd-phy"; + reg = <0x66000960 0x24>, + <0x67012800 0x4>, + <0x6501d148 0x4>, + <0x664d0700 0x4>; + reg-names = "icfg", "rst-ctrl", + "crmu-ctrl", "usb2-strap"; + id-gpios = <&gpio_g 30 0>; + vbus-gpios = <&gpio_g 31 0>; + }; diff --git a/Documentation/devicetree/bindings/phy/brcm-sata-phy.txt b/Documentation/devicetree/bindings/phy/brcm-sata-phy.txt index 6ccce09d8bbf..97977cd29a98 100644 --- a/Documentation/devicetree/bindings/phy/brcm-sata-phy.txt +++ b/Documentation/devicetree/bindings/phy/brcm-sata-phy.txt @@ -7,12 +7,13 @@ Required properties: "brcm,iproc-ns2-sata-phy" "brcm,iproc-nsp-sata-phy" "brcm,phy-sata3" + "brcm,iproc-sr-sata-phy" - address-cells: should be 1 - size-cells: should be 0 - reg: register ranges for the PHY PCB interface - reg-names: should be "phy" and "phy-ctrl" The "phy-ctrl" registers are only required for - "brcm,iproc-ns2-sata-phy". + "brcm,iproc-ns2-sata-phy" and "brcm,iproc-sr-sata-phy". Sub-nodes: Each port's PHY should be represented as a sub-node. @@ -23,8 +24,8 @@ Sub-nodes required properties: Sub-nodes optional properties: - brcm,enable-ssc: use spread spectrum clocking (SSC) on this port - This property is not applicable for "brcm,iproc-ns2-sata-phy" and - "brcm,iproc-nsp-sata-phy". + This property is not applicable for "brcm,iproc-ns2-sata-phy", + "brcm,iproc-nsp-sata-phy" and "brcm,iproc-sr-sata-phy". Example: sata-phy@f0458100 { diff --git a/Documentation/devicetree/bindings/phy/meson-gxl-usb2-phy.txt b/Documentation/devicetree/bindings/phy/meson-gxl-usb2-phy.txt new file mode 100644 index 000000000000..a105494a0fc9 --- /dev/null +++ b/Documentation/devicetree/bindings/phy/meson-gxl-usb2-phy.txt @@ -0,0 +1,17 @@ +* Amlogic Meson GXL and GXM USB2 PHY binding + +Required properties: +- compatible: Should be "amlogic,meson-gxl-usb2-phy" +- reg: The base address and length of the registers +- #phys-cells: must be 0 (see phy-bindings.txt in this directory) + +Optional properties: +- phy-supply: see phy-bindings.txt in this directory + + +Example: + usb2_phy0: phy@78000 { + compatible = "amlogic,meson-gxl-usb2-phy"; + #phy-cells = <0>; + reg = <0x0 0x78000 0x0 0x20>; + }; diff --git a/Documentation/devicetree/bindings/phy/meson8b-usb2-phy.txt b/Documentation/devicetree/bindings/phy/meson8b-usb2-phy.txt index 5fa73b9d20f5..d81d73aea608 100644 --- a/Documentation/devicetree/bindings/phy/meson8b-usb2-phy.txt +++ b/Documentation/devicetree/bindings/phy/meson8b-usb2-phy.txt @@ -1,7 +1,8 @@ -* Amlogic Meson8b and GXBB USB2 PHY +* Amlogic Meson8, Meson8b and GXBB USB2 PHY Required properties: - compatible: Depending on the platform this should be one of: + "amlogic,meson8-usb2-phy" "amlogic,meson8b-usb2-phy" "amlogic,meson-gxbb-usb2-phy" - reg: The base address and length of the registers diff --git a/Documentation/devicetree/bindings/phy/phy-cpcap-usb.txt b/Documentation/devicetree/bindings/phy/phy-cpcap-usb.txt new file mode 100644 index 000000000000..2eb9b2b69037 --- /dev/null +++ b/Documentation/devicetree/bindings/phy/phy-cpcap-usb.txt @@ -0,0 +1,40 @@ +Motorola CPCAP PMIC USB PHY binding + +Required properties: +compatible: Shall be either "motorola,cpcap-usb-phy" or + "motorola,mapphone-cpcap-usb-phy" +#phy-cells: Shall be 0 +interrupts: CPCAP PMIC interrupts used by the USB PHY +interrupt-names: Interrupt names +io-channels: IIO ADC channels used by the USB PHY +io-channel-names: IIO ADC channel names +vusb-supply: Regulator for the PHY + +Optional properties: +pinctrl: Optional alternate pin modes for the PHY +pinctrl-names: Names for optional pin modes +mode-gpios: Optional GPIOs for configuring alternate modes + +Example: +cpcap_usb2_phy: phy { + compatible = "motorola,mapphone-cpcap-usb-phy"; + pinctrl-0 = <&usb_gpio_mux_sel1 &usb_gpio_mux_sel2>; + pinctrl-1 = <&usb_ulpi_pins>; + pinctrl-2 = <&usb_utmi_pins>; + pinctrl-3 = <&uart3_pins>; + pinctrl-names = "default", "ulpi", "utmi", "uart"; + #phy-cells = <0>; + interrupts-extended = < + &cpcap 15 0 &cpcap 14 0 &cpcap 28 0 &cpcap 19 0 + &cpcap 18 0 &cpcap 17 0 &cpcap 16 0 &cpcap 49 0 + &cpcap 48 1 + >; + interrupt-names = + "id_ground", "id_float", "se0conn", "vbusvld", + "sessvld", "sessend", "se1", "dm", "dp"; + mode-gpios = <&gpio2 28 GPIO_ACTIVE_HIGH + &gpio1 0 GPIO_ACTIVE_HIGH>; + io-channels = <&cpcap_adc 2>, <&cpcap_adc 7>; + io-channel-names = "vbus", "id"; + vusb-supply = <&vusb>; +}; diff --git a/Documentation/devicetree/bindings/phy/phy-rockchip-inno-usb2.txt b/Documentation/devicetree/bindings/phy/phy-rockchip-inno-usb2.txt index e71a8d23f4a8..84d59b0db8df 100644 --- a/Documentation/devicetree/bindings/phy/phy-rockchip-inno-usb2.txt +++ b/Documentation/devicetree/bindings/phy/phy-rockchip-inno-usb2.txt @@ -2,6 +2,7 @@ ROCKCHIP USB2.0 PHY WITH INNO IP BLOCK Required properties (phy (parent) node): - compatible : should be one of the listed compatibles: + * "rockchip,rk3228-usb2phy" * "rockchip,rk3328-usb2phy" * "rockchip,rk3366-usb2phy" * "rockchip,rk3399-usb2phy" diff --git a/Documentation/devicetree/bindings/phy/rcar-gen3-phy-usb3.txt b/Documentation/devicetree/bindings/phy/rcar-gen3-phy-usb3.txt new file mode 100644 index 000000000000..f94cea48f6b1 --- /dev/null +++ b/Documentation/devicetree/bindings/phy/rcar-gen3-phy-usb3.txt @@ -0,0 +1,46 @@ +* Renesas R-Car generation 3 USB 3.0 PHY + +This file provides information on what the device node for the R-Car generation +3 USB 3.0 PHY contains. +If you want to enable spread spectrum clock (ssc), you should use USB_EXTAL +instead of USB3_CLK. However, if you don't want to these features, you don't +need this driver. + +Required properties: +- compatible: "renesas,r8a7795-usb3-phy" if the device is a part of an R8A7795 + SoC. + "renesas,r8a7796-usb3-phy" if the device is a part of an R8A7796 + SoC. + "renesas,rcar-gen3-usb3-phy" for a generic R-Car Gen3 compatible + device. + + When compatible with the generic version, nodes must list the + SoC-specific version corresponding to the platform first + followed by the generic version. + +- reg: offset and length of the USB 3.0 PHY register block. +- clocks: A list of phandles and clock-specifier pairs. +- clock-names: Name of the clocks. + - The funcional clock must be "usb3-if". + - The usb3's external clock must be "usb3s_clk". + - The usb2's external clock must be "usb_extal". If you want to use the ssc, + the clock-frequency must not be 0. +- #phy-cells: see phy-bindings.txt in the same directory, must be <0>. + +Optional properties: +- renesas,ssc-range: Enable/disable spread spectrum clock (ssc) by using + the following values as u32: + - 0 (or the property doesn't exist): disable the ssc + - 4980: enable the ssc as -4980 ppm + - 4492: enable the ssc as -4492 ppm + - 4003: enable the ssc as -4003 ppm + +Example (R-Car H3): + + usb-phy@e65ee000 { + compatible = "renesas,r8a7795-usb3-phy", + "renesas,rcar-gen3-usb3-phy"; + reg = <0 0xe65ee000 0 0x90>; + clocks = <&cpg CPG_MOD 328>, <&usb3s0_clk>, <&usb_extal>; + clock-names = "usb3-if", "usb3s_clk", "usb_extal"; + }; diff --git a/Documentation/devicetree/bindings/pinctrl/allwinner,sunxi-pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/allwinner,sunxi-pinctrl.txt index b53224473672..6f2ec9af0de2 100644 --- a/Documentation/devicetree/bindings/pinctrl/allwinner,sunxi-pinctrl.txt +++ b/Documentation/devicetree/bindings/pinctrl/allwinner,sunxi-pinctrl.txt @@ -20,8 +20,10 @@ Required properties: "allwinner,sun9i-a80-pinctrl" "allwinner,sun9i-a80-r-pinctrl" "allwinner,sun8i-a83t-pinctrl" + "allwinner,sun8i-a83t-r-pinctrl" "allwinner,sun8i-h3-pinctrl" "allwinner,sun8i-h3-r-pinctrl" + "allwinner,sun8i-r40-pinctrl" "allwinner,sun50i-a64-pinctrl" "allwinner,sun50i-a64-r-pinctrl" "allwinner,sun50i-h5-pinctrl" diff --git a/Documentation/devicetree/bindings/pinctrl/ingenic,pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/ingenic,pinctrl.txt new file mode 100644 index 000000000000..ca313a7aeaff --- /dev/null +++ b/Documentation/devicetree/bindings/pinctrl/ingenic,pinctrl.txt @@ -0,0 +1,41 @@ +Ingenic jz47xx pin controller + +Please refer to pinctrl-bindings.txt in this directory for details of the +common pinctrl bindings used by client devices, including the meaning of the +phrase "pin configuration node". + +For the jz47xx SoCs, pin control is tightly bound with GPIO ports. All pins may +be used as GPIOs, multiplexed device functions are configured within the +GPIO port configuration registers and it is typical to refer to pins using the +naming scheme "PxN" where x is a character identifying the GPIO port with +which the pin is associated and N is an integer from 0 to 31 identifying the +pin within that GPIO port. For example PA0 is the first pin in GPIO port A, and +PB31 is the last pin in GPIO port B. The jz4740 contains 4 GPIO ports, PA to +PD, for a total of 128 pins. The jz4780 contains 6 GPIO ports, PA to PF, for a +total of 192 pins. + + +Required properties: +-------------------- + + - compatible: One of: + - "ingenic,jz4740-pinctrl" + - "ingenic,jz4770-pinctrl" + - "ingenic,jz4780-pinctrl" + - reg: Address range of the pinctrl registers. + + +GPIO sub-nodes +-------------- + +The pinctrl node can have optional sub-nodes for the Ingenic GPIO driver; +please refer to ../gpio/ingenic,gpio.txt. + + +Example: +-------- + +pinctrl: pin-controller@10010000 { + compatible = "ingenic,jz4740-pinctrl"; + reg = <0x10010000 0x400>; +}; diff --git a/Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt b/Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt index f01d154090da..62d0f33fa65e 100644 --- a/Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt +++ b/Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt @@ -204,21 +204,22 @@ each single pin the number of required sub-nodes containing "pin" and maintain. For cases like this, the pin controller driver may use the pinmux helper -property, where the pin identifier is packed with mux configuration settings -in a single integer. +property, where the pin identifier is provided with mux configuration settings +in a pinmux group. A pinmux group consists of the pin identifier and mux +settings represented as a single integer or an array of integers. -The pinmux property accepts an array of integers, each of them describing +The pinmux property accepts an array of pinmux groups, each of them describing a single pin multiplexing configuration. pincontroller { state_0_node_a { - pinmux = <PIN_ID_AND_MUX>, <PIN_ID_AND_MUX>, ...; + pinmux = <PINMUX_GROUP>, <PINMUX_GROUP>, ...; }; }; Each individual pin controller driver bindings documentation shall specify -how those values (pin IDs and pin multiplexing configuration) are defined and -assembled together. +how pin IDs and pin multiplexing configuration are defined and assembled +together in a pinmux group. == Generic pin configuration node content == @@ -251,14 +252,20 @@ drive-push-pull - drive actively high and low drive-open-drain - drive with open drain drive-open-source - drive with open source drive-strength - sink or source at most X mA -input-enable - enable input on pin (no effect on output) -input-disable - disable input on pin (no effect on output) +input-enable - enable input on pin (no effect on output, such as + enabling an input buffer) +input-disable - disable input on pin (no effect on output, such as + disabling an input buffer) input-schmitt-enable - enable schmitt-trigger mode input-schmitt-disable - disable schmitt-trigger mode input-debounce - debounce mode with debound time X power-source - select between different power supplies low-power-enable - enable low power mode low-power-disable - disable low power mode +output-disable - disable output on a pin (such as disable an output + buffer) +output-enable - enable output on a pin without actively driving it + (such as enabling an output buffer) output-low - set the pin to output mode with low level output-high - set the pin to output mode with high level slew-rate - set the slew rate @@ -300,7 +307,7 @@ arguments are described below. - pinmux takes a list of pin IDs and mux settings as required argument. The specific bindings for the hardware defines: - How pin IDs and mux settings are defined and assembled together in a single - integer. + integer or an array of integers. - bias-pull-up, -down and -pin-default take as optional argument on hardware supporting it the pull strength in Ohm. bias-disable will disable the pull. diff --git a/Documentation/devicetree/bindings/pinctrl/pinctrl-zx.txt b/Documentation/devicetree/bindings/pinctrl/pinctrl-zx.txt new file mode 100644 index 000000000000..e219849b21ca --- /dev/null +++ b/Documentation/devicetree/bindings/pinctrl/pinctrl-zx.txt @@ -0,0 +1,85 @@ +* ZTE ZX Pin Controller + +The pin controller on ZTE ZX platforms is kinda of hybrid. It consists of +a main controller and an auxiliary one. For example, on ZX296718 SoC, the +main controller is TOP_PMM and the auxiliary one is AON_IOCFG. Both +controllers work together to control pin multiplexing and configuration in +the way illustrated as below. + + + GMII_RXD3 ---+ + | + DVI1_HS ---+----------------------------- GMII_RXD3 (TOP pin) + | + BGPIO16 ---+ ^ + | pinconf + ^ | + | pinmux | + | | + + TOP_PMM (main) AON_IOCFG (aux) + + | | | + | pinmux | | + | pinmux v | + v | pinconf + KEY_ROW2 ---+ v + PORT1_LCD_TE ---+ | + | AGPIO10 ---+------ KEY_ROW2 (AON pin) + I2S0_DOUT3 ---+ | + |-----------------------+ + PWM_OUT3 ---+ + | + VGA_VS1 ---+ + + +For most of pins like GMII_RXD3 in the figure, the pinmux function is +controlled by TOP_PMM block only, and this type of pins are meant by term +'TOP pins'. For pins like KEY_ROW2, the pinmux is controlled by both +TOP_PMM and AON_IOCFG blocks, as the available multiplexing functions for +the pin spread in both controllers. This type of pins are called 'AON pins'. +Though pinmux implementation is quite different, pinconf is same for both +types of pins. Both are controlled by auxiliary controller, i.e. AON_IOCFG +on ZX296718. + +Required properties: +- compatible: should be "zte,zx296718-pmm". +- reg: the register physical address and length. +- zte,auxiliary-controller: phandle to the auxiliary pin controller which + implements pinmux for AON pins and pinconf for all pins. + +The following pin configuration are supported. Please refer to +pinctrl-bindings.txt in this directory for more details of the common +pinctrl bindings used by client devices. + +- bias-pull-up +- bias-pull-down +- drive-strength +- input-enable +- slew-rate + +Examples: + +iocfg: pin-controller@119000 { + compatible = "zte,zx296718-iocfg"; + reg = <0x119000 0x1000>; +}; + +pmm: pin-controller@1462000 { + compatible = "zte,zx296718-pmm"; + reg = <0x1462000 0x1000>; + zte,auxiliary-controller = <&iocfg>; +}; + +&pmm { + vga_pins: vga { + pins = "KEY_COL1", "KEY_COL2", "KEY_ROW1", "KEY_ROW2"; + function = "VGA"; + }; +}; + +&vga { + pinctrl-names = "default"; + pinctrl-0 = <&vga_pins>; + status = "okay"; +}; diff --git a/Documentation/devicetree/bindings/pinctrl/qcom,ipq8074-pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/qcom,ipq8074-pinctrl.txt new file mode 100644 index 000000000000..407b9443629d --- /dev/null +++ b/Documentation/devicetree/bindings/pinctrl/qcom,ipq8074-pinctrl.txt @@ -0,0 +1,172 @@ +Qualcomm Technologies, Inc. IPQ8074 TLMM block + +This binding describes the Top Level Mode Multiplexer block found in the +IPQ8074 platform. + +- compatible: + Usage: required + Value type: <string> + Definition: must be "qcom,ipq8074-pinctrl" + +- reg: + Usage: required + Value type: <prop-encoded-array> + Definition: the base address and size of the TLMM register space. + +- interrupts: + Usage: required + Value type: <prop-encoded-array> + Definition: should specify the TLMM summary IRQ. + +- interrupt-controller: + Usage: required + Value type: <none> + Definition: identifies this node as an interrupt controller + +- #interrupt-cells: + Usage: required + Value type: <u32> + Definition: must be 2. Specifying the pin number and flags, as defined + in <dt-bindings/interrupt-controller/irq.h> + +- gpio-controller: + Usage: required + Value type: <none> + Definition: identifies this node as a gpio controller + +- #gpio-cells: + Usage: required + Value type: <u32> + Definition: must be 2. Specifying the pin number and flags, as defined + in <dt-bindings/gpio/gpio.h> + +Please refer to ../gpio/gpio.txt and ../interrupt-controller/interrupts.txt for +a general description of GPIO and interrupt bindings. + +Please refer to pinctrl-bindings.txt in this directory for details of the +common pinctrl bindings used by client devices, including the meaning of the +phrase "pin configuration node". + +The pin configuration nodes act as a container for an arbitrary number of +subnodes. Each of these subnodes represents some desired configuration for a +pin, a group, or a list of pins or groups. This configuration can include the +mux function to select on those pin(s)/group(s), and various pin configuration +parameters, such as pull-up, drive strength, etc. + + +PIN CONFIGURATION NODES: + +The name of each subnode is not important; all subnodes should be enumerated +and processed purely based on their content. + +Each subnode only affects those parameters that are explicitly listed. In +other words, a subnode that lists a mux function but no pin configuration +parameters implies no information about any pin configuration parameters. +Similarly, a pin subnode that describes a pullup parameter implies no +information about e.g. the mux function. + + +The following generic properties as defined in pinctrl-bindings.txt are valid +to specify in a pin configuration subnode: + +- pins: + Usage: required + Value type: <string-array> + Definition: List of gpio pins affected by the properties specified in + this subnode. Valid pins are: + gpio0-gpio69 + +- function: + Usage: required + Value type: <string> + Definition: Specify the alternative function to be configured for the + specified pins. Functions are only valid for gpio pins. + Valid values are: + atest_char, atest_char0, atest_char1, atest_char2, + atest_char3, audio_rxbclk, audio_rxd, audio_rxfsync, + audio_rxmclk, audio_txbclk, audio_txd, audio_txfsync, + audio_txmclk, blsp0_i2c, blsp0_spi, blsp0_uart, blsp1_i2c, + blsp1_spi, blsp1_uart, blsp2_i2c, blsp2_spi, blsp2_uart, + blsp3_i2c, blsp3_spi, blsp3_spi0, blsp3_spi1, blsp3_spi2, + blsp3_spi3, blsp3_uart, blsp4_i2c0, blsp4_i2c1, blsp4_spi0, + blsp4_spi1, blsp4_uart0, blsp4_uart1, blsp5_i2c, blsp5_spi, + blsp5_uart, burn0, burn1, cri_trng, cri_trng0, cri_trng1, + cxc0, cxc1, dbg_out, gcc_plltest, gcc_tlmm, gpio, ldo_en, + ldo_update, led0, led1, led2, mac0_sa0, mac0_sa1, mac1_sa0, + mac1_sa1, mac1_sa2, mac1_sa3, mac2_sa0, mac2_sa1, mdc, + mdio, pcie0_clk, pcie0_rst, pcie0_wake, pcie1_clk, + pcie1_rst, pcie1_wake, pcm_drx, pcm_dtx, pcm_fsync, + pcm_pclk, pcm_zsi0, pcm_zsi1, prng_rosc, pta1_0, pta1_1, + pta1_2, pta2_0, pta2_1, pta2_2, pwm0, pwm1, pwm2, pwm3, + qdss_cti_trig_in_a0, qdss_cti_trig_in_a1, + qdss_cti_trig_in_b0, qdss_cti_trig_in_b1, + qdss_cti_trig_out_a0, qdss_cti_trig_out_a1, + qdss_cti_trig_out_b0, qdss_cti_trig_out_b1, + qdss_traceclk_a, qdss_traceclk_b, qdss_tracectl_a, + qdss_tracectl_b, qdss_tracedata_a, qdss_tracedata_b, + qpic, rx0, rx1, rx2, sd_card, sd_write, tsens_max, wci2a, + wci2b, wci2c, wci2d + +- bias-disable: + Usage: optional + Value type: <none> + Definition: The specified pins should be configued as no pull. + +- bias-pull-down: + Usage: optional + Value type: <none> + Definition: The specified pins should be configued as pull down. + +- bias-pull-up: + Usage: optional + Value type: <none> + Definition: The specified pins should be configued as pull up. + +- output-high: + Usage: optional + Value type: <none> + Definition: The specified pins are configured in output mode, driven + high. + +- output-low: + Usage: optional + Value type: <none> + Definition: The specified pins are configured in output mode, driven + low. + +- drive-strength: + Usage: optional + Value type: <u32> + Definition: Selects the drive strength for the specified pins, in mA. + Valid values are: 2, 4, 6, 8, 10, 12, 14 and 16 + +Example: + + tlmm: pinctrl@1000000 { + compatible = "qcom,ipq8074-pinctrl"; + reg = <0x1000000 0x300000>; + interrupts = <GIC_SPI 208 IRQ_TYPE_LEVEL_HIGH>; + gpio-controller; + #gpio-cells = <2>; + interrupt-controller; + #interrupt-cells = <2>; + + uart2: uart2-default { + mux { + pins = "gpio23", "gpio24"; + function = "blsp4_uart1"; + }; + + rx { + pins = "gpio23"; + drive-strength = <4>; + bias-disable; + }; + + tx { + pins = "gpio24"; + drive-strength = <2>; + bias-pull-up; + }; + }; + }; diff --git a/Documentation/devicetree/bindings/pinctrl/renesas,pfc-pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/renesas,pfc-pinctrl.txt index 13df9498311a..645082f03259 100644 --- a/Documentation/devicetree/bindings/pinctrl/renesas,pfc-pinctrl.txt +++ b/Documentation/devicetree/bindings/pinctrl/renesas,pfc-pinctrl.txt @@ -13,6 +13,8 @@ Required Properties: - "renesas,pfc-emev2": for EMEV2 (EMMA Mobile EV2) compatible pin-controller. - "renesas,pfc-r8a73a4": for R8A73A4 (R-Mobile APE6) compatible pin-controller. - "renesas,pfc-r8a7740": for R8A7740 (R-Mobile A1) compatible pin-controller. + - "renesas,pfc-r8a7743": for R8A7743 (RZ/G1M) compatible pin-controller. + - "renesas,pfc-r8a7745": for R8A7745 (RZ/G1E) compatible pin-controller. - "renesas,pfc-r8a7778": for R8A7778 (R-Mobile M1) compatible pin-controller. - "renesas,pfc-r8a7779": for R8A7779 (R-Car H1) compatible pin-controller. - "renesas,pfc-r8a7790": for R8A7790 (R-Car H2) compatible pin-controller. diff --git a/Documentation/devicetree/bindings/pinctrl/renesas,rza1-pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/renesas,rza1-pinctrl.txt new file mode 100644 index 000000000000..43e21474528a --- /dev/null +++ b/Documentation/devicetree/bindings/pinctrl/renesas,rza1-pinctrl.txt @@ -0,0 +1,221 @@ +Renesas RZ/A1 combined Pin and GPIO controller + +The Renesas SoCs of the RZ/A1 family feature a combined Pin and GPIO controller, +named "Ports" in the hardware reference manual. +Pin multiplexing and GPIO configuration is performed on a per-pin basis +writing configuration values to per-port register sets. +Each "port" features up to 16 pins, each of them configurable for GPIO +function (port mode) or in alternate function mode. +Up to 8 different alternate function modes exist for each single pin. + +Pin controller node +------------------- + +Required properties: + - compatible + this shall be "renesas,r7s72100-ports". + + - reg + address base and length of the memory area where the pin controller + hardware is mapped to. + +Example: +Pin controller node for RZ/A1H SoC (r7s72100) + +pinctrl: pin-controller@fcfe3000 { + compatible = "renesas,r7s72100-ports"; + + reg = <0xfcfe3000 0x4230>; +}; + +Sub-nodes +--------- + +The child nodes of the pin controller node describe a pin multiplexing +function or a GPIO controller alternatively. + +- Pin multiplexing sub-nodes: + A pin multiplexing sub-node describes how to configure a set of + (or a single) pin in some desired alternate function mode. + A single sub-node may define several pin configurations. + A few alternate function require special pin configuration flags to be + supplied along with the alternate function configuration number. + The hardware reference manual specifies when a pin function requires + "software IO driven" mode to be specified. To do so use the generic + properties from the <include/linux/pinctrl/pinconf_generic.h> header file + to instruct the pin controller to perform the desired pin configuration + operation. + Please refer to pinctrl-bindings.txt to get to know more on generic + pin properties usage. + + The allowed generic formats for a pin multiplexing sub-node are the + following ones: + + node-1 { + pinmux = <PIN_ID_AND_MUX>, <PIN_ID_AND_MUX>, ... ; + GENERIC_PINCONFIG; + }; + + node-2 { + sub-node-1 { + pinmux = <PIN_ID_AND_MUX>, <PIN_ID_AND_MUX>, ... ; + GENERIC_PINCONFIG; + }; + + sub-node-2 { + pinmux = <PIN_ID_AND_MUX>, <PIN_ID_AND_MUX>, ... ; + GENERIC_PINCONFIG; + }; + + ... + + sub-node-n { + pinmux = <PIN_ID_AND_MUX>, <PIN_ID_AND_MUX>, ... ; + GENERIC_PINCONFIG; + }; + }; + + Use the second format when pins part of the same logical group need to have + different generic pin configuration flags applied. + + Client sub-nodes shall refer to pin multiplexing sub-nodes using the phandle + of the most external one. + + Eg. + + client-1 { + ... + pinctrl-0 = <&node-1>; + ... + }; + + client-2 { + ... + pinctrl-0 = <&node-2>; + ... + }; + + Required properties: + - pinmux: + integer array representing pin number and pin multiplexing configuration. + When a pin has to be configured in alternate function mode, use this + property to identify the pin by its global index, and provide its + alternate function configuration number along with it. + When multiple pins are required to be configured as part of the same + alternate function they shall be specified as members of the same + argument list of a single "pinmux" property. + Helper macros to ease assembling the pin index from its position + (port where it sits on and pin number) and alternate function identifier + are provided by the pin controller header file at: + <include/dt-bindings/pinctrl/r7s72100-pinctrl.h> + Integers values in "pinmux" argument list are assembled as: + ((PORT * 16 + PIN) | MUX_FUNC << 16) + + Optional generic properties: + - input-enable: + enable input bufer for pins requiring software driven IO input + operations. + - output-high: + enable output buffer for pins requiring software driven IO output + operations. output-low can be used alternatively, as line value is + ignored by the driver. + + The hardware reference manual specifies when a pin has to be configured to + work in bi-directional mode and when the IO direction has to be specified + by software. Bi-directional pins are managed by the pin controller driver + internally, while software driven IO direction has to be explicitly + selected when multiple options are available. + + Example: + A serial communication interface with a TX output pin and an RX input pin. + + &pinctrl { + scif2_pins: serial2 { + pinmux = <RZA1_PINMUX(3, 0, 6)>, <RZA1_PINMUX(3, 2, 4)>; + }; + }; + + Pin #0 on port #3 is configured as alternate function #6. + Pin #2 on port #3 is configured as alternate function #4. + + Example 2: + I2c master: both SDA and SCL pins need bi-directional operations + + &pinctrl { + i2c2_pins: i2c2 { + pinmux = <RZA1_PINMUX(1, 4, 1)>, <RZA1_PINMUX(1, 5, 1)>; + }; + }; + + Pin #4 on port #1 is configured as alternate function #1. + Pin #5 on port #1 is configured as alternate function #1. + Both need to work in bi-directional mode, the driver manages this internally. + + Example 3: + Multi-function timer input and output compare pins. + Configure TIOC0A as software driven input and TIOC0B as software driven + output. + + &pinctrl { + tioc0_pins: tioc0 { + tioc0_input_pins { + pinumx = <RZA1_PINMUX(4, 0, 2)>; + input-enable; + }; + + tioc0_output_pins { + pinmux = <RZA1_PINMUX(4, 1, 1)>; + output-enable; + }; + }; + }; + + &tioc0 { + ... + pinctrl-0 = <&tioc0_pins>; + ... + }; + + Pin #0 on port #4 is configured as alternate function #2 with IO direction + specified by software as input. + Pin #1 on port #4 is configured as alternate function #1 with IO direction + specified by software as output. + +- GPIO controller sub-nodes: + Each port of the r7s72100 pin controller hardware is itself a GPIO controller. + Different SoCs have different numbers of available pins per port, but + generally speaking, each of them can be configured in GPIO ("port") mode + on this hardware. + Describe GPIO controllers using sub-nodes with the following properties. + + Required properties: + - gpio-controller + empty property as defined by the GPIO bindings documentation. + - #gpio-cells + number of cells required to identify and configure a GPIO. + Shall be 2. + - gpio-ranges + Describes a GPIO controller specifying its specific pin base, the pin + base in the global pin numbering space, and the number of controlled + pins, as defined by the GPIO bindings documentation. Refer to + Documentation/devicetree/bindings/gpio/gpio.txt file for a more detailed + description. + + Example: + A GPIO controller node, controlling 16 pins indexed from 0. + The GPIO controller base in the global pin indexing space is pin 48, thus + pins [0 - 15] on this controller map to pins [48 - 63] in the global pin + indexing space. + + port3: gpio-3 { + gpio-controller; + #gpio-cells = <2>; + gpio-ranges = <&pinctrl 0 48 16>; + }; + + A device node willing to use pins controlled by this GPIO controller, shall + refer to it as follows: + + led1 { + gpios = <&port3 10 GPIO_ACTIVE_LOW>; + }; diff --git a/Documentation/devicetree/bindings/power/actions,owl-sps.txt b/Documentation/devicetree/bindings/power/actions,owl-sps.txt new file mode 100644 index 000000000000..007b9a7ae723 --- /dev/null +++ b/Documentation/devicetree/bindings/power/actions,owl-sps.txt @@ -0,0 +1,17 @@ +Actions Semi Owl Smart Power System (SPS) + +Required properties: +- compatible : "actions,s500-sps" for S500 +- reg : Offset and length of the register set for the device. +- #power-domain-cells : Must be 1. + See macros in: + include/dt-bindings/power/owl-s500-powergate.h for S500 + + +Example: + + sps: power-controller@b01b0100 { + compatible = "actions,s500-sps"; + reg = <0xb01b0100 0x100>; + #power-domain-cells = <1>; + }; diff --git a/Documentation/devicetree/bindings/power/rockchip-io-domain.txt b/Documentation/devicetree/bindings/power/rockchip-io-domain.txt index d3a5a93a65cd..43c21fb04564 100644 --- a/Documentation/devicetree/bindings/power/rockchip-io-domain.txt +++ b/Documentation/devicetree/bindings/power/rockchip-io-domain.txt @@ -32,6 +32,7 @@ SoC is on the same page. Required properties: - compatible: should be one of: - "rockchip,rk3188-io-voltage-domain" for rk3188 + - "rockchip,rk3228-io-voltage-domain" for rk3228 - "rockchip,rk3288-io-voltage-domain" for rk3288 - "rockchip,rk3328-io-voltage-domain" for rk3328 - "rockchip,rk3368-io-voltage-domain" for rk3368 @@ -59,6 +60,12 @@ Possible supplies for rk3188: - vccio1-supply: The supply connected to VCCIO1. Sometimes also labeled VCCIO1 and VCCIO2. +Possible supplies for rk3228: +- vccio1-supply: The supply connected to VCCIO1. +- vccio2-supply: The supply connected to VCCIO2. +- vccio3-supply: The supply connected to VCCIO3. +- vccio4-supply: The supply connected to VCCIO4. + Possible supplies for rk3288: - audio-supply: The supply connected to APIO4_VDD. - bb-supply: The supply connected to APIO5_VDD. diff --git a/Documentation/devicetree/bindings/power/supply/battery.txt b/Documentation/devicetree/bindings/power/supply/battery.txt new file mode 100644 index 000000000000..f4d3b4a10b43 --- /dev/null +++ b/Documentation/devicetree/bindings/power/supply/battery.txt @@ -0,0 +1,57 @@ +Battery Characteristics + +The devicetree battery node provides static battery characteristics. +In smart batteries, these are typically stored in non-volatile memory +on a fuel gauge chip. The battery node should be used where there is +no appropriate non-volatile memory, or it is unprogrammed/incorrect. + +Upstream dts files should not include battery nodes, unless the battery +represented cannot easily be replaced in the system by one of a +different type. This prevents unpredictable, potentially harmful, +behavior should a replacement that changes the battery type occur +without a corresponding update to the dtb. + +Required Properties: + - compatible: Must be "simple-battery" + +Optional Properties: + - voltage-min-design-microvolt: drained battery voltage + - energy-full-design-microwatt-hours: battery design energy + - charge-full-design-microamp-hours: battery design capacity + - precharge-current-microamp: current for pre-charge phase + - charge-term-current-microamp: current for charge termination phase + - constant-charge-current-max-microamp: maximum constant input current + - constant-charge-voltage-max-microvolt: maximum constant input voltage + +Battery properties are named, where possible, for the corresponding +elements in enum power_supply_property, defined in +https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/power_supply.h + +Batteries must be referenced by chargers and/or fuel-gauges +using a phandle. The phandle's property should be named +"monitored-battery". + +Example: + + bat: battery { + compatible = "simple-battery"; + voltage-min-design-microvolt = <3200000>; + energy-full-design-microwatt-hours = <5290000>; + charge-full-design-microamp-hours = <1430000>; + precharge-current-microamp = <256000>; + charge-term-current-microamp = <128000>; + constant-charge-current-max-microamp = <900000>; + constant-charge-voltage-max-microvolt = <4200000>; + }; + + charger: charger@11 { + .... + monitored-battery = <&bat>; + ... + }; + + fuel_gauge: fuel-gauge@22 { + .... + monitored-battery = <&bat>; + ... + }; diff --git a/Documentation/devicetree/bindings/power/supply/bq27xxx.txt b/Documentation/devicetree/bindings/power/supply/bq27xxx.txt index b0c95ef63e68..6858e1a804ad 100644 --- a/Documentation/devicetree/bindings/power/supply/bq27xxx.txt +++ b/Documentation/devicetree/bindings/power/supply/bq27xxx.txt @@ -1,7 +1,7 @@ -Binding for TI BQ27XXX fuel gauge family +TI BQ27XXX fuel gauge family Required properties: -- compatible: Should contain one of the following: +- compatible: contains one of the following: * "ti,bq27200" - BQ27200 * "ti,bq27210" - BQ27210 * "ti,bq27500" - deprecated, use revision specific property below @@ -26,11 +26,28 @@ Required properties: * "ti,bq27425" - BQ27425 * "ti,bq27441" - BQ27441 * "ti,bq27621" - BQ27621 -- reg: integer, i2c address of the device. +- reg: integer, I2C address of the fuel gauge. + +Optional properties: +- monitored-battery: phandle of battery characteristics node + The fuel gauge uses the following battery properties: + + energy-full-design-microwatt-hours + + charge-full-design-microamp-hours + + voltage-min-design-microvolt + Both or neither of the *-full-design-*-hours properties must be set. + See Documentation/devicetree/bindings/power/supply/battery.txt Example: -bq27510g3 { - compatible = "ti,bq27510g3"; - reg = <0x55>; -}; + bat: battery { + compatible = "simple-battery"; + voltage-min-design-microvolt = <3200000>; + energy-full-design-microwatt-hours = <5290000>; + charge-full-design-microamp-hours = <1430000>; + }; + + bq27510g3: fuel-gauge@55 { + compatible = "ti,bq27510g3"; + reg = <0x55>; + monitored-battery = <&bat>; + }; diff --git a/Documentation/devicetree/bindings/power/supply/cpcap-battery.txt b/Documentation/devicetree/bindings/power/supply/cpcap-battery.txt new file mode 100644 index 000000000000..a04efa22da01 --- /dev/null +++ b/Documentation/devicetree/bindings/power/supply/cpcap-battery.txt @@ -0,0 +1,31 @@ +Motorola CPCAP PMIC battery driver binding + +Required properties: +- compatible: Shall be "motorola,cpcap-battery" +- interrupts: Interrupt specifier for each name in interrupt-names +- interrupt-names: Should contain the following entries: + "lowbph", "lowbpl", "chrgcurr1", "battdetb" +- io-channels: IIO ADC channel specifier for each name in io-channel-names +- io-channel-names: Should contain the following entries: + "battdetb", "battp", "chg_isense", "batti" +- power-supplies: List of phandles for power-supplying devices, as + described in power_supply.txt. Typically a reference + to cpcap_charger. + +Example: + +cpcap_battery: battery { + compatible = "motorola,cpcap-battery"; + interrupts-extended = < + &cpcap 5 0 &cpcap 3 0 + &cpcap 20 0 &cpcap 54 0 + >; + interrupt-names = + "lowbph", "lowbpl", + "chrgcurr1", "battdetb"; + io-channels = <&cpcap_adc 0 &cpcap_adc 1 + &cpcap_adc 5 &cpcap_adc 6>; + io-channel-names = "battdetb", "battp", + "chg_isense", "batti"; + power-supplies = <&cpcap_charger>; +}; diff --git a/Documentation/devicetree/bindings/power/supply/ltc3651-charger.txt b/Documentation/devicetree/bindings/power/supply/ltc3651-charger.txt new file mode 100644 index 000000000000..71f2840e8209 --- /dev/null +++ b/Documentation/devicetree/bindings/power/supply/ltc3651-charger.txt @@ -0,0 +1,27 @@ +ltc3651-charger + +Required properties: + - compatible: "lltc,ltc3651-charger" + - lltc,acpr-gpios: Connect to ACPR output. See remark below. + +Optional properties: + - lltc,fault-gpios: Connect to FAULT output. See remark below. + - lltc,chrg-gpios: Connect to CHRG output. See remark below. + +The ltc3651 outputs are open-drain type and active low. The driver assumes the +GPIO reports "active" when the output is asserted, so if the pins have been +connected directly, the GPIO flags should be set to active low also. + +The driver will attempt to aquire interrupts for all GPIOs to detect changes in +line state. If the system is not capabale of providing interrupts, the driver +cannot report changes and userspace will need to periodically read the sysfs +attributes to detect changes. + +Example: + + charger: battery-charger { + compatible = "lltc,ltc3651-charger"; + lltc,acpr-gpios = <&gpio0 68 GPIO_ACTIVE_LOW>; + lltc,fault-gpios = <&gpio0 64 GPIO_ACTIVE_LOW>; + lltc,chrg-gpios = <&gpio0 63 GPIO_ACTIVE_LOW>; + }; diff --git a/Documentation/devicetree/bindings/power/max8903-charger.txt b/Documentation/devicetree/bindings/power/supply/max8903-charger.txt index f0f4e12b076e..f0f4e12b076e 100644 --- a/Documentation/devicetree/bindings/power/max8903-charger.txt +++ b/Documentation/devicetree/bindings/power/supply/max8903-charger.txt diff --git a/Documentation/devicetree/bindings/power_supply/maxim,max14656.txt b/Documentation/devicetree/bindings/power/supply/maxim,max14656.txt index e03e85ae6572..e03e85ae6572 100644 --- a/Documentation/devicetree/bindings/power_supply/maxim,max14656.txt +++ b/Documentation/devicetree/bindings/power/supply/maxim,max14656.txt diff --git a/Documentation/devicetree/bindings/powerpc/fsl/cpus.txt b/Documentation/devicetree/bindings/powerpc/fsl/cpus.txt index f8cd2397aa04..d63ab1dec16d 100644 --- a/Documentation/devicetree/bindings/powerpc/fsl/cpus.txt +++ b/Documentation/devicetree/bindings/powerpc/fsl/cpus.txt @@ -3,10 +3,10 @@ Power Architecture CPU Binding Copyright 2013 Freescale Semiconductor Inc. Power Architecture CPUs in Freescale SOCs are represented in device trees as -per the definition in ePAPR. +per the definition in the Devicetree Specification. -In addition to the ePAPR definitions, the properties defined below may be -present on CPU nodes. +In addition to the the Devicetree Specification definitions, the properties +defined below may be present on CPU nodes. PROPERTIES diff --git a/Documentation/devicetree/bindings/powerpc/fsl/l2cache.txt b/Documentation/devicetree/bindings/powerpc/fsl/l2cache.txt index dc9bb3182525..8a70696395a7 100644 --- a/Documentation/devicetree/bindings/powerpc/fsl/l2cache.txt +++ b/Documentation/devicetree/bindings/powerpc/fsl/l2cache.txt @@ -1,7 +1,7 @@ Freescale L2 Cache Controller L2 cache is present in Freescale's QorIQ and QorIQ Qonverge platforms. -The cache bindings explained below are ePAPR compliant +The cache bindings explained below are Devicetree Specification compliant Required Properties: diff --git a/Documentation/devicetree/bindings/powerpc/fsl/srio-rmu.txt b/Documentation/devicetree/bindings/powerpc/fsl/srio-rmu.txt index b9a8a2bcfae7..0496ada4bba4 100644 --- a/Documentation/devicetree/bindings/powerpc/fsl/srio-rmu.txt +++ b/Documentation/devicetree/bindings/powerpc/fsl/srio-rmu.txt @@ -124,8 +124,8 @@ Port-Write Unit: A single IRQ that handles port-write conditions is specified by this property. (Typically shared with error). - Note: All other standard properties (see the ePAPR) are allowed - but are optional. + Note: All other standard properties (see the Devicetree Specification) + are allowed but are optional. Example: rmu: rmu@d3000 { diff --git a/Documentation/devicetree/bindings/powerpc/fsl/srio.txt b/Documentation/devicetree/bindings/powerpc/fsl/srio.txt index 07abf0f2f440..86ee6ea73754 100644 --- a/Documentation/devicetree/bindings/powerpc/fsl/srio.txt +++ b/Documentation/devicetree/bindings/powerpc/fsl/srio.txt @@ -72,7 +72,8 @@ the following properties: represents the LIODN associated with maintenance transactions for the port. -Note: All other standard properties (see ePAPR) are allowed but are optional. +Note: All other standard properties (see the Devicetree Specification) +are allowed but are optional. Example: diff --git a/Documentation/devicetree/bindings/property-units.txt b/Documentation/devicetree/bindings/property-units.txt index 7c9f6ee918f1..45ce054d844d 100644 --- a/Documentation/devicetree/bindings/property-units.txt +++ b/Documentation/devicetree/bindings/property-units.txt @@ -25,8 +25,10 @@ Distance Electricity ---------------------------------------- -microamp : micro amps +-microamp-hours : micro amp-hours -ohms : Ohms -micro-ohms : micro Ohms +-microwatt-hours: micro Watt-hours -microvolt : micro volts -picofarads : picofarads diff --git a/Documentation/devicetree/bindings/ptp/brcm,ptp-dte.txt b/Documentation/devicetree/bindings/ptp/brcm,ptp-dte.txt new file mode 100644 index 000000000000..07590bcdad15 --- /dev/null +++ b/Documentation/devicetree/bindings/ptp/brcm,ptp-dte.txt @@ -0,0 +1,13 @@ +* Broadcom Digital Timing Engine(DTE) based PTP clock driver + +Required properties: +- compatible: should be "brcm,ptp-dte" +- reg: address and length of the DTE block's NCO registers + +Example: + +ptp_dte: ptp_dte@180af650 { + compatible = "brcm,ptp-dte"; + reg = <0x180af650 0x10>; + status = "okay"; +}; diff --git a/Documentation/devicetree/bindings/pwm/pwm-meson.txt b/Documentation/devicetree/bindings/pwm/pwm-meson.txt index 5376a4468cb6..5b07bebbf6f7 100644 --- a/Documentation/devicetree/bindings/pwm/pwm-meson.txt +++ b/Documentation/devicetree/bindings/pwm/pwm-meson.txt @@ -2,7 +2,9 @@ Amlogic Meson PWM Controller ============================ Required properties: -- compatible: Shall contain "amlogic,meson8b-pwm" or "amlogic,meson-gxbb-pwm". +- compatible: Shall contain "amlogic,meson8b-pwm" + or "amlogic,meson-gxbb-pwm" + or "amlogic,meson-gxbb-ao-pwm" - #pwm-cells: Should be 3. See pwm.txt in this directory for a description of the cells format. diff --git a/Documentation/devicetree/bindings/pwm/pwm-stm32.txt b/Documentation/devicetree/bindings/pwm/pwm-stm32.txt index 6dd040363e5e..3e6d55018d7a 100644 --- a/Documentation/devicetree/bindings/pwm/pwm-stm32.txt +++ b/Documentation/devicetree/bindings/pwm/pwm-stm32.txt @@ -24,7 +24,7 @@ Example: compatible = "st,stm32-timers"; reg = <0x40010000 0x400>; clocks = <&rcc 0 160>; - clock-names = "clk_int"; + clock-names = "int"; pwm { compatible = "st,stm32-pwm"; diff --git a/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt b/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt index d6de64335022..7e94b802395d 100644 --- a/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt +++ b/Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.txt @@ -8,6 +8,7 @@ Required Properties: - "renesas,pwm-r8a7791": for R-Car M2-W - "renesas,pwm-r8a7794": for R-Car E2 - "renesas,pwm-r8a7795": for R-Car H3 + - "renesas,pwm-r8a7796": for R-Car M3-W - reg: base address and length of the registers block for the PWM. - #pwm-cells: should be 2. See pwm.txt in this directory for a description of the cells format. diff --git a/Documentation/devicetree/bindings/regulator/regulator.txt b/Documentation/devicetree/bindings/regulator/regulator.txt index d18edb075e1c..378f6dc8b8bd 100644 --- a/Documentation/devicetree/bindings/regulator/regulator.txt +++ b/Documentation/devicetree/bindings/regulator/regulator.txt @@ -24,6 +24,14 @@ Optional properties: - regulator-settling-time-us: Settling time, in microseconds, for voltage change if regulator have the constant time for any level voltage change. This is useful when regulator have exponential voltage change. +- regulator-settling-time-up-us: Settling time, in microseconds, for voltage + increase if the regulator needs a constant time to settle after voltage + increases of any level. This is useful for regulators with exponential + voltage changes. +- regulator-settling-time-down-us: Settling time, in microseconds, for voltage + decrease if the regulator needs a constant time to settle after voltage + decreases of any level. This is useful for regulators with exponential + voltage changes. - regulator-soft-start: Enable soft start so that voltage ramps slowly - regulator-state-mem sub-root node for Suspend-to-RAM mode : suspend to memory, the device goes to sleep, but all data stored in memory, diff --git a/Documentation/devicetree/bindings/remoteproc/ti,keystone-rproc.txt b/Documentation/devicetree/bindings/remoteproc/ti,keystone-rproc.txt new file mode 100644 index 000000000000..2aac1aa4123d --- /dev/null +++ b/Documentation/devicetree/bindings/remoteproc/ti,keystone-rproc.txt @@ -0,0 +1,133 @@ +TI Keystone DSP devices +======================= + +The TI Keystone 2 family of SoCs usually have one or more (upto 8) TI DSP Core +sub-systems that are used to offload some of the processor-intensive tasks or +algorithms, for achieving various system level goals. + +These processor sub-systems usually contain additional sub-modules like L1 +and/or L2 caches/SRAMs, an Interrupt Controller, an external memory controller, +a dedicated local power/sleep controller etc. The DSP processor core in +Keystone 2 SoCs is usually a TMS320C66x CorePac processor. + +DSP Device Node: +================ +Each DSP Core sub-system is represented as a single DT node, and should also +have an alias with the stem 'rproc' defined. Each node has a number of required +or optional properties that enable the OS running on the host processor (ARM +CorePac) to perform the device management of the remote processor and to +communicate with the remote processor. + +Required properties: +-------------------- +The following are the mandatory properties: + +- compatible: Should be one of the following, + "ti,k2hk-dsp" for DSPs on Keystone 2 66AK2H/K SoCs + "ti,k2l-dsp" for DSPs on Keystone 2 66AK2L SoCs + "ti,k2e-dsp" for DSPs on Keystone 2 66AK2E SoCs + +- reg: Should contain an entry for each value in 'reg-names'. + Each entry should have the memory region's start address + and the size of the region, the representation matching + the parent node's '#address-cells' and '#size-cells' values. + +- reg-names: Should contain strings with the following names, each + representing a specific internal memory region, and + should be defined in this order, + "l2sram", "l1pram", "l1dram" + +- clocks: Should contain the device's input clock, and should be + defined as per the bindings in, + Documentation/devicetree/bindings/clock/keystone-gate.txt + +- ti,syscon-dev: Should be a pair of the phandle to the Keystone Device + State Control node, and the register offset of the DSP + boot address register within that node's address space. + +- resets: Should contain the phandle to the reset controller node + managing the resets for this device, and a reset + specifier. Please refer to the following reset bindings + for the reset argument specifier as per SoC, + Documentation/devicetree/bindings/reset/ti-syscon-reset.txt + for 66AK2HK/66AK2L/66AK2E SoCs + +- interrupt-parent: Should contain a phandle to the Keystone 2 IRQ controller + IP node that is used by the ARM CorePac processor to + receive interrupts from the DSP remote processors. See + Documentation/devicetree/bindings/interrupt-controller/ti,keystone-irq.txt + for details. + +- interrupts: Should contain an entry for each value in 'interrupt-names'. + Each entry should have the interrupt source number used by + the remote processor to the host processor. The values should + follow the interrupt-specifier format as dictated by the + 'interrupt-parent' node. The purpose of each is as per the + description in the 'interrupt-names' property. + +- interrupt-names: Should contain strings with the following names, each + representing a specific interrupt, + "vring" - interrupt for virtio based IPC + "exception" - interrupt for exception notification + +- kick-gpios: Should specify the gpio device needed for the virtio IPC + stack. This will be used to interrupt the remote processor. + The gpio device to be used is as per the bindings in, + Documentation/devicetree/bindings/gpio/gpio-dsp-keystone.txt + +Optional properties: +-------------------- + +- memory-region: phandle to the reserved memory node to be associated + with the remoteproc device. The reserved memory node + can be a CMA memory node, and should be defined as + per the bindings in + Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt + + +Example: +-------- + /* 66AK2H/K DSP aliases */ + aliases { + rproc0 = &dsp0; + rproc1 = &dsp1; + rproc2 = &dsp2; + rproc3 = &dsp3; + rproc4 = &dsp4; + rproc5 = &dsp5; + rproc6 = &dsp6; + rproc7 = &dsp7; + }; + + /* 66AK2H/K DSP memory node */ + reserved-memory { + #address-cells = <2>; + #size-cells = <2>; + ranges; + + dsp_common_memory: dsp-common-memory@81f800000 { + compatible = "shared-dma-pool"; + reg = <0x00000008 0x1f800000 0x00000000 0x800000>; + reusable; + }; + }; + + /* 66AK2H/K DSP node */ + soc { + dsp0: dsp@10800000 { + compatible = "ti,k2hk-dsp"; + reg = <0x10800000 0x00100000>, + <0x10e00000 0x00008000>, + <0x10f00000 0x00008000>; + reg-names = "l2sram", "l1pram", "l1dram"; + clocks = <&clkgem0>; + ti,syscon-dev = <&devctrl 0x40>; + resets = <&pscrst 0>; + interrupt-parent = <&kirq0>; + interrupts = <0 8>; + interrupt-names = "vring", "exception"; + kick-gpios = <&dspgpio0 27 0>; + memory-region = <&dsp_common_memory>; + }; + + }; diff --git a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt index 3da0ebdba8d9..16291f2a4688 100644 --- a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt +++ b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt @@ -68,6 +68,9 @@ Linux implementation note: - If a "linux,cma-default" property is present, then Linux will use the region for the default pool of the contiguous memory allocator. +- If a "linux,dma-default" property is present, then Linux will use the + region for the default pool of the consistent DMA allocator. + Device node references to reserved memory ----------------------------------------- Regions in the /reserved-memory node may be referenced by other device diff --git a/Documentation/devicetree/bindings/reset/ti,sci-reset.txt b/Documentation/devicetree/bindings/reset/ti,sci-reset.txt new file mode 100644 index 000000000000..8b1cf022f18a --- /dev/null +++ b/Documentation/devicetree/bindings/reset/ti,sci-reset.txt @@ -0,0 +1,62 @@ +Texas Instruments System Control Interface (TI-SCI) Reset Controller +===================================================================== + +Some TI SoCs contain a system controller (like the Power Management Micro +Controller (PMMC) on Keystone 66AK2G SoC) that are responsible for controlling +the state of the various hardware modules present on the SoC. Communication +between the host processor running an OS and the system controller happens +through a protocol called TI System Control Interface (TI-SCI protocol). +For TI SCI details, please refer to the document, +Documentation/devicetree/bindings/arm/keystone/ti,sci.txt + +TI-SCI Reset Controller Node +============================ +This reset controller node uses the TI SCI protocol to perform the reset +management of various hardware modules present on the SoC. Must be a child +node of the associated TI-SCI system controller node. + +Required properties: +-------------------- + - compatible : Should be "ti,sci-reset" + - #reset-cells : Should be 2. Please see the reset consumer node below for + usage details. + +TI-SCI Reset Consumer Nodes +=========================== +Each of the reset consumer nodes should have the following properties, +in addition to their own properties. + +Required properties: +-------------------- + - resets : A phandle and reset specifier pair, one pair for each reset + signal that affects the device, or that the device manages. + The phandle should point to the TI-SCI reset controller node, + and the reset specifier should have 2 cell-values. The first + cell should contain the device ID. The second cell should + contain the reset mask value used by system controller. + Please refer to the protocol documentation for these values + to be used for different devices, + http://processors.wiki.ti.com/index.php/TISCI#66AK2G02_Data + +Please also refer to Documentation/devicetree/bindings/reset/reset.txt for +common reset controller usage by consumers. + +Example: +-------- +The following example demonstrates both a TI-SCI reset controller node and a +consumer (a DSP device) on the 66AK2G SoC. + +pmmc: pmmc { + compatible = "ti,k2g-sci"; + + k2g_reset: reset-controller { + compatible = "ti,sci-reset"; + #reset-cells = <2>; + }; +}; + +dsp0: dsp@10800000 { + ... + resets = <&k2g_reset 0x0046 0x1>; + ... +}; diff --git a/Documentation/devicetree/bindings/rng/mtk-rng.txt b/Documentation/devicetree/bindings/rng/mtk-rng.txt index a6d62a2abd39..366b99bff8cd 100644 --- a/Documentation/devicetree/bindings/rng/mtk-rng.txt +++ b/Documentation/devicetree/bindings/rng/mtk-rng.txt @@ -2,7 +2,9 @@ Device-Tree bindings for Mediatek random number generator found in Mediatek SoC family Required properties: -- compatible : Should be "mediatek,mt7623-rng" +- compatible : Should be + "mediatek,mt7622-rng", "mediatek,mt7623-rng" : for MT7622 + "mediatek,mt7623-rng" : for MT7623 - clocks : list of clock specifiers, corresponding to entries in clock-names property; - clock-names : Should contain "rng" entries; diff --git a/Documentation/devicetree/bindings/rng/timeriomem_rng.txt b/Documentation/devicetree/bindings/rng/timeriomem_rng.txt index 6616d15866a3..214940093b55 100644 --- a/Documentation/devicetree/bindings/rng/timeriomem_rng.txt +++ b/Documentation/devicetree/bindings/rng/timeriomem_rng.txt @@ -5,6 +5,13 @@ Required properties: - reg : base address to sample from - period : wait time in microseconds to use between samples +Optional properties: +- quality : estimated number of bits of true entropy per 1024 bits read from the + rng. Defaults to zero which causes the kernel's default quality to + be used instead. Note that the default quality is usually zero + which disables using this rng to automatically fill the kernel's + entropy pool. + N.B. currently 'reg' must be four bytes wide and aligned Example: diff --git a/Documentation/devicetree/bindings/rtc/brcm,brcmstb-waketimer.txt b/Documentation/devicetree/bindings/rtc/brcm,brcmstb-waketimer.txt new file mode 100644 index 000000000000..1d990bcc0baf --- /dev/null +++ b/Documentation/devicetree/bindings/rtc/brcm,brcmstb-waketimer.txt @@ -0,0 +1,22 @@ +Broadcom STB wake-up Timer + +The Broadcom STB wake-up timer provides a 27Mhz resolution timer, with the +ability to wake up the system from low-power suspend/standby modes. + +Required properties: +- compatible : should contain "brcm,brcmstb-waketimer" +- reg : the register start and length for the WKTMR block +- interrupts : The TIMER interrupt +- interrupt-parent: The phandle to the Always-On (AON) Power Management (PM) L2 + interrupt controller node +- clocks : The phandle to the UPG fixed clock (27Mhz domain) + +Example: + +waketimer@f0411580 { + compatible = "brcm,brcmstb-waketimer"; + reg = <0xf0411580 0x14>; + interrupts = <0x3>; + interrupt-parent = <&aon_pm_l2_intc>; + clocks = <&upg_fixed>; +}; diff --git a/Documentation/devicetree/bindings/rtc/cortina,gemini.txt b/Documentation/devicetree/bindings/rtc/cortina,gemini.txt deleted file mode 100644 index 4ce4e794ddbb..000000000000 --- a/Documentation/devicetree/bindings/rtc/cortina,gemini.txt +++ /dev/null @@ -1,14 +0,0 @@ -* Cortina Systems Gemini RTC - -Gemini SoC real-time clock. - -Required properties: -- compatible : Should be "cortina,gemini-rtc" - -Examples: - -rtc@45000000 { - compatible = "cortina,gemini-rtc"; - reg = <0x45000000 0x100>; - interrupts = <17 IRQ_TYPE_LEVEL_HIGH>; -}; diff --git a/Documentation/devicetree/bindings/rtc/faraday,ftrtc010.txt b/Documentation/devicetree/bindings/rtc/faraday,ftrtc010.txt new file mode 100644 index 000000000000..e3938f5e0b6c --- /dev/null +++ b/Documentation/devicetree/bindings/rtc/faraday,ftrtc010.txt @@ -0,0 +1,28 @@ +* Faraday Technology FTRTC010 Real Time Clock + +This RTC appears in for example the Storlink Gemini family of +SoCs. + +Required properties: +- compatible : Should be one of: + "faraday,ftrtc010" + "cortina,gemini-rtc", "faraday,ftrtc010" + +Optional properties: +- clocks: when present should contain clock references to the + PCLK and EXTCLK clocks. Faraday calls the later CLK1HZ and + says the clock should be 1 Hz, but implementers actually seem + to choose different clocks here, like Cortina who chose + 32768 Hz (a typical low-power clock). +- clock-names: should name the clocks "PCLK" and "EXTCLK" + respectively. + +Examples: + +rtc@45000000 { + compatible = "cortina,gemini-rtc"; + reg = <0x45000000 0x100>; + interrupts = <17 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&foo 0>, <&foo 1>; + clock-names = "PCLK", "EXTCLK"; +}; diff --git a/Documentation/devicetree/bindings/rtc/st,stm32-rtc.txt b/Documentation/devicetree/bindings/rtc/st,stm32-rtc.txt index e2837b951237..0a4c371a9b7a 100644 --- a/Documentation/devicetree/bindings/rtc/st,stm32-rtc.txt +++ b/Documentation/devicetree/bindings/rtc/st,stm32-rtc.txt @@ -1,17 +1,25 @@ STM32 Real Time Clock Required properties: -- compatible: "st,stm32-rtc". +- compatible: can be either "st,stm32-rtc" or "st,stm32h7-rtc", depending on + the device is compatible with stm32(f4/f7) or stm32h7. - reg: address range of rtc register set. -- clocks: reference to the clock entry ck_rtc. +- clocks: can use up to two clocks, depending on part used: + - "rtc_ck": RTC clock source. + It is required on stm32(f4/f7) and stm32h7. + - "pclk": RTC APB interface clock. + It is not present on stm32(f4/f7). + It is required on stm32h7. +- clock-names: must be "rtc_ck" and "pclk". + It is required only on stm32h7. - interrupt-parent: phandle for the interrupt controller. - interrupts: rtc alarm interrupt. - st,syscfg: phandle for pwrcfg, mandatory to disable/enable backup domain (RTC registers) write protection. -Optional properties (to override default ck_rtc parent clock): -- assigned-clocks: reference to the ck_rtc clock entry. -- assigned-clock-parents: phandle of the new parent clock of ck_rtc. +Optional properties (to override default rtc_ck parent clock): +- assigned-clocks: reference to the rtc_ck clock entry. +- assigned-clock-parents: phandle of the new parent clock of rtc_ck. Example: @@ -25,3 +33,17 @@ Example: interrupts = <17 1>; st,syscfg = <&pwrcfg>; }; + + rtc: rtc@58004000 { + compatible = "st,stm32h7-rtc"; + reg = <0x58004000 0x400>; + clocks = <&rcc RTCAPB_CK>, <&rcc RTC_CK>; + clock-names = "pclk", "rtc_ck"; + assigned-clocks = <&rcc RTC_CK>; + assigned-clock-parents = <&rcc LSE_CK>; + interrupt-parent = <&exti>; + interrupts = <17 1>; + interrupt-names = "alarm"; + st,syscfg = <&pwrcfg>; + status = "disabled"; + }; diff --git a/Documentation/devicetree/bindings/serial/8250.txt b/Documentation/devicetree/bindings/serial/8250.txt index 10276a46ecef..419ff6c0a47f 100644 --- a/Documentation/devicetree/bindings/serial/8250.txt +++ b/Documentation/devicetree/bindings/serial/8250.txt @@ -20,6 +20,8 @@ Required properties: - "fsl,16550-FIFO64" - "fsl,ns16550" - "ti,da830-uart" + - "aspeed,ast2400-vuart" + - "aspeed,ast2500-vuart" - "serial" if the port type is unknown. - reg : offset and length of the register set for the device. - interrupts : should contain uart interrupt. @@ -45,6 +47,7 @@ Optional properties: property. - tx-threshold: Specify the TX FIFO low water indication for parts with programmable TX FIFO thresholds. +- resets : phandle + reset specifier pairs Note: * fsl,ns16550: diff --git a/Documentation/devicetree/bindings/serial/actions,owl-uart.txt b/Documentation/devicetree/bindings/serial/actions,owl-uart.txt new file mode 100644 index 000000000000..aa873eada02d --- /dev/null +++ b/Documentation/devicetree/bindings/serial/actions,owl-uart.txt @@ -0,0 +1,16 @@ +Actions Semi Owl UART + +Required properties: +- compatible : "actions,s500-uart", "actions,owl-uart" for S500 + "actions,s900-uart", "actions,owl-uart" for S900 +- reg : Offset and length of the register set for the device. +- interrupts : Should contain UART interrupt. + + +Example: + + uart3: serial@b0126000 { + compatible = "actions,s500-uart", "actions,owl-uart"; + reg = <0xb0126000 0x1000>; + interrupts = <GIC_SPI 32 IRQ_TYPE_LEVEL_HIGH>; + }; diff --git a/Documentation/devicetree/bindings/serial/amlogic,meson-uart.txt b/Documentation/devicetree/bindings/serial/amlogic,meson-uart.txt new file mode 100644 index 000000000000..8ff65fa632fd --- /dev/null +++ b/Documentation/devicetree/bindings/serial/amlogic,meson-uart.txt @@ -0,0 +1,38 @@ +Amlogic Meson SoC UART Serial Interface +======================================= + +The Amlogic Meson SoC UART Serial Interface is present on a large range +of SoCs, and can be present either in the "Always-On" power domain or the +"Everything-Else" power domain. + +The particularity of the "Always-On" Serial Interface is that the hardware +is active since power-on and does not need any clock gating and is usable +as very early serial console. + +Required properties: +- compatible : compatible: value should be different for each SoC family as : + - Meson6 : "amlogic,meson6-uart" + - Meson8 : "amlogic,meson8-uart" + - Meson8b : "amlogic,meson8b-uart" + - GX (GXBB, GXL, GXM) : "amlogic,meson-gx-uart" + eventually followed by : "amlogic,meson-ao-uart" if this UART interface + is in the "Always-On" power domain. +- reg : offset and length of the register set for the device. +- interrupts : identifier to the device interrupt +- clocks : a list of phandle + clock-specifier pairs, one for each + entry in clock names. +- clocks-names : + * "xtal" for external xtal clock identifier + * "pclk" for the bus core clock, either the clk81 clock or the gate clock + * "baud" for the source of the baudrate generator, can be either the xtal + or the pclk. + +e.g. +uart_A: serial@84c0 { + compatible = "amlogic,meson-gx-uart"; + reg = <0x0 0x84c0 0x0 0x14>; + interrupts = <GIC_SPI 26 IRQ_TYPE_EDGE_RISING>; + /* Use xtal as baud rate clock source */ + clocks = <&xtal>, <&clkc CLKID_UART0>, <&xtal>; + clock-names = "xtal", "pclk", "baud"; +}; diff --git a/Documentation/devicetree/bindings/serial/fsl-imx-uart.txt b/Documentation/devicetree/bindings/serial/fsl-imx-uart.txt index 574c3a2c77d5..e6b572409cf5 100644 --- a/Documentation/devicetree/bindings/serial/fsl-imx-uart.txt +++ b/Documentation/devicetree/bindings/serial/fsl-imx-uart.txt @@ -9,6 +9,7 @@ Optional properties: - fsl,irda-mode : Indicate the uart supports irda mode - fsl,dte-mode : Indicate the uart works in DTE mode. The uart works in DCE mode by default. +- fsl,dma-size : Indicate the size of the DMA buffer and its periods Please check Documentation/devicetree/bindings/serial/serial.txt for the complete list of generic properties. @@ -28,4 +29,5 @@ uart1: serial@73fbc000 { interrupts = <31>; uart-has-rtscts; fsl,dte-mode; + fsl,dma-size = <1024 4>; }; diff --git a/Documentation/devicetree/bindings/serial/fsl-lpuart.txt b/Documentation/devicetree/bindings/serial/fsl-lpuart.txt index c95005efbcb8..a1252a047f78 100644 --- a/Documentation/devicetree/bindings/serial/fsl-lpuart.txt +++ b/Documentation/devicetree/bindings/serial/fsl-lpuart.txt @@ -6,6 +6,8 @@ Required properties: on Vybrid vf610 SoC with 8-bit register organization - "fsl,ls1021a-lpuart" for lpuart compatible with the one integrated on LS1021A SoC with 32-bit big-endian register organization + - "fsl,imx7ulp-lpuart" for lpuart compatible with the one integrated + on i.MX7ULP SoC with 32-bit little-endian register organization - reg : Address and length of the register set for the device - interrupts : Should contain uart interrupt - clocks : phandle + clock specifier pairs, one for each entry in clock-names diff --git a/Documentation/devicetree/bindings/serial/mtk-uart.txt b/Documentation/devicetree/bindings/serial/mtk-uart.txt index 0015c722be7b..b6cf384597e1 100644 --- a/Documentation/devicetree/bindings/serial/mtk-uart.txt +++ b/Documentation/devicetree/bindings/serial/mtk-uart.txt @@ -8,6 +8,8 @@ Required properties: * "mediatek,mt6589-uart" for MT6589 compatible UARTS * "mediatek,mt6755-uart" for MT6755 compatible UARTS * "mediatek,mt6795-uart" for MT6795 compatible UARTS + * "mediatek,mt6797-uart" for MT6797 compatible UARTS + * "mediatek,mt7622-uart" for MT7622 compatible UARTS * "mediatek,mt7623-uart" for MT7623 compatible UARTS * "mediatek,mt8127-uart" for MT8127 compatible UARTS * "mediatek,mt8135-uart" for MT8135 compatible UARTS diff --git a/Documentation/devicetree/bindings/serial/slave-device.txt b/Documentation/devicetree/bindings/serial/slave-device.txt index f66037928f5f..40110e019620 100644 --- a/Documentation/devicetree/bindings/serial/slave-device.txt +++ b/Documentation/devicetree/bindings/serial/slave-device.txt @@ -21,6 +21,15 @@ Optional Properties: can support. For example, a particular board has some signal quality issue or the host processor can't support higher baud rates. +- current-speed : The current baud rate the device operates at. This should + only be present in case a driver has no chance to know + the baud rate of the slave device. + Examples: + * device supports auto-baud + * the rate is setup by a bootloader and there is no + way to reset the device + * device baud rate is configured by its firmware but + there is no way to request the actual settings Example: diff --git a/Documentation/devicetree/bindings/soc/mediatek/scpsys.txt b/Documentation/devicetree/bindings/soc/mediatek/scpsys.txt index 16fe94d7783c..b1d165b4d4b3 100644 --- a/Documentation/devicetree/bindings/soc/mediatek/scpsys.txt +++ b/Documentation/devicetree/bindings/soc/mediatek/scpsys.txt @@ -9,11 +9,14 @@ domain control. The driver implements the Generic PM domain bindings described in power/power_domain.txt. It provides the power domains defined in -include/dt-bindings/power/mt8173-power.h and mt2701-power.h. +- include/dt-bindings/power/mt8173-power.h +- include/dt-bindings/power/mt6797-power.h +- include/dt-bindings/power/mt2701-power.h Required properties: - compatible: Should be one of: - "mediatek,mt2701-scpsys" + - "mediatek,mt6797-scpsys" - "mediatek,mt8173-scpsys" - #power-domain-cells: Must be 1 - reg: Address range of the SCPSYS unit @@ -22,6 +25,7 @@ Required properties: These are clocks which hardware needs to be enabled before enabling certain power domains. Required clocks for MT2701: "mm", "mfg", "ethif" + Required clocks for MT6797: "mm", "mfg", "vdec" Required clocks for MT8173: "mm", "mfg", "venc", "venc_lt" Optional properties: diff --git a/Documentation/devicetree/bindings/soc/qcom/qcom,glink.txt b/Documentation/devicetree/bindings/soc/qcom/qcom,glink.txt new file mode 100644 index 000000000000..50fc20c6ce91 --- /dev/null +++ b/Documentation/devicetree/bindings/soc/qcom/qcom,glink.txt @@ -0,0 +1,73 @@ +Qualcomm RPM GLINK binding + +This binding describes the Qualcomm RPM GLINK, a fifo based mechanism for +communication with the Resource Power Management system on various Qualcomm +platforms. + +- compatible: + Usage: required + Value type: <stringlist> + Definition: must be "qcom,glink-rpm" + +- interrupts: + Usage: required + Value type: <prop-encoded-array> + Definition: should specify the IRQ used by the remote processor to + signal this processor about communication related events + +- qcom,rpm-msg-ram: + Usage: required + Value type: <prop-encoded-array> + Definition: handle to RPM message memory resource + +- mboxes: + Usage: required + Value type: <prop-encoded-array> + Definition: reference to the "rpm_hlos" mailbox in APCS, as described + in mailbox/mailbox.txt + += GLINK DEVICES +Each subnode of the GLINK node represent function tied to a virtual +communication channel. The name of the nodes are not important. The properties +of these nodes are defined by the individual bindings for the specific function +- but must contain the following property: + +- qcom,glink-channels: + Usage: required + Value type: <stringlist> + Definition: a list of channels tied to this function, used for matching + the function to a set of virtual channels + += EXAMPLE +The following example represents the GLINK RPM node on a MSM8996 device, with +the function for the "rpm_request" channel defined, which is used for +regualtors and root clocks. + + apcs_glb: mailbox@9820000 { + compatible = "qcom,msm8996-apcs-hmss-global"; + reg = <0x9820000 0x1000>; + + #mbox-cells = <1>; + }; + + rpm_msg_ram: memory@68000 { + compatible = "qcom,rpm-msg-ram"; + reg = <0x68000 0x6000>; + }; + + rpm-glink { + compatible = "qcom,glink-rpm"; + + interrupts = <GIC_SPI 168 IRQ_TYPE_EDGE_RISING>; + + qcom,rpm-msg-ram = <&rpm_msg_ram>; + + mboxes = <&apcs_glb 0>; + + rpm-requests { + compatible = "qcom,rpm-msm8996"; + qcom,glink-channels = "rpm_requests"; + + ... + }; + }; diff --git a/Documentation/devicetree/bindings/sound/audio-graph-card.txt b/Documentation/devicetree/bindings/sound/audio-graph-card.txt new file mode 100644 index 000000000000..6e6720aa33f1 --- /dev/null +++ b/Documentation/devicetree/bindings/sound/audio-graph-card.txt @@ -0,0 +1,129 @@ +Audio Graph Card: + +Audio Graph Card specifies audio DAI connections of SoC <-> codec. +It is based on common bindings for device graphs. +see ${LINUX}/Documentation/devicetree/bindings/graph.txt + +Basically, Audio Graph Card property is same as Simple Card. +see ${LINUX}/Documentation/devicetree/bindings/sound/simple-card.txt + +Below are same as Simple-Card. + +- label +- widgets +- routing +- dai-format +- frame-master +- bitclock-master +- bitclock-inversion +- frame-inversion +- dai-tdm-slot-num +- dai-tdm-slot-width +- clocks / system-clock-frequency + +Required properties: + +- compatible : "audio-graph-card"; +- dais : list of CPU DAI port{s} + +Optional properties: +- pa-gpios: GPIO used to control external amplifier. + +Example: Single DAI case + + sound_card { + compatible = "audio-graph-card"; + + dais = <&cpu_port>; + }; + + dai-controller { + ... + cpu_port: port { + cpu_endpoint: endpoint { + remote-endpoint = <&codec_endpoint>; + + dai-format = "left_j"; + ... + }; + }; + }; + + audio-codec { + ... + port { + codec_endpoint: endpoint { + remote-endpoint = <&cpu_endpoint>; + }; + }; + }; + +Example: Multi DAI case + + sound-card { + compatible = "audio-graph-card"; + + label = "sound-card"; + + dais = <&cpu_port0 + &cpu_port1 + &cpu_port2>; + }; + + audio-codec@0 { + ... + port { + codec0_endpoint: endpoint { + remote-endpoint = <&cpu_endpoint0>; + }; + }; + }; + + audio-codec@1 { + ... + port { + codec1_endpoint: endpoint { + remote-endpoint = <&cpu_endpoint1>; + }; + }; + }; + + audio-codec@2 { + ... + port { + codec2_endpoint: endpoint { + remote-endpoint = <&cpu_endpoint2>; + }; + }; + }; + + dai-controller { + ... + ports { + cpu_port0: port@0 { + cpu_endpoint0: endpoint { + remote-endpoint = <&codec0_endpoint>; + + dai-format = "left_j"; + ... + }; + }; + cpu_port1: port@1 { + cpu_endpoint1: endpoint { + remote-endpoint = <&codec1_endpoint>; + + dai-format = "i2s"; + ... + }; + }; + cpu_port2: port@2 { + cpu_endpoint2: endpoint { + remote-endpoint = <&codec2_endpoint>; + + dai-format = "i2s"; + ... + }; + }; + }; + }; + diff --git a/Documentation/devicetree/bindings/sound/audio-graph-scu-card.txt b/Documentation/devicetree/bindings/sound/audio-graph-scu-card.txt new file mode 100644 index 000000000000..8b8afe9fcb31 --- /dev/null +++ b/Documentation/devicetree/bindings/sound/audio-graph-scu-card.txt @@ -0,0 +1,122 @@ +Audio-Graph-SCU-Card: + +Audio-Graph-SCU-Card is "Audio-Graph-Card" + "ALSA DPCM". + +It is based on common bindings for device graphs. +see ${LINUX}/Documentation/devicetree/bindings/graph.txt + +Basically, Audio-Graph-SCU-Card property is same as +Simple-Card / Simple-SCU-Card / Audio-Graph-Card. +see ${LINUX}/Documentation/devicetree/bindings/sound/simple-card.txt + ${LINUX}/Documentation/devicetree/bindings/sound/simple-scu-card.txt + ${LINUX}/Documentation/devicetree/bindings/sound/audio-graph-card.txt + +Below are same as Simple-Card / Audio-Graph-Card. + +- label +- dai-format +- frame-master +- bitclock-master +- bitclock-inversion +- frame-inversion +- dai-tdm-slot-num +- dai-tdm-slot-width +- clocks / system-clock-frequency + +Below are same as Simple-SCU-Card. + +- convert-rate +- convert-channels +- prefix +- routing + +Required properties: + +- compatible : "audio-graph-scu-card"; +- dais : list of CPU DAI port{s} + +Example 1. Sampling Rate Conversion + + sound_card { + compatible = "audio-graph-scu-card"; + + label = "sound-card"; + prefix = "codec"; + routing = "codec Playback", "DAI0 Playback", + "codec Playback", "DAI1 Playback"; + convert-rate = <48000>; + + dais = <&cpu_port>; + }; + + audio-codec { + ... + + port { + codec_endpoint: endpoint { + remote-endpoint = <&cpu_endpoint>; + }; + }; + }; + + dai-controller { + ... + cpu_port: port { + cpu_endpoint: endpoint { + remote-endpoint = <&codec_endpoint>; + + dai-format = "left_j"; + ... + }; + }; + }; + +Example 2. 2 CPU 1 Codec (Mixing) + + sound_card { + compatible = "audio-graph-scu-card"; + + label = "sound-card"; + prefix = "codec"; + routing = "codec Playback", "DAI0 Playback", + "codec Playback", "DAI1 Playback"; + convert-rate = <48000>; + + dais = <&cpu_port0 + &cpu_port1>; + }; + + audio-codec { + ... + + port { + codec_endpoint0: endpoint { + remote-endpoint = <&cpu_endpoint0>; + }; + codec_endpoint1: endpoint { + remote-endpoint = <&cpu_endpoint1>; + }; + }; + }; + + dai-controller { + ... + ports { + cpu_port0: port { + cpu_endpoint0: endpoint { + remote-endpoint = <&codec_endpoint0>; + + dai-format = "left_j"; + ... + }; + }; + cpu_port1: port { + cpu_endpoint1: endpoint { + remote-endpoint = <&codec_endpoint1>; + + dai-format = "left_j"; + ... + }; + }; + }; + }; diff --git a/Documentation/devicetree/bindings/sound/cs35l35.txt b/Documentation/devicetree/bindings/sound/cs35l35.txt index 016b768bc722..77ee75c39233 100644 --- a/Documentation/devicetree/bindings/sound/cs35l35.txt +++ b/Documentation/devicetree/bindings/sound/cs35l35.txt @@ -16,6 +16,9 @@ Required properties: (See Documentation/devicetree/bindings/interrupt-controller/interrupts.txt for further information relating to interrupt properties) + - cirrus,boost-ind-nanohenry: Inductor value for boost converter. The value is + in nH and they can be values of 1000nH, 1200nH, 1500nH, and 2200nH. + Optional properties: - reset-gpios : gpio used to reset the amplifier diff --git a/Documentation/devicetree/bindings/sound/nau8825.txt b/Documentation/devicetree/bindings/sound/nau8825.txt index d3374231c871..2f5e973285a6 100644 --- a/Documentation/devicetree/bindings/sound/nau8825.txt +++ b/Documentation/devicetree/bindings/sound/nau8825.txt @@ -69,6 +69,8 @@ Optional properties: - nuvoton,jack-insert-debounce: number from 0 to 7 that sets debounce time to 2^(n+2) ms - nuvoton,jack-eject-debounce: number from 0 to 7 that sets debounce time to 2^(n+2) ms + - nuvoton,crosstalk-bypass: make crosstalk function bypass if set. + - clocks: list of phandle and clock specifier pairs according to common clock bindings for the clocks described in clock-names - clock-names: should include "mclk" for the MCLK master clock @@ -96,6 +98,7 @@ Example: nuvoton,short-key-debounce = <2>; nuvoton,jack-insert-debounce = <7>; nuvoton,jack-eject-debounce = <7>; + nuvoton,crosstalk-bypass; clock-names = "mclk"; clocks = <&tegra_car TEGRA210_CLK_CLK_OUT_2>; diff --git a/Documentation/devicetree/bindings/sound/renesas,rsnd.txt b/Documentation/devicetree/bindings/sound/renesas,rsnd.txt index 15a7316e4c91..7246bb268bf9 100644 --- a/Documentation/devicetree/bindings/sound/renesas,rsnd.txt +++ b/Documentation/devicetree/bindings/sound/renesas,rsnd.txt @@ -83,11 +83,11 @@ SRC can convert [xx]Hz to [yy]Hz. Then, it has below 2 modes ** Asynchronous mode ------------------ -You need to use "renesas,rsrc-card" sound card for it. +You need to use "simple-scu-audio-card" sound card for it. example) sound { - compatible = "renesas,rsrc-card"; + compatible = "simple-scu-audio-card"; ... /* * SRC Asynchronous mode setting @@ -97,12 +97,12 @@ example) * Inputed 48kHz data will be converted to * system specified Hz */ - convert-rate = <48000>; + simple-audio-card,convert-rate = <48000>; ... - cpu { + simple-audio-card,cpu { sound-dai = <&rcar_sound>; }; - codec { + simple-audio-card,codec { ... }; }; @@ -141,23 +141,23 @@ For more detail information, see below ${LINUX}/sound/soc/sh/rcar/ctu.c - comment of header -You need to use "renesas,rsrc-card" sound card for it. +You need to use "simple-scu-audio-card" sound card for it. example) sound { - compatible = "renesas,rsrc-card"; + compatible = "simple-scu-audio-card"; ... /* * CTU setting * All input data will be converted to 2ch * as output data */ - convert-channels = <2>; + simple-audio-card,convert-channels = <2>; ... - cpu { + simple-audio-card,cpu { sound-dai = <&rcar_sound>; }; - codec { + simple-audio-card,codec { ... }; }; @@ -190,22 +190,22 @@ and these sounds will be merged by MIX. aplay -D plughw:0,0 xxxx.wav & aplay -D plughw:0,1 yyyy.wav -You need to use "renesas,rsrc-card" sound card for it. +You need to use "simple-scu-audio-card" sound card for it. Ex) [MEM] -> [SRC1] -> [CTU02] -+-> [MIX0] -> [DVC0] -> [SSI0] | [MEM] -> [SRC2] -> [CTU03] -+ sound { - compatible = "renesas,rsrc-card"; + compatible = "simple-scu-audio-card"; ... - cpu@0 { + simple-audio-card,cpu@0 { sound-dai = <&rcar_sound 0>; }; - cpu@1 { + simple-audio-card,cpu@1 { sound-dai = <&rcar_sound 1>; }; - codec { + simple-audio-card,codec { ... }; }; @@ -368,6 +368,10 @@ Required properties: see below for detail. - #sound-dai-cells : it must be 0 if your system is using single DAI it must be 1 if your system is using multi DAI +- clocks : References to SSI/SRC/MIX/CTU/DVC/AUDIO_CLK clocks. +- clock-names : List of necessary clock names. + "ssi-all", "ssi.X", "src.X", "mix.X", "ctu.X", + "dvc.X", "clk_a", "clk_b", "clk_c", "clk_i" Optional properties: - #clock-cells : it must be 0 if your system has audio_clkout @@ -375,6 +379,9 @@ Optional properties: - clock-frequency : for all audio_clkout0/1/2/3 - clkout-lr-asynchronous : boolean property. it indicates that audio_clkoutn is asynchronizes with lr-clock. +- resets : References to SSI resets. +- reset-names : List of valid reset names. + "ssi-all", "ssi.X" SSI subnode properties: - interrupts : Should contain SSI interrupt for PIO transfer diff --git a/Documentation/devicetree/bindings/sound/rockchip,pdm.txt b/Documentation/devicetree/bindings/sound/rockchip,pdm.txt new file mode 100644 index 000000000000..921729de7346 --- /dev/null +++ b/Documentation/devicetree/bindings/sound/rockchip,pdm.txt @@ -0,0 +1,39 @@ +* Rockchip PDM controller + +Required properties: + +- compatible: "rockchip,pdm" +- reg: physical base address of the controller and length of memory mapped + region. +- dmas: DMA specifiers for rx dma. See the DMA client binding, + Documentation/devicetree/bindings/dma/dma.txt +- dma-names: should include "rx". +- clocks: a list of phandle + clock-specifer pairs, one for each entry in clock-names. +- clock-names: should contain following: + - "pdm_hclk": clock for PDM BUS + - "pdm_clk" : clock for PDM controller +- pinctrl-names: Must contain a "default" entry. +- pinctrl-N: One property must exist for each entry in + pinctrl-names. See ../pinctrl/pinctrl-bindings.txt + for details of the property values. + +Example for rk3328 PDM controller: + +pdm: pdm@ff040000 { + compatible = "rockchip,pdm"; + reg = <0x0 0xff040000 0x0 0x1000>; + clocks = <&clk_pdm>, <&clk_gates28 0>; + clock-names = "pdm_clk", "pdm_hclk"; + dmas = <&pdma 16>; + #dma-cells = <1>; + dma-names = "rx"; + pinctrl-names = "default", "sleep"; + pinctrl-0 = <&pdmm0_clk + &pdmm0_fsync + &pdmm0_sdi0 + &pdmm0_sdi1 + &pdmm0_sdi2 + &pdmm0_sdi3>; + pinctrl-1 = <&pdmm0_sleep>; + status = "disabled"; +}; diff --git a/Documentation/devicetree/bindings/sound/rockchip-spdif.txt b/Documentation/devicetree/bindings/sound/rockchip-spdif.txt index 11046429a118..4706b96d450b 100644 --- a/Documentation/devicetree/bindings/sound/rockchip-spdif.txt +++ b/Documentation/devicetree/bindings/sound/rockchip-spdif.txt @@ -9,7 +9,9 @@ Required properties: - compatible: should be one of the following: - "rockchip,rk3066-spdif" - "rockchip,rk3188-spdif" + - "rockchip,rk3228-spdif" - "rockchip,rk3288-spdif" + - "rockchip,rk3328-spdif" - "rockchip,rk3366-spdif" - "rockchip,rk3368-spdif" - "rockchip,rk3399-spdif" diff --git a/Documentation/devicetree/bindings/sound/samsung,odroid.txt b/Documentation/devicetree/bindings/sound/samsung,odroid.txt index c1ac70cb0afb..c30934dd975b 100644 --- a/Documentation/devicetree/bindings/sound/samsung,odroid.txt +++ b/Documentation/devicetree/bindings/sound/samsung,odroid.txt @@ -5,11 +5,6 @@ Required properties: - compatible - "samsung,odroidxu3-audio" - for Odroid XU3 board, "samsung,odroidxu4-audio" - for Odroid XU4 board - model - the user-visible name of this sound complex - - 'cpu' subnode with a 'sound-dai' property containing the phandle of the I2S - controller - - 'codec' subnode with a 'sound-dai' property containing list of phandles - to the CODEC nodes, first entry must be corresponding to the MAX98090 - CODEC and the second entry must be the phandle of the HDMI IP block node - clocks - should contain entries matching clock names in the clock-names property - clock-names - should contain following entries: @@ -32,12 +27,18 @@ Required properties: For Odroid XU4: no entries +Required sub-nodes: + + - 'cpu' subnode with a 'sound-dai' property containing the phandle of the I2S + controller + - 'codec' subnode with a 'sound-dai' property containing list of phandles + to the CODEC nodes, first entry must be corresponding to the MAX98090 + CODEC and the second entry must be the phandle of the HDMI IP block node + Example: sound { compatible = "samsung,odroidxu3-audio"; - samsung,cpu-dai = <&i2s0>; - samsung,codec-dai = <&max98090>; model = "Odroid-XU3"; samsung,audio-routing = "Headphone Jack", "HPL", diff --git a/Documentation/devicetree/bindings/sound/simple-scu-card.txt b/Documentation/devicetree/bindings/sound/simple-scu-card.txt index d6fe47ed09af..327d229a51b2 100644 --- a/Documentation/devicetree/bindings/sound/simple-scu-card.txt +++ b/Documentation/devicetree/bindings/sound/simple-scu-card.txt @@ -1,35 +1,29 @@ -ASoC simple SCU Sound Card +ASoC Simple SCU Sound Card -Simple-Card specifies audio DAI connections of SoC <-> codec. +Simple SCU Sound Card is "Simple Sound Card" + "ALSA DPCM". +For example, you can use this driver if you want to exchange sampling rate convert, +Mixing, etc... Required properties: - compatible : "simple-scu-audio-card" "renesas,rsrc-card" - Optional properties: -- simple-audio-card,name : User specified audio sound card name, one string - property. -- simple-audio-card,cpu : CPU sub-node -- simple-audio-card,codec : CODEC sub-node +- simple-audio-card,name : see simple-audio-card.txt +- simple-audio-card,cpu : see simple-audio-card.txt +- simple-audio-card,codec : see simple-audio-card.txt Optional subnode properties: -- simple-audio-card,format : CPU/CODEC common audio format. - "i2s", "right_j", "left_j" , "dsp_a" - "dsp_b", "ac97", "pdm", "msb", "lsb" -- simple-audio-card,frame-master : Indicates dai-link frame master. - phandle to a cpu or codec subnode. -- simple-audio-card,bitclock-master : Indicates dai-link bit clock master. - phandle to a cpu or codec subnode. -- simple-audio-card,bitclock-inversion : bool property. Add this if the - dai-link uses bit clock inversion. -- simple-audio-card,frame-inversion : bool property. Add this if the - dai-link uses frame clock inversion. +- simple-audio-card,format : see simple-audio-card.txt +- simple-audio-card,frame-master : see simple-audio-card.txt +- simple-audio-card,bitclock-master : see simple-audio-card.txt +- simple-audio-card,bitclock-inversion : see simple-audio-card.txt +- simple-audio-card,frame-inversion : see simple-audio-card.txt - simple-audio-card,convert-rate : platform specified sampling rate convert - simple-audio-card,convert-channels : platform specified converted channel size (2 - 8 ch) -- simple-audio-card,prefix : see audio-routing +- simple-audio-card,prefix : see routing - simple-audio-card,routing : A list of the connections between audio components. Each entry is a pair of strings, the first being the connection's sink, the second being the connection's source. Valid names for sources. @@ -38,32 +32,23 @@ Optional subnode properties: Required CPU/CODEC subnodes properties: -- sound-dai : phandle and port of CPU/CODEC +- sound-dai : see simple-audio-card.txt Optional CPU/CODEC subnodes properties: -- clocks / system-clock-frequency : specify subnode's clock if needed. - it can be specified via "clocks" if system has - clock node (= common clock), or "system-clock-frequency" - (if system doens't support common clock) - If a clock is specified, it is - enabled with clk_prepare_enable() - in dai startup() and disabled with - clk_disable_unprepare() in dai - shutdown(). +- clocks / system-clock-frequency : see simple-audio-card.txt -Example 1. Sampling Rate Covert +Example 1. Sampling Rate Conversion sound { compatible = "simple-scu-audio-card"; simple-audio-card,name = "rsnd-ak4643"; simple-audio-card,format = "left_j"; - simple-audio-card,format = "left_j"; simple-audio-card,bitclock-master = <&sndcodec>; simple-audio-card,frame-master = <&sndcodec>; - simple-audio-card,convert-rate = <48000>; /* see audio_clk_a */ + simple-audio-card,convert-rate = <48000>; simple-audio-card,prefix = "ak4642"; simple-audio-card,routing = "ak4642 Playback", "DAI0 Playback", @@ -79,20 +64,18 @@ sound { }; }; -Example 2. 2 CPU 1 Codec +Example 2. 2 CPU 1 Codec (Mixing) sound { - compatible = "renesas,rsrc-card"; - - card-name = "rsnd-ak4643"; - format = "left_j"; - bitclock-master = <&dpcmcpu>; - frame-master = <&dpcmcpu>; + compatible = "simple-scu-audio-card"; - convert-rate = <48000>; /* see audio_clk_a */ + simple-audio-card,name = "rsnd-ak4643"; + simple-audio-card,format = "left_j"; + simple-audio-card,bitclock-master = <&dpcmcpu>; + simple-audio-card,frame-master = <&dpcmcpu>; - audio-prefix = "ak4642"; - audio-routing = "ak4642 Playback", "DAI0 Playback", + simple-audio-card,prefix = "ak4642"; + simple-audio-card,routing = "ak4642 Playback", "DAI0 Playback", "ak4642 Playback", "DAI1 Playback"; dpcmcpu: cpu@0 { diff --git a/Documentation/devicetree/bindings/sound/st,stm32-i2s.txt b/Documentation/devicetree/bindings/sound/st,stm32-i2s.txt new file mode 100644 index 000000000000..4bda52042402 --- /dev/null +++ b/Documentation/devicetree/bindings/sound/st,stm32-i2s.txt @@ -0,0 +1,62 @@ +STMicroelectronics STM32 SPI/I2S Controller + +The SPI/I2S block supports I2S/PCM protocols when configured on I2S mode. +Only some SPI instances support I2S. + +Required properties: + - compatible: Must be "st,stm32h7-i2s" + - reg: Offset and length of the device's register set. + - interrupts: Must contain the interrupt line id. + - clocks: Must contain phandle and clock specifier pairs for each entry + in clock-names. + - clock-names: Must contain "i2sclk", "pclk", "x8k" and "x11k". + "i2sclk": clock which feeds the internal clock generator + "pclk": clock which feeds the peripheral bus interface + "x8k": I2S parent clock for sampling rates multiple of 8kHz. + "x11k": I2S parent clock for sampling rates multiple of 11.025kHz. + - dmas: DMA specifiers for tx and rx dma. + See Documentation/devicetree/bindings/dma/stm32-dma.txt. + - dma-names: Identifier for each DMA request line. Must be "tx" and "rx". + - pinctrl-names: should contain only value "default" + - pinctrl-0: see Documentation/devicetree/bindings/pinctrl/pinctrl-stm32.txt + +Optional properties: + - resets: Reference to a reset controller asserting the reset controller + +The device node should contain one 'port' child node with one child 'endpoint' +node, according to the bindings defined in Documentation/devicetree/bindings/ +graph.txt. + +Example: +sound_card { + compatible = "audio-graph-card"; + dais = <&i2s2_port>; +}; + +i2s2: audio-controller@40003800 { + compatible = "st,stm32h7-i2s"; + reg = <0x40003800 0x400>; + interrupts = <36>; + clocks = <&rcc PCLK1>, <&rcc SPI2_CK>, <&rcc PLL1_Q>, <&rcc PLL2_P>; + clock-names = "pclk", "i2sclk", "x8k", "x11k"; + dmas = <&dmamux2 2 39 0x400 0x1>, + <&dmamux2 3 40 0x400 0x1>; + dma-names = "rx", "tx"; + pinctrl-names = "default"; + pinctrl-0 = <&pinctrl_i2s2>; + + i2s2_port: port@0 { + cpu_endpoint: endpoint { + remote-endpoint = <&codec_endpoint>; + format = "i2s"; + }; + }; +}; + +audio-codec { + codec_port: port@0 { + codec_endpoint: endpoint { + remote-endpoint = <&cpu_endpoint>; + }; + }; +}; diff --git a/Documentation/devicetree/bindings/sound/st,stm32-sai.txt b/Documentation/devicetree/bindings/sound/st,stm32-sai.txt index c59a3d779e06..f1c5ae59e7c9 100644 --- a/Documentation/devicetree/bindings/sound/st,stm32-sai.txt +++ b/Documentation/devicetree/bindings/sound/st,stm32-sai.txt @@ -6,7 +6,7 @@ The SAI contains two independent audio sub-blocks. Each sub-block has its own clock generator and I/O lines controller. Required properties: - - compatible: Should be "st,stm32f4-sai" + - compatible: Should be "st,stm32f4-sai" or "st,stm32h7-sai" - reg: Base address and size of SAI common register set. - clocks: Must contain phandle and clock specifier pairs for each entry in clock-names. @@ -36,6 +36,10 @@ SAI subnodes required properties: - pinctrl-names: should contain only value "default" - pinctrl-0: see Documentation/devicetree/bindings/pinctrl/pinctrl-stm32.txt +The device node should contain one 'port' child node with one child 'endpoint' +node, according to the bindings defined in Documentation/devicetree/bindings/ +graph.txt. + Example: sound_card { compatible = "audio-graph-card"; @@ -43,38 +47,29 @@ sound_card { }; sai1: sai1@40015800 { - compatible = "st,stm32f4-sai"; + compatible = "st,stm32h7-sai"; #address-cells = <1>; #size-cells = <1>; - ranges; + ranges = <0 0x40015800 0x400>; reg = <0x40015800 0x4>; - clocks = <&rcc 1 CLK_SAIQ_PDIV>, <&rcc 1 CLK_I2SQ_PDIV>; + clocks = <&rcc PLL1_Q>, <&rcc PLL2_P>; clock-names = "x8k", "x11k"; interrupts = <87>; - sai1b: audio-controller@40015824 { - #sound-dai-cells = <0>; - compatible = "st,stm32-sai-sub-b"; - reg = <0x40015824 0x1C>; - clocks = <&rcc 1 CLK_SAI2>; + sai1a: audio-controller@40015804 { + compatible = "st,stm32-sai-sub-a"; + reg = <0x4 0x1C>; + clocks = <&rcc SAI1_CK>; clock-names = "sai_ck"; - dmas = <&dma2 5 0 0x400 0x0>; + dmas = <&dmamux1 1 87 0x400 0x0>; dma-names = "tx"; pinctrl-names = "default"; - pinctrl-0 = <&pinctrl_sai1b>; - - ports { - #address-cells = <1>; - #size-cells = <0>; + pinctrl-0 = <&pinctrl_sai1a>; - sai1b_port: port@0 { - reg = <0>; - cpu_endpoint: endpoint { - remote-endpoint = <&codec_endpoint>; - audio-graph-card,format = "i2s"; - audio-graph-card,bitclock-master = <&codec_endpoint>; - audio-graph-card,frame-master = <&codec_endpoint>; - }; + sai1b_port: port { + cpu_endpoint: endpoint { + remote-endpoint = <&codec_endpoint>; + format = "i2s"; }; }; }; diff --git a/Documentation/devicetree/bindings/sound/st,stm32-spdifrx.txt b/Documentation/devicetree/bindings/sound/st,stm32-spdifrx.txt new file mode 100644 index 000000000000..33826f2459fa --- /dev/null +++ b/Documentation/devicetree/bindings/sound/st,stm32-spdifrx.txt @@ -0,0 +1,56 @@ +STMicroelectronics STM32 S/PDIF receiver (SPDIFRX). + +The SPDIFRX peripheral, is designed to receive an S/PDIF flow compliant with +IEC-60958 and IEC-61937. + +Required properties: + - compatible: should be "st,stm32h7-spdifrx" + - reg: cpu DAI IP base address and size + - clocks: must contain an entry for kclk (used as S/PDIF signal reference) + - clock-names: must contain "kclk" + - interrupts: cpu DAI interrupt line + - dmas: DMA specifiers for audio data DMA and iec control flow DMA + See STM32 DMA bindings, Documentation/devicetree/bindings/dma/stm32-dma.txt + - dma-names: two dmas have to be defined, "rx" and "rx-ctrl" + +Optional properties: + - resets: Reference to a reset controller asserting the SPDIFRX + +The device node should contain one 'port' child node with one child 'endpoint' +node, according to the bindings defined in Documentation/devicetree/bindings/ +graph.txt. + +Example: +spdifrx: spdifrx@40004000 { + compatible = "st,stm32h7-spdifrx"; + reg = <0x40004000 0x400>; + clocks = <&rcc SPDIFRX_CK>; + clock-names = "kclk"; + interrupts = <97>; + dmas = <&dmamux1 2 93 0x400 0x0>, + <&dmamux1 3 94 0x400 0x0>; + dma-names = "rx", "rx-ctrl"; + pinctrl-0 = <&spdifrx_pins>; + pinctrl-names = "default"; + + spdifrx_port: port { + cpu_endpoint: endpoint { + remote-endpoint = <&codec_endpoint>; + }; + }; +}; + +spdif_in: spdif-in { + compatible = "linux,spdif-dir"; + + codec_port: port { + codec_endpoint: endpoint { + remote-endpoint = <&cpu_endpoint>; + }; + }; +}; + +soundcard { + compatible = "audio-graph-card"; + dais = <&spdifrx_port>; +}; diff --git a/Documentation/devicetree/bindings/sound/sun4i-codec.txt b/Documentation/devicetree/bindings/sound/sun4i-codec.txt index 3863531d1e6d..2d4e10deb6f4 100644 --- a/Documentation/devicetree/bindings/sound/sun4i-codec.txt +++ b/Documentation/devicetree/bindings/sound/sun4i-codec.txt @@ -7,6 +7,7 @@ Required properties: - "allwinner,sun7i-a20-codec" - "allwinner,sun8i-a23-codec" - "allwinner,sun8i-h3-codec" + - "allwinner,sun8i-v3s-codec" - reg: must contain the registers location and length - interrupts: must contain the codec interrupt - dmas: DMA channels for tx and rx dma. See the DMA client binding, @@ -25,6 +26,7 @@ Required properties for the following compatibles: - "allwinner,sun6i-a31-codec" - "allwinner,sun8i-a23-codec" - "allwinner,sun8i-h3-codec" + - "allwinner,sun8i-v3s-codec" - resets: phandle to the reset control for this device - allwinner,audio-routing: A list of the connections between audio components. Each entry is a pair of strings, the first being the @@ -34,15 +36,15 @@ Required properties for the following compatibles: Audio pins on the SoC: "HP" "HPCOM" - "LINEIN" - "LINEOUT" (not on sun8i-a23) + "LINEIN" (not on sun8i-v3s) + "LINEOUT" (not on sun8i-a23 or sun8i-v3s) "MIC1" - "MIC2" + "MIC2" (not on sun8i-v3s) "MIC3" (sun6i-a31 only) Microphone biases from the SoC: "HBIAS" - "MBIAS" + "MBIAS" (not on sun8i-v3s) Board connectors: "Headphone" @@ -55,6 +57,7 @@ Required properties for the following compatibles: Required properties for the following compatibles: - "allwinner,sun8i-a23-codec" - "allwinner,sun8i-h3-codec" + - "allwinner,sun8i-v3s-codec" - allwinner,codec-analog-controls: A phandle to the codec analog controls block in the PRCM. diff --git a/Documentation/devicetree/bindings/sound/sun8i-codec-analog.txt b/Documentation/devicetree/bindings/sound/sun8i-codec-analog.txt index 779b735781ba..1b6e7c4e50ab 100644 --- a/Documentation/devicetree/bindings/sound/sun8i-codec-analog.txt +++ b/Documentation/devicetree/bindings/sound/sun8i-codec-analog.txt @@ -4,6 +4,7 @@ Required properties: - compatible: must be one of the following compatibles: - "allwinner,sun8i-a23-codec-analog" - "allwinner,sun8i-h3-codec-analog" + - "allwinner,sun8i-v3s-codec-analog" Required properties if not a sub-node of the PRCM node: - reg: must contain the registers location and length diff --git a/Documentation/devicetree/bindings/sound/zte,zx-aud96p22.txt b/Documentation/devicetree/bindings/sound/zte,zx-aud96p22.txt new file mode 100644 index 000000000000..41bb1040eb71 --- /dev/null +++ b/Documentation/devicetree/bindings/sound/zte,zx-aud96p22.txt @@ -0,0 +1,24 @@ +ZTE ZX AUD96P22 Audio Codec + +Required properties: + - compatible: Must be "zte,zx-aud96p22" + - #sound-dai-cells: Should be 0 + - reg: I2C bus slave address of AUD96P22 + +Example: + + i2c0: i2c@1486000 { + compatible = "zte,zx296718-i2c"; + reg = <0x01486000 0x1000>; + interrupts = <GIC_SPI 35 IRQ_TYPE_LEVEL_HIGH>; + #address-cells = <1>; + #size-cells = <0>; + clocks = <&audiocrm AUDIO_I2C0_WCLK>; + clock-frequency = <1600000>; + + aud96p22: codec@22 { + compatible = "zte,zx-aud96p22"; + #sound-dai-cells = <0>; + reg = <0x22>; + }; + }; diff --git a/Documentation/devicetree/bindings/spi/sh-msiof.txt b/Documentation/devicetree/bindings/spi/sh-msiof.txt index dc975064fa27..64ee489571c4 100644 --- a/Documentation/devicetree/bindings/spi/sh-msiof.txt +++ b/Documentation/devicetree/bindings/spi/sh-msiof.txt @@ -38,6 +38,8 @@ Optional properties: specifiers, one for transmission, and one for reception. - dma-names : Must contain a list of two DMA names, "tx" and "rx". +- spi-slave : Empty property indicating the SPI controller is used + in slave mode. - renesas,dtdl : delay sync signal (setup) in transmit mode. Must contain one of the following values: 0 (no bit delay) diff --git a/Documentation/devicetree/bindings/spi/spi-bus.txt b/Documentation/devicetree/bindings/spi/spi-bus.txt index 4b1d6e74c744..1f6e86f787ef 100644 --- a/Documentation/devicetree/bindings/spi/spi-bus.txt +++ b/Documentation/devicetree/bindings/spi/spi-bus.txt @@ -1,17 +1,23 @@ SPI (Serial Peripheral Interface) busses -SPI busses can be described with a node for the SPI master device -and a set of child nodes for each SPI slave on the bus. For this -discussion, it is assumed that the system's SPI controller is in -SPI master mode. This binding does not describe SPI controllers -in slave mode. +SPI busses can be described with a node for the SPI controller device +and a set of child nodes for each SPI slave on the bus. The system's SPI +controller may be described for use in SPI master mode or in SPI slave mode, +but not for both at the same time. -The SPI master node requires the following properties: +The SPI controller node requires the following properties: +- compatible - Name of SPI bus controller following generic names + recommended practice. + +In master mode, the SPI controller node requires the following additional +properties: - #address-cells - number of cells required to define a chip select address on the SPI bus. - #size-cells - should be zero. -- compatible - name of SPI bus controller following generic names - recommended practice. + +In slave mode, the SPI controller node requires one additional property: +- spi-slave - Empty property. + No other properties are required in the SPI bus node. It is assumed that a driver for an SPI bus device will understand that it is an SPI bus. However, the binding does not attempt to define the specific method for @@ -21,7 +27,7 @@ assumption that board specific platform code will be used to manage chip selects. Individual drivers can define additional properties to support describing the chip select layout. -Optional properties: +Optional properties (master mode only): - cs-gpios - gpios chip select. - num-cs - total number of chipselects. @@ -41,28 +47,36 @@ cs1 : native cs2 : &gpio1 1 0 cs3 : &gpio1 2 0 -SPI slave nodes must be children of the SPI master node and can -contain the following properties. -- reg - (required) chip select address of device. -- compatible - (required) name of SPI device following generic names - recommended practice. -- spi-max-frequency - (required) Maximum SPI clocking speed of device in Hz. -- spi-cpol - (optional) Empty property indicating device requires - inverse clock polarity (CPOL) mode. -- spi-cpha - (optional) Empty property indicating device requires - shifted clock phase (CPHA) mode. -- spi-cs-high - (optional) Empty property indicating device requires - chip select active high. -- spi-3wire - (optional) Empty property indicating device requires - 3-wire mode. -- spi-lsb-first - (optional) Empty property indicating device requires - LSB first mode. -- spi-tx-bus-width - (optional) The bus width (number of data wires) that is - used for MOSI. Defaults to 1 if not present. -- spi-rx-bus-width - (optional) The bus width (number of data wires) that is - used for MISO. Defaults to 1 if not present. -- spi-rx-delay-us - (optional) Microsecond delay after a read transfer. -- spi-tx-delay-us - (optional) Microsecond delay after a write transfer. + +SPI slave nodes must be children of the SPI controller node. + +In master mode, one or more slave nodes (up to the number of chip selects) can +be present. Required properties are: +- compatible - Name of SPI device following generic names recommended + practice. +- reg - Chip select address of device. +- spi-max-frequency - Maximum SPI clocking speed of device in Hz. + +In slave mode, the (single) slave node is optional. +If present, it must be called "slave". Required properties are: +- compatible - Name of SPI device following generic names recommended + practice. + +All slave nodes can contain the following optional properties: +- spi-cpol - Empty property indicating device requires inverse clock + polarity (CPOL) mode. +- spi-cpha - Empty property indicating device requires shifted clock + phase (CPHA) mode. +- spi-cs-high - Empty property indicating device requires chip select + active high. +- spi-3wire - Empty property indicating device requires 3-wire mode. +- spi-lsb-first - Empty property indicating device requires LSB first mode. +- spi-tx-bus-width - The bus width (number of data wires) that is used for MOSI. + Defaults to 1 if not present. +- spi-rx-bus-width - The bus width (number of data wires) that is used for MISO. + Defaults to 1 if not present. +- spi-rx-delay-us - Microsecond delay after a read transfer. +- spi-tx-delay-us - Microsecond delay after a write transfer. Some SPI controllers and devices support Dual and Quad SPI transfer mode. It allows data in the SPI system to be transferred using 2 wires (DUAL) or 4 diff --git a/Documentation/devicetree/bindings/spi/spi-meson.txt b/Documentation/devicetree/bindings/spi/spi-meson.txt index dc6d0313324a..825c39cae74a 100644 --- a/Documentation/devicetree/bindings/spi/spi-meson.txt +++ b/Documentation/devicetree/bindings/spi/spi-meson.txt @@ -20,3 +20,34 @@ Required properties: #address-cells = <1>; #size-cells = <0>; }; + +* SPICC (SPI Communication Controller) + +The Meson SPICC is generic SPI controller for general purpose Full-Duplex +communications with dedicated 16 words RX/TX PIO FIFOs. + +Required properties: + - compatible: should be "amlogic,meson-gx-spicc" on Amlogic GX SoCs. + - reg: physical base address and length of the controller registers + - interrupts: The interrupt specifier + - clock-names: Must contain "core" + - clocks: phandle of the input clock for the baud rate generator + - #address-cells: should be 1 + - #size-cells: should be 0 + +Optional properties: + - resets: phandle of the internal reset line + +See ../spi/spi-bus.txt for more details on SPI bus master and slave devices +required and optional properties. + +Example : + spi@c1108d80 { + compatible = "amlogic,meson-gx-spicc"; + reg = <0xc1108d80 0x80>; + interrupts = <GIC_SPI 112 IRQ_TYPE_LEVEL_HIGH>; + clock-names = "core"; + clocks = <&clk81>; + #address-cells = <1>; + #size-cells = <0>; + }; diff --git a/Documentation/devicetree/bindings/spi/spi-mt65xx.txt b/Documentation/devicetree/bindings/spi/spi-mt65xx.txt index e43f4cf4cf35..e0318cf92d73 100644 --- a/Documentation/devicetree/bindings/spi/spi-mt65xx.txt +++ b/Documentation/devicetree/bindings/spi/spi-mt65xx.txt @@ -3,7 +3,9 @@ Binding for MTK SPI controller Required properties: - compatible: should be one of the following. - mediatek,mt2701-spi: for mt2701 platforms + - mediatek,mt2712-spi: for mt2712 platforms - mediatek,mt6589-spi: for mt6589 platforms + - mediatek,mt7622-spi: for mt7622 platforms - mediatek,mt8135-spi: for mt8135 platforms - mediatek,mt8173-spi: for mt8173 platforms diff --git a/Documentation/devicetree/bindings/spi/spi-stm32.txt b/Documentation/devicetree/bindings/spi/spi-stm32.txt new file mode 100644 index 000000000000..1b3fa2c119d5 --- /dev/null +++ b/Documentation/devicetree/bindings/spi/spi-stm32.txt @@ -0,0 +1,59 @@ +STMicroelectronics STM32 SPI Controller + +The STM32 SPI controller is used to communicate with external devices using +the Serial Peripheral Interface. It supports full-duplex, half-duplex and +simplex synchronous serial communication with external devices. It supports +from 4 to 32-bit data size. Although it can be configured as master or slave, +only master is supported by the driver. + +Required properties: +- compatible: Must be "st,stm32h7-spi". +- reg: Offset and length of the device's register set. +- interrupts: Must contain the interrupt id. +- clocks: Must contain an entry for spiclk (which feeds the internal clock + generator). +- #address-cells: Number of cells required to define a chip select address. +- #size-cells: Should be zero. + +Optional properties: +- resets: Must contain the phandle to the reset controller. +- A pinctrl state named "default" may be defined to set pins in mode of + operation for SPI transfer. +- dmas: DMA specifiers for tx and rx dma. DMA fifo mode must be used. See the + STM32 DMA bindings, Documentation/devicetree/bindings/dma/stm32-dma.txt. +- dma-names: DMA request names should include "tx" and "rx" if present. +- cs-gpios: list of GPIO chip selects. See the SPI bus bindings, + Documentation/devicetree/bindings/spi/spi-bus.txt + + +Child nodes represent devices on the SPI bus + See ../spi/spi-bus.txt + +Optional properties: +- st,spi-midi-ns: (Master Inter-Data Idleness) minimum time delay in + nanoseconds inserted between two consecutive data frames. + + +Example: + spi2: spi@40003800 { + #address-cells = <1>; + #size-cells = <0>; + compatible = "st,stm32h7-spi"; + reg = <0x40003800 0x400>; + interrupts = <36>; + clocks = <&rcc SPI2_CK>; + resets = <&rcc 1166>; + dmas = <&dmamux1 0 39 0x400 0x01>, + <&dmamux1 1 40 0x400 0x01>; + dma-names = "rx", "tx"; + pinctrl-0 = <&spi2_pins_b>; + pinctrl-names = "default"; + cs-gpios = <&gpioa 11 0>; + + aardvark@0 { + compatible = "totalphase,aardvark"; + reg = <0>; + spi-max-frequency = <4000000>; + st,spi-midi-ns = <4000>; + }; + }; diff --git a/Documentation/devicetree/bindings/thermal/brcm,ns-thermal b/Documentation/devicetree/bindings/thermal/brcm,ns-thermal.txt index 68e047170039..68e047170039 100644 --- a/Documentation/devicetree/bindings/thermal/brcm,ns-thermal +++ b/Documentation/devicetree/bindings/thermal/brcm,ns-thermal.txt diff --git a/Documentation/devicetree/bindings/timer/actions,owl-timer.txt b/Documentation/devicetree/bindings/timer/actions,owl-timer.txt new file mode 100644 index 000000000000..e3c28da80cb2 --- /dev/null +++ b/Documentation/devicetree/bindings/timer/actions,owl-timer.txt @@ -0,0 +1,20 @@ +Actions Semi Owl Timer + +Required properties: +- compatible : "actions,s500-timer" for S500 + "actions,s900-timer" for S900 +- reg : Offset and length of the register set for the device. +- interrupts : Should contain the interrupts. +- interrupt-names : Valid names are: "2hz0", "2hz1", + "timer0", "timer1", "timer2", "timer3" + See ../resource-names.txt + +Example: + + timer@b0168000 { + compatible = "actions,s500-timer"; + reg = <0xb0168000 0x100>; + interrupts = <GIC_SPI 10 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 11 IRQ_TYPE_LEVEL_HIGH>; + interrupt-names = "timer0", "timer1"; + }; diff --git a/Documentation/devicetree/bindings/timer/faraday,fttmr010.txt b/Documentation/devicetree/bindings/timer/faraday,fttmr010.txt index b73ca6cd07f8..195792270414 100644 --- a/Documentation/devicetree/bindings/timer/faraday,fttmr010.txt +++ b/Documentation/devicetree/bindings/timer/faraday,fttmr010.txt @@ -7,7 +7,11 @@ Required properties: - compatible : Must be one of "faraday,fttmr010" - "cortina,gemini-timer" + "cortina,gemini-timer", "faraday,fttmr010" + "moxa,moxart-timer", "faraday,fttmr010" + "aspeed,ast2400-timer" + "aspeed,ast2500-timer" + - reg : Should contain registers location and length - interrupts : Should contain the three timer interrupts usually with flags for falling edge diff --git a/Documentation/devicetree/bindings/timer/moxa,moxart-timer.txt b/Documentation/devicetree/bindings/timer/moxa,moxart-timer.txt deleted file mode 100644 index e207c11630af..000000000000 --- a/Documentation/devicetree/bindings/timer/moxa,moxart-timer.txt +++ /dev/null @@ -1,19 +0,0 @@ -MOXA ART timer - -Required properties: - -- compatible : Must be one of: - - "moxa,moxart-timer" - - "aspeed,ast2400-timer" -- reg : Should contain registers location and length -- interrupts : Should contain the timer interrupt number -- clocks : Should contain phandle for the clock that drives the counter - -Example: - - timer: timer@98400000 { - compatible = "moxa,moxart-timer"; - reg = <0x98400000 0x42>; - interrupts = <19 1>; - clocks = <&coreclk>; - }; diff --git a/Documentation/devicetree/bindings/trivial-devices.txt b/Documentation/devicetree/bindings/trivial-devices.txt index 3e0a34c88e07..35f406dd86b6 100644 --- a/Documentation/devicetree/bindings/trivial-devices.txt +++ b/Documentation/devicetree/bindings/trivial-devices.txt @@ -55,6 +55,7 @@ gmt,g751 G751: Digital Temperature Sensor and Thermal Watchdog with Two-Wire In infineon,slb9635tt Infineon SLB9635 (Soft-) I2C TPM (old protocol, max 100khz) infineon,slb9645tt Infineon SLB9645 I2C TPM (new protocol, max 400khz) isil,isl29028 Intersil ISL29028 Ambient Light and Proximity Sensor +isil,isl29030 Intersil ISL29030 Ambient Light and Proximity Sensor maxim,ds1050 5 Bit Programmable, Pulse-Width Modulator maxim,max1237 Low-Power, 4-/12-Channel, 2-Wire Serial, 12-Bit ADCs maxim,max6625 9-Bit/12-Bit Temperature Sensors with IยฒC-Compatible Serial Interface diff --git a/Documentation/devicetree/bindings/usb/dwc3.txt b/Documentation/devicetree/bindings/usb/dwc3.txt index f658f394c2d3..52fb41046b34 100644 --- a/Documentation/devicetree/bindings/usb/dwc3.txt +++ b/Documentation/devicetree/bindings/usb/dwc3.txt @@ -45,6 +45,8 @@ Optional properties: a free-running PHY clock. - snps,dis-del-phy-power-chg-quirk: when set core will change PHY power from P0 to P1/P2/P3 without delay. + - snps,dis-tx-ipgap-linecheck-quirk: when set, disable u2mac linestate check + during HS transmit. - snps,is-utmi-l1-suspend: true when DWC3 asserts output signal utmi_l1_suspend_n, false when asserts utmi_sleep_n - snps,hird-threshold: HIRD threshold diff --git a/Documentation/devicetree/bindings/usb/exynos-usb.txt b/Documentation/devicetree/bindings/usb/exynos-usb.txt index 9b4dbe3b2acc..78ebebb66dad 100644 --- a/Documentation/devicetree/bindings/usb/exynos-usb.txt +++ b/Documentation/devicetree/bindings/usb/exynos-usb.txt @@ -92,6 +92,8 @@ Required properties: parent's address space - clocks: Clock IDs array as required by the controller. - clock-names: names of clocks correseponding to IDs in the clock property + - vdd10-supply: 1.0V powr supply + - vdd33-supply: 3.0V/3.3V power supply Sub-nodes: The dwc3 core should be added as subnode to Exynos dwc3 glue. @@ -107,6 +109,8 @@ Example: #address-cells = <1>; #size-cells = <1>; ranges; + vdd10-supply = <&ldo11_reg>; + vdd33-supply = <&ldo9_reg>; dwc3 { compatible = "synopsys,dwc3"; diff --git a/Documentation/devicetree/bindings/usb/iproc-udc.txt b/Documentation/devicetree/bindings/usb/iproc-udc.txt new file mode 100644 index 000000000000..272d7faf1a97 --- /dev/null +++ b/Documentation/devicetree/bindings/usb/iproc-udc.txt @@ -0,0 +1,21 @@ +Broadcom IPROC USB Device controller. + +The device node is used for UDCs integrated into Broadcom's +iProc family (Northstar2, Cygnus) of SoCs'. The UDC is based +on Synopsys Designware Cores AHB Subsystem Device Controller +IP. + +Required properties: + - compatible: Add the compatibility strings for supported platforms. + For Broadcom NS2 platform, add "brcm,ns2-udc","brcm,iproc-udc". + For Broadcom Cygnus platform, add "brcm,cygnus-udc", "brcm,iproc-udc". + - reg: Offset and length of UDC register set + - interrupts: description of interrupt line + - phys: phandle to phy node. + +Example: + udc_dwc: usb@664e0000 { + compatible = "brcm,ns2-udc", "brcm,iproc-udc"; + reg = <0x664e0000 0x2000>; + interrupts = <GIC_SPI 424 IRQ_TYPE_LEVEL_HIGH>; + phys = <&usbdrd_phy>; diff --git a/Documentation/devicetree/bindings/usb/usb-ohci.txt b/Documentation/devicetree/bindings/usb/usb-ohci.txt index 9df456968596..e8766b08c93b 100644 --- a/Documentation/devicetree/bindings/usb/usb-ohci.txt +++ b/Documentation/devicetree/bindings/usb/usb-ohci.txt @@ -10,6 +10,7 @@ Optional properties: - big-endian-desc : boolean, set this for hcds with big-endian descriptors - big-endian : boolean, for hcds with big-endian-regs + big-endian-desc - no-big-frame-no : boolean, set if frame_no lives in bits [15:0] of HCCA +- remote-wakeup-connected: remote wakeup is wired on the platform - num-ports : u32, to override the detected port count - clocks : a list of phandle + clock specifier pairs - phys : phandle + phy specifier pair diff --git a/Documentation/devicetree/bindings/vendor-prefixes.txt b/Documentation/devicetree/bindings/vendor-prefixes.txt index c03d20140366..daf465bef758 100644 --- a/Documentation/devicetree/bindings/vendor-prefixes.txt +++ b/Documentation/devicetree/bindings/vendor-prefixes.txt @@ -5,6 +5,7 @@ using them to avoid name-space collisions. abcn Abracon Corporation abilis Abilis Systems +actions Actions Semiconductor Co., Ltd. active-semi Active-Semi International Inc ad Avionic Design GmbH adapteva Adapteva, Inc. @@ -28,6 +29,7 @@ andestech Andes Technology Corporation apm Applied Micro Circuits Corporation (APM) aptina Aptina Imaging arasan Arasan Chip Systems +arctic Arctic Sand aries Aries Embedded GmbH arm ARM Ltd. armadeus ARMadeus Systems SARL @@ -44,6 +46,7 @@ avia avia semiconductor avic Shanghai AVIC Optoelectronics Co., Ltd. axentia Axentia Technologies AB axis Axis Communications AB +bananapi BIPAI KEJI LIMITED boe BOE Technology Group Co., Ltd. bosch Bosch Sensortec GmbH boundary Boundary Devices Inc. @@ -158,6 +161,8 @@ iom Iomega Corporation isee ISEE 2007 S.L. isil Intersil issi Integrated Silicon Solutions Inc. +itead ITEAD Intelligent Systems Co.Ltd +iwave iWave Systems Technologies Pvt. Ltd. jdi Japan Display Inc. jedec JEDEC Solid State Technology Association karo Ka-Ro electronics GmbH @@ -165,6 +170,7 @@ keithkoep Keith & Koep GmbH keymile Keymile GmbH khadas Khadas kinetic Kinetic Technologies +kingnovel Kingnovel Technology Co., Ltd. kosagi Sutajio Ko-Usagi PTE Ltd. kyo Kyocera Corporation lacie LaCie @@ -172,8 +178,10 @@ lantiq Lantiq Semiconductor lego LEGO Systems A/S lenovo Lenovo Group Ltd. lg LG Corporation +libretech Shenzhen Libre Technology Co., Ltd licheepi Lichee Pi linaro Linaro Limited +linksys Belkin International, Inc. (Linksys) linux Linux-specific binding lltc Linear Technology Corporation lsi LSI Corp. (LSI Logic) @@ -219,6 +227,7 @@ nexbox Nexbox newhaven Newhaven Display International ni National Instruments nintendo Nintendo +nlt NLT Technologies, Ltd. nokia Nokia nordic Nordic Semiconductor nuvoton Nuvoton Technology Corporation @@ -266,8 +275,10 @@ renesas Renesas Electronics Corporation richtek Richtek Technology Corporation ricoh Ricoh Co. Ltd. rikomagic Rikomagic Tech Corp. Ltd +riscv RISC-V Foundation rockchip Fuzhou Rockchip Electronics Co., Ltd rohm ROHM Semiconductor Co., Ltd +roofull Shenzhen Roofull Technology Co, Ltd samsung Samsung Semiconductor samtec Samtec/Softing company sandisk Sandisk Corporation @@ -331,6 +342,7 @@ tronfy Tronfy tronsmart Tronsmart truly Truly Semiconductors Limited tyan Tyan Computer Corporation +ucrobotics uCRobotics udoo Udoo uniwest United Western Technologies Corp (UniWest) upisemi uPI Semiconductor Corp. @@ -356,6 +368,7 @@ xlnx Xilinx xunlong Shenzhen Xunlong Software CO.,Limited zarlink Zarlink Semiconductor zeitec ZEITEC Semiconductor Co., LTD. +zidoo Shenzhen Zidoo Technology Co., Ltd. zii Zodiac Inflight Innovations zte ZTE Corp. zyxel ZyXEL Communications Corp. diff --git a/Documentation/devicetree/bindings/watchdog/da9062-wdt.txt b/Documentation/devicetree/bindings/watchdog/da9062-wdt.txt new file mode 100644 index 000000000000..b935b526d2f3 --- /dev/null +++ b/Documentation/devicetree/bindings/watchdog/da9062-wdt.txt @@ -0,0 +1,23 @@ +* Dialog Semiconductor DA9062/61 Watchdog Timer + +Required properties: + +- compatible: should be one of the following valid compatible string lines: + "dlg,da9061-watchdog", "dlg,da9062-watchdog" + "dlg,da9062-watchdog" + +Example: DA9062 + + pmic0: da9062@58 { + watchdog { + compatible = "dlg,da9062-watchdog"; + }; + }; + +Example: DA9061 using a fall-back compatible for the DA9062 watchdog driver + + pmic0: da9061@58 { + watchdog { + compatible = "dlg,da9061-watchdog", "dlg,da9062-watchdog"; + }; + }; diff --git a/Documentation/devicetree/bindings/watchdog/dw_wdt.txt b/Documentation/devicetree/bindings/watchdog/dw_wdt.txt index 08e16f684f2d..eb0914420c7c 100644 --- a/Documentation/devicetree/bindings/watchdog/dw_wdt.txt +++ b/Documentation/devicetree/bindings/watchdog/dw_wdt.txt @@ -10,6 +10,8 @@ Required Properties: Optional Properties: - interrupts : The interrupt used for the watchdog timeout warning. +- resets : phandle pointing to the system reset controller with + line index for the watchdog. Example: @@ -18,4 +20,5 @@ Example: reg = <0xffd02000 0x1000>; interrupts = <0 171 4>; clocks = <&per_base_clk>; + resets = <&rst WDT0_RESET>; }; diff --git a/Documentation/devicetree/bindings/watchdog/renesas-wdt.txt b/Documentation/devicetree/bindings/watchdog/renesas-wdt.txt index da24e3133417..9e306afbbd49 100644 --- a/Documentation/devicetree/bindings/watchdog/renesas-wdt.txt +++ b/Documentation/devicetree/bindings/watchdog/renesas-wdt.txt @@ -2,10 +2,11 @@ Renesas Watchdog Timer (WDT) Controller Required properties: - compatible : Should be "renesas,<soctype>-wdt", and - "renesas,rcar-gen3-wdt" as fallback. + "renesas,rcar-gen3-wdt" or "renesas,rza-wdt" as fallback. Examples with soctypes are: - "renesas,r8a7795-wdt" (R-Car H3) - "renesas,r8a7796-wdt" (R-Car M3-W) + - "renesas,r7s72100-wdt" (RZ/A1) When compatible with the generic version, nodes must list the SoC-specific version corresponding to the platform first, followed by the generic @@ -17,6 +18,7 @@ Required properties: Optional properties: - timeout-sec : Contains the watchdog timeout in seconds - power-domains : the power domain the WDT belongs to +- interrupts: Some WDTs have an interrupt when used in interval timer mode Examples: diff --git a/Documentation/devicetree/bindings/watchdog/st,stm32-iwdg.txt b/Documentation/devicetree/bindings/watchdog/st,stm32-iwdg.txt new file mode 100644 index 000000000000..cc13b10a3f82 --- /dev/null +++ b/Documentation/devicetree/bindings/watchdog/st,stm32-iwdg.txt @@ -0,0 +1,19 @@ +STM32 Independent WatchDoG (IWDG) +--------------------------------- + +Required properties: +- compatible: "st,stm32-iwdg" +- reg: physical base address and length of the registers set for the device +- clocks: must contain a single entry describing the clock input + +Optional Properties: +- timeout-sec: Watchdog timeout value in seconds. + +Example: + +iwdg: watchdog@40003000 { + compatible = "st,stm32-iwdg"; + reg = <0x40003000 0x400>; + clocks = <&clk_lsi>; + timeout-sec = <32>; +}; diff --git a/Documentation/devicetree/bindings/watchdog/uniphier-wdt.txt b/Documentation/devicetree/bindings/watchdog/uniphier-wdt.txt new file mode 100644 index 000000000000..bf6337546dd1 --- /dev/null +++ b/Documentation/devicetree/bindings/watchdog/uniphier-wdt.txt @@ -0,0 +1,20 @@ +UniPhier watchdog timer controller + +This UniPhier watchdog timer controller must be under sysctrl node. + +Required properties: +- compatible: should be "socionext,uniphier-wdt" + +Example: + + sysctrl@61840000 { + compatible = "socionext,uniphier-ld11-sysctrl", + "simple-mfd", "syscon"; + reg = <0x61840000 0x4000>; + + watchdog { + compatible = "socionext,uniphier-wdt"; + } + + other nodes ... + }; diff --git a/Documentation/devicetree/booting-without-of.txt b/Documentation/devicetree/booting-without-of.txt index 280d283304bb..fb740445199f 100644 --- a/Documentation/devicetree/booting-without-of.txt +++ b/Documentation/devicetree/booting-without-of.txt @@ -1413,7 +1413,7 @@ Optional property: from DMA operations originating from the bus. It provides a means of defining a mapping or translation between the physical address space of the bus and the physical address space of the parent of the bus. - (for more information see ePAPR specification) + (for more information see the Devicetree Specification) * DMA Bus child Optional property: diff --git a/Documentation/devicetree/overlay-notes.txt b/Documentation/devicetree/overlay-notes.txt index d418a6ce9812..eb7f2685fda1 100644 --- a/Documentation/devicetree/overlay-notes.txt +++ b/Documentation/devicetree/overlay-notes.txt @@ -3,8 +3,7 @@ Device Tree Overlay Notes This document describes the implementation of the in-kernel device tree overlay functionality residing in drivers/of/overlay.c and is a -companion document to Documentation/devicetree/dt-object-internal.txt[1] & -Documentation/devicetree/dynamic-resolution-notes.txt[2] +companion document to Documentation/devicetree/dynamic-resolution-notes.txt[1] How overlays work ----------------- @@ -16,8 +15,7 @@ Since the kernel mainly deals with devices, any new device node that result in an active device should have it created while if the device node is either disabled or removed all together, the affected device should be deregistered. -Lets take an example where we have a foo board with the following base tree -which is taken from [1]. +Lets take an example where we have a foo board with the following base tree: ---- foo.dts ----------------------------------------------------------------- /* FOO platform */ @@ -36,7 +34,7 @@ which is taken from [1]. }; ---- foo.dts ----------------------------------------------------------------- -The overlay bar.dts, when loaded (and resolved as described in [2]) should +The overlay bar.dts, when loaded (and resolved as described in [1]) should ---- bar.dts ----------------------------------------------------------------- /plugin/; /* allow undefined label references and record them */ diff --git a/Documentation/devicetree/usage-model.txt b/Documentation/devicetree/usage-model.txt index 2b6b3d3f0388..33a8aaac02a8 100644 --- a/Documentation/devicetree/usage-model.txt +++ b/Documentation/devicetree/usage-model.txt @@ -387,7 +387,7 @@ static void __init harmony_init_machine(void) of_platform_populate(NULL, of_default_bus_match_table, NULL, NULL); } -"simple-bus" is defined in the ePAPR 1.0 specification as a property +"simple-bus" is defined in the Devicetree Specification as a property meaning a simple memory mapped bus, so the of_platform_populate() code could be written to just assume simple-bus compatible nodes will always be traversed. However, we pass it in as an argument so that diff --git a/Documentation/digsig.txt b/Documentation/digsig.txt index 3f682889068b..f6a8902d3ef7 100644 --- a/Documentation/digsig.txt +++ b/Documentation/digsig.txt @@ -1,13 +1,20 @@ +================================== Digital Signature Verification API +================================== -CONTENTS +:Author: Dmitry Kasatkin +:Date: 06.10.2011 -1. Introduction -2. API -3. User-space utilities +.. CONTENTS -1. Introduction + 1. Introduction + 2. API + 3. User-space utilities + + +Introduction +============ Digital signature verification API provides a method to verify digital signature. Currently digital signatures are used by the IMA/EVM integrity protection subsystem. @@ -17,25 +24,25 @@ GnuPG multi-precision integers (MPI) library. The kernel port provides memory allocation errors handling, has been refactored according to kernel coding style, and checkpatch.pl reported errors and warnings have been fixed. -Public key and signature consist of header and MPIs. - -struct pubkey_hdr { - uint8_t version; /* key format version */ - time_t timestamp; /* key made, always 0 for now */ - uint8_t algo; - uint8_t nmpi; - char mpi[0]; -} __packed; - -struct signature_hdr { - uint8_t version; /* signature format version */ - time_t timestamp; /* signature made */ - uint8_t algo; - uint8_t hash; - uint8_t keyid[8]; - uint8_t nmpi; - char mpi[0]; -} __packed; +Public key and signature consist of header and MPIs:: + + struct pubkey_hdr { + uint8_t version; /* key format version */ + time_t timestamp; /* key made, always 0 for now */ + uint8_t algo; + uint8_t nmpi; + char mpi[0]; + } __packed; + + struct signature_hdr { + uint8_t version; /* signature format version */ + time_t timestamp; /* signature made */ + uint8_t algo; + uint8_t hash; + uint8_t keyid[8]; + uint8_t nmpi; + char mpi[0]; + } __packed; keyid equals to SHA1[12-19] over the total key content. Signature header is used as an input to generate a signature. @@ -43,31 +50,33 @@ Such approach insures that key or signature header could not be changed. It protects timestamp from been changed and can be used for rollback protection. -2. API +API +=== -API currently includes only 1 function: +API currently includes only 1 function:: digsig_verify() - digital signature verification with public key -/** - * digsig_verify() - digital signature verification with public key - * @keyring: keyring to search key in - * @sig: digital signature - * @sigen: length of the signature - * @data: data - * @datalen: length of the data - * @return: 0 on success, -EINVAL otherwise - * - * Verifies data integrity against digital signature. - * Currently only RSA is supported. - * Normally hash of the content is used as a data for this function. - * - */ -int digsig_verify(struct key *keyring, const char *sig, int siglen, - const char *data, int datalen); - -3. User-space utilities + /** + * digsig_verify() - digital signature verification with public key + * @keyring: keyring to search key in + * @sig: digital signature + * @sigen: length of the signature + * @data: data + * @datalen: length of the data + * @return: 0 on success, -EINVAL otherwise + * + * Verifies data integrity against digital signature. + * Currently only RSA is supported. + * Normally hash of the content is used as a data for this function. + * + */ + int digsig_verify(struct key *keyring, const char *sig, int siglen, + const char *data, int datalen); + +User-space utilities +==================== The signing and key management utilities evm-utils provide functionality to generate signatures, to load keys into the kernel keyring. @@ -75,22 +84,18 @@ Keys can be in PEM or converted to the kernel format. When the key is added to the kernel keyring, the keyid defines the name of the key: 5D2B05FC633EE3E8 in the example bellow. -Here is example output of the keyctl utility. - -$ keyctl show -Session Keyring - -3 --alswrv 0 0 keyring: _ses -603976250 --alswrv 0 -1 \_ keyring: _uid.0 -817777377 --alswrv 0 0 \_ user: kmk -891974900 --alswrv 0 0 \_ encrypted: evm-key -170323636 --alswrv 0 0 \_ keyring: _module -548221616 --alswrv 0 0 \_ keyring: _ima -128198054 --alswrv 0 0 \_ keyring: _evm - -$ keyctl list 128198054 -1 key in keyring: -620789745: --alswrv 0 0 user: 5D2B05FC633EE3E8 - - -Dmitry Kasatkin -06.10.2011 +Here is example output of the keyctl utility:: + + $ keyctl show + Session Keyring + -3 --alswrv 0 0 keyring: _ses + 603976250 --alswrv 0 -1 \_ keyring: _uid.0 + 817777377 --alswrv 0 0 \_ user: kmk + 891974900 --alswrv 0 0 \_ encrypted: evm-key + 170323636 --alswrv 0 0 \_ keyring: _module + 548221616 --alswrv 0 0 \_ keyring: _ima + 128198054 --alswrv 0 0 \_ keyring: _evm + + $ keyctl list 128198054 + 1 key in keyring: + 620789745: --alswrv 0 0 user: 5D2B05FC633EE3E8 diff --git a/Documentation/doc-guide/docbook.rst b/Documentation/doc-guide/docbook.rst deleted file mode 100644 index d8bf04308b43..000000000000 --- a/Documentation/doc-guide/docbook.rst +++ /dev/null @@ -1,90 +0,0 @@ -DocBook XML [DEPRECATED] -======================== - -.. attention:: - - This section describes the deprecated DocBook XML toolchain. Please do not - create new DocBook XML template files. Please consider converting existing - DocBook XML templates files to Sphinx/reStructuredText. - -Converting DocBook to Sphinx ----------------------------- - -Over time, we expect all of the documents under ``Documentation/DocBook`` to be -converted to Sphinx and reStructuredText. For most DocBook XML documents, a good -enough solution is to use the simple ``Documentation/sphinx/tmplcvt`` script, -which uses ``pandoc`` under the hood. For example:: - - $ cd Documentation/sphinx - $ ./tmplcvt ../DocBook/in.tmpl ../out.rst - -Then edit the resulting rst files to fix any remaining issues, and add the -document in the ``toctree`` in ``Documentation/index.rst``. - -Components of the kernel-doc system ------------------------------------ - -Many places in the source tree have extractable documentation in the form of -block comments above functions. The components of this system are: - -- ``scripts/kernel-doc`` - - This is a perl script that hunts for the block comments and can mark them up - directly into reStructuredText, DocBook, man, text, and HTML. (No, not - texinfo.) - -- ``Documentation/DocBook/*.tmpl`` - - These are XML template files, which are normal XML files with special - place-holders for where the extracted documentation should go. - -- ``scripts/docproc.c`` - - This is a program for converting XML template files into XML files. When a - file is referenced it is searched for symbols exported (EXPORT_SYMBOL), to be - able to distinguish between internal and external functions. - - It invokes kernel-doc, giving it the list of functions that are to be - documented. - - Additionally it is used to scan the XML template files to locate all the files - referenced herein. This is used to generate dependency information as used by - make. - -- ``Makefile`` - - The targets 'xmldocs', 'psdocs', 'pdfdocs', and 'htmldocs' are used to build - DocBook XML files, PostScript files, PDF files, and html files in - Documentation/DocBook. The older target 'sgmldocs' is equivalent to 'xmldocs'. - -- ``Documentation/DocBook/Makefile`` - - This is where C files are associated with SGML templates. - -How to use kernel-doc comments in DocBook XML template files ------------------------------------------------------------- - -DocBook XML template files (\*.tmpl) are like normal XML files, except that they -can contain escape sequences where extracted documentation should be inserted. - -``!E<filename>`` is replaced by the documentation, in ``<filename>``, for -functions that are exported using ``EXPORT_SYMBOL``: the function list is -collected from files listed in ``Documentation/DocBook/Makefile``. - -``!I<filename>`` is replaced by the documentation for functions that are **not** -exported using ``EXPORT_SYMBOL``. - -``!D<filename>`` is used to name additional files to search for functions -exported using ``EXPORT_SYMBOL``. - -``!F<filename> <function [functions...]>`` is replaced by the documentation, in -``<filename>``, for the functions listed. - -``!P<filename> <section title>`` is replaced by the contents of the ``DOC:`` -section titled ``<section title>`` from ``<filename>``. Spaces are allowed in -``<section title>``; do not quote the ``<section title>``. - -``!C<filename>`` is replaced by nothing, but makes the tools check that all DOC: -sections and documented functions, symbols, etc. are used. This makes sense to -use when you use ``!F`` or ``!P`` only and want to verify that all documentation -is included. diff --git a/Documentation/doc-guide/index.rst b/Documentation/doc-guide/index.rst index 6fff4024606e..a7f95d7d3a63 100644 --- a/Documentation/doc-guide/index.rst +++ b/Documentation/doc-guide/index.rst @@ -10,7 +10,6 @@ How to write kernel documentation sphinx.rst kernel-doc.rst parse-headers.rst - docbook.rst .. only:: subproject and html diff --git a/Documentation/doc-guide/kernel-doc.rst b/Documentation/doc-guide/kernel-doc.rst index b32e4813ff6f..b24854b5d6be 100644 --- a/Documentation/doc-guide/kernel-doc.rst +++ b/Documentation/doc-guide/kernel-doc.rst @@ -149,6 +149,16 @@ Domain`_ references. ``%CONST`` Name of a constant. (No cross-referencing, just formatting.) +````literal```` + A literal block that should be handled as-is. The output will use a + ``monospaced font``. + + Useful if you need to use special characters that would otherwise have some + meaning either by kernel-doc script of by reStructuredText. + + This is particularly useful if you need to use things like ``%ph`` inside + a function description. + ``$ENVVAR`` Name of an environment variable. (No cross-referencing, just formatting.) diff --git a/Documentation/doc-guide/sphinx.rst b/Documentation/doc-guide/sphinx.rst index 731334de3efd..84e8e8a9cbdb 100644 --- a/Documentation/doc-guide/sphinx.rst +++ b/Documentation/doc-guide/sphinx.rst @@ -15,11 +15,6 @@ are used to describe the functions and types and design of the code. The kernel-doc comments have some special structure and formatting, but beyond that they are also treated as reStructuredText. -There is also the deprecated DocBook toolchain to generate documentation from -DocBook XML template files under ``Documentation/DocBook``. The DocBook files -are to be converted to reStructuredText, and the toolchain is slated to be -removed. - Finally, there are thousands of plain text documentation files scattered around ``Documentation``. Some of these will likely be converted to reStructuredText over time, but the bulk of them will remain in plain text. diff --git a/Documentation/dontdiff b/Documentation/dontdiff index 77b92221f951..358b47c06ad4 100644 --- a/Documentation/dontdiff +++ b/Documentation/dontdiff @@ -118,7 +118,6 @@ defkeymap.c devlist.h* devicetable-offsets.h dnotify_test -docproc dslm dtc elf2ecoff @@ -207,6 +206,8 @@ r200_reg_safe.h r300_reg_safe.h r420_reg_safe.h r600_reg_safe.h +randomize_layout_hash.h +randomize_layout_seed.h recordmcount relocs rlim_names.h diff --git a/Documentation/driver-api/basics.rst b/Documentation/driver-api/basics.rst index 472e7a664d13..ab82250c7727 100644 --- a/Documentation/driver-api/basics.rst +++ b/Documentation/driver-api/basics.rst @@ -106,9 +106,6 @@ Kernel utility functions .. kernel-doc:: kernel/sys.c :export: -.. kernel-doc:: kernel/rcu/srcu.c - :export: - .. kernel-doc:: kernel/rcu/tree.c :export: diff --git a/Documentation/driver-api/firmware/request_firmware.rst b/Documentation/driver-api/firmware/request_firmware.rst index cc0aea880824..1c2c4967cd43 100644 --- a/Documentation/driver-api/firmware/request_firmware.rst +++ b/Documentation/driver-api/firmware/request_firmware.rst @@ -44,6 +44,17 @@ request_firmware_nowait .. kernel-doc:: drivers/base/firmware_class.c :functions: request_firmware_nowait +Considerations for suspend and resume +===================================== + +During suspend and resume only the built-in firmware and the firmware cache +elements of the firmware API can be used. This is managed by fw_pm_notify(). + +fw_pm_notify +------------ +.. kernel-doc:: drivers/base/firmware_class.c + :functions: fw_pm_notify + request firmware API expected driver use ======================================== diff --git a/Documentation/driver-api/i2c.rst b/Documentation/driver-api/i2c.rst index f3939f7852bd..7582c079d747 100644 --- a/Documentation/driver-api/i2c.rst +++ b/Documentation/driver-api/i2c.rst @@ -13,8 +13,8 @@ I2C is a multi-master bus; open drain signaling is used to arbitrate between masters, as well as to handshake and to synchronize clocks from slower clients. -The Linux I2C programming interfaces support only the master side of bus -interactions, not the slave side. The programming interface is +The Linux I2C programming interfaces support the master side of bus +interactions and the slave side. The programming interface is structured around two kinds of driver, and two kinds of device. An I2C "Adapter Driver" abstracts the controller hardware; it binds to a physical device (perhaps a PCI device or platform_device) and exposes a @@ -22,9 +22,8 @@ physical device (perhaps a PCI device or platform_device) and exposes a I2C bus segment it manages. On each I2C bus segment will be I2C devices represented by a :c:type:`struct i2c_client <i2c_client>`. Those devices will be bound to a :c:type:`struct i2c_driver -<i2c_driver>`, which should follow the standard Linux driver -model. (At this writing, a legacy model is more widely used.) There are -functions to perform various I2C protocol operations; at this writing +<i2c_driver>`, which should follow the standard Linux driver model. There +are functions to perform various I2C protocol operations; at this writing all such functions are usable only from task context. The System Management Bus (SMBus) is a sibling protocol. Most SMBus @@ -42,5 +41,8 @@ i2c_adapter devices which don't support those I2C operations. .. kernel-doc:: drivers/i2c/i2c-boardinfo.c :functions: i2c_register_board_info -.. kernel-doc:: drivers/i2c/i2c-core.c +.. kernel-doc:: drivers/i2c/i2c-core-base.c + :export: + +.. kernel-doc:: drivers/i2c/i2c-core-smbus.c :export: diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index 8058a87c1c74..7c94ab50afed 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -32,11 +32,18 @@ available subsections can be seen below. i2c hsi edac + scsi + libata + mtdnand miscellaneous + w1 + rapidio + s390-drivers vme 80211/index uio-howto firmware/index + pinctl misc_devices .. only:: subproject and html diff --git a/Documentation/driver-api/libata.rst b/Documentation/driver-api/libata.rst new file mode 100644 index 000000000000..4adc056f7635 --- /dev/null +++ b/Documentation/driver-api/libata.rst @@ -0,0 +1,1031 @@ +======================== +libATA Developer's Guide +======================== + +:Author: Jeff Garzik + +Introduction +============ + +libATA is a library used inside the Linux kernel to support ATA host +controllers and devices. libATA provides an ATA driver API, class +transports for ATA and ATAPI devices, and SCSI<->ATA translation for ATA +devices according to the T10 SAT specification. + +This Guide documents the libATA driver API, library functions, library +internals, and a couple sample ATA low-level drivers. + +libata Driver API +================= + +:c:type:`struct ata_port_operations <ata_port_operations>` +is defined for every low-level libata +hardware driver, and it controls how the low-level driver interfaces +with the ATA and SCSI layers. + +FIS-based drivers will hook into the system with ``->qc_prep()`` and +``->qc_issue()`` high-level hooks. Hardware which behaves in a manner +similar to PCI IDE hardware may utilize several generic helpers, +defining at a bare minimum the bus I/O addresses of the ATA shadow +register blocks. + +:c:type:`struct ata_port_operations <ata_port_operations>` +---------------------------------------------------------- + +Disable ATA port +~~~~~~~~~~~~~~~~ + +:: + + void (*port_disable) (struct ata_port *); + + +Called from :c:func:`ata_bus_probe` error path, as well as when unregistering +from the SCSI module (rmmod, hot unplug). This function should do +whatever needs to be done to take the port out of use. In most cases, +:c:func:`ata_port_disable` can be used as this hook. + +Called from :c:func:`ata_bus_probe` on a failed probe. Called from +:c:func:`ata_scsi_release`. + +Post-IDENTIFY device configuration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + void (*dev_config) (struct ata_port *, struct ata_device *); + + +Called after IDENTIFY [PACKET] DEVICE is issued to each device found. +Typically used to apply device-specific fixups prior to issue of SET +FEATURES - XFER MODE, and prior to operation. + +This entry may be specified as NULL in ata_port_operations. + +Set PIO/DMA mode +~~~~~~~~~~~~~~~~ + +:: + + void (*set_piomode) (struct ata_port *, struct ata_device *); + void (*set_dmamode) (struct ata_port *, struct ata_device *); + void (*post_set_mode) (struct ata_port *); + unsigned int (*mode_filter) (struct ata_port *, struct ata_device *, unsigned int); + + +Hooks called prior to the issue of SET FEATURES - XFER MODE command. The +optional ``->mode_filter()`` hook is called when libata has built a mask of +the possible modes. This is passed to the ``->mode_filter()`` function +which should return a mask of valid modes after filtering those +unsuitable due to hardware limits. It is not valid to use this interface +to add modes. + +``dev->pio_mode`` and ``dev->dma_mode`` are guaranteed to be valid when +``->set_piomode()`` and when ``->set_dmamode()`` is called. The timings for +any other drive sharing the cable will also be valid at this point. That +is the library records the decisions for the modes of each drive on a +channel before it attempts to set any of them. + +``->post_set_mode()`` is called unconditionally, after the SET FEATURES - +XFER MODE command completes successfully. + +``->set_piomode()`` is always called (if present), but ``->set_dma_mode()`` +is only called if DMA is possible. + +Taskfile read/write +~~~~~~~~~~~~~~~~~~~ + +:: + + void (*sff_tf_load) (struct ata_port *ap, struct ata_taskfile *tf); + void (*sff_tf_read) (struct ata_port *ap, struct ata_taskfile *tf); + + +``->tf_load()`` is called to load the given taskfile into hardware +registers / DMA buffers. ``->tf_read()`` is called to read the hardware +registers / DMA buffers, to obtain the current set of taskfile register +values. Most drivers for taskfile-based hardware (PIO or MMIO) use +:c:func:`ata_sff_tf_load` and :c:func:`ata_sff_tf_read` for these hooks. + +PIO data read/write +~~~~~~~~~~~~~~~~~~~ + +:: + + void (*sff_data_xfer) (struct ata_device *, unsigned char *, unsigned int, int); + + +All bmdma-style drivers must implement this hook. This is the low-level +operation that actually copies the data bytes during a PIO data +transfer. Typically the driver will choose one of +:c:func:`ata_sff_data_xfer_noirq`, :c:func:`ata_sff_data_xfer`, or +:c:func:`ata_sff_data_xfer32`. + +ATA command execute +~~~~~~~~~~~~~~~~~~~ + +:: + + void (*sff_exec_command)(struct ata_port *ap, struct ata_taskfile *tf); + + +causes an ATA command, previously loaded with ``->tf_load()``, to be +initiated in hardware. Most drivers for taskfile-based hardware use +:c:func:`ata_sff_exec_command` for this hook. + +Per-cmd ATAPI DMA capabilities filter +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + int (*check_atapi_dma) (struct ata_queued_cmd *qc); + + +Allow low-level driver to filter ATA PACKET commands, returning a status +indicating whether or not it is OK to use DMA for the supplied PACKET +command. + +This hook may be specified as NULL, in which case libata will assume +that atapi dma can be supported. + +Read specific ATA shadow registers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + u8 (*sff_check_status)(struct ata_port *ap); + u8 (*sff_check_altstatus)(struct ata_port *ap); + + +Reads the Status/AltStatus ATA shadow register from hardware. On some +hardware, reading the Status register has the side effect of clearing +the interrupt condition. Most drivers for taskfile-based hardware use +:c:func:`ata_sff_check_status` for this hook. + +Write specific ATA shadow register +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + void (*sff_set_devctl)(struct ata_port *ap, u8 ctl); + + +Write the device control ATA shadow register to the hardware. Most +drivers don't need to define this. + +Select ATA device on bus +~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + void (*sff_dev_select)(struct ata_port *ap, unsigned int device); + + +Issues the low-level hardware command(s) that causes one of N hardware +devices to be considered 'selected' (active and available for use) on +the ATA bus. This generally has no meaning on FIS-based devices. + +Most drivers for taskfile-based hardware use :c:func:`ata_sff_dev_select` for +this hook. + +Private tuning method +~~~~~~~~~~~~~~~~~~~~~ + +:: + + void (*set_mode) (struct ata_port *ap); + + +By default libata performs drive and controller tuning in accordance +with the ATA timing rules and also applies blacklists and cable limits. +Some controllers need special handling and have custom tuning rules, +typically raid controllers that use ATA commands but do not actually do +drive timing. + + **Warning** + + This hook should not be used to replace the standard controller + tuning logic when a controller has quirks. Replacing the default + tuning logic in that case would bypass handling for drive and bridge + quirks that may be important to data reliability. If a controller + needs to filter the mode selection it should use the mode_filter + hook instead. + +Control PCI IDE BMDMA engine +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + void (*bmdma_setup) (struct ata_queued_cmd *qc); + void (*bmdma_start) (struct ata_queued_cmd *qc); + void (*bmdma_stop) (struct ata_port *ap); + u8 (*bmdma_status) (struct ata_port *ap); + + +When setting up an IDE BMDMA transaction, these hooks arm +(``->bmdma_setup``), fire (``->bmdma_start``), and halt (``->bmdma_stop``) the +hardware's DMA engine. ``->bmdma_status`` is used to read the standard PCI +IDE DMA Status register. + +These hooks are typically either no-ops, or simply not implemented, in +FIS-based drivers. + +Most legacy IDE drivers use :c:func:`ata_bmdma_setup` for the +:c:func:`bmdma_setup` hook. :c:func:`ata_bmdma_setup` will write the pointer +to the PRD table to the IDE PRD Table Address register, enable DMA in the DMA +Command register, and call :c:func:`exec_command` to begin the transfer. + +Most legacy IDE drivers use :c:func:`ata_bmdma_start` for the +:c:func:`bmdma_start` hook. :c:func:`ata_bmdma_start` will write the +ATA_DMA_START flag to the DMA Command register. + +Many legacy IDE drivers use :c:func:`ata_bmdma_stop` for the +:c:func:`bmdma_stop` hook. :c:func:`ata_bmdma_stop` clears the ATA_DMA_START +flag in the DMA command register. + +Many legacy IDE drivers use :c:func:`ata_bmdma_status` as the +:c:func:`bmdma_status` hook. + +High-level taskfile hooks +~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + void (*qc_prep) (struct ata_queued_cmd *qc); + int (*qc_issue) (struct ata_queued_cmd *qc); + + +Higher-level hooks, these two hooks can potentially supercede several of +the above taskfile/DMA engine hooks. ``->qc_prep`` is called after the +buffers have been DMA-mapped, and is typically used to populate the +hardware's DMA scatter-gather table. Most drivers use the standard +:c:func:`ata_qc_prep` helper function, but more advanced drivers roll their +own. + +``->qc_issue`` is used to make a command active, once the hardware and S/G +tables have been prepared. IDE BMDMA drivers use the helper function +:c:func:`ata_qc_issue_prot` for taskfile protocol-based dispatch. More +advanced drivers implement their own ``->qc_issue``. + +:c:func:`ata_qc_issue_prot` calls ``->tf_load()``, ``->bmdma_setup()``, and +``->bmdma_start()`` as necessary to initiate a transfer. + +Exception and probe handling (EH) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + void (*eng_timeout) (struct ata_port *ap); + void (*phy_reset) (struct ata_port *ap); + + +Deprecated. Use ``->error_handler()`` instead. + +:: + + void (*freeze) (struct ata_port *ap); + void (*thaw) (struct ata_port *ap); + + +:c:func:`ata_port_freeze` is called when HSM violations or some other +condition disrupts normal operation of the port. A frozen port is not +allowed to perform any operation until the port is thawed, which usually +follows a successful reset. + +The optional ``->freeze()`` callback can be used for freezing the port +hardware-wise (e.g. mask interrupt and stop DMA engine). If a port +cannot be frozen hardware-wise, the interrupt handler must ack and clear +interrupts unconditionally while the port is frozen. + +The optional ``->thaw()`` callback is called to perform the opposite of +``->freeze()``: prepare the port for normal operation once again. Unmask +interrupts, start DMA engine, etc. + +:: + + void (*error_handler) (struct ata_port *ap); + + +``->error_handler()`` is a driver's hook into probe, hotplug, and recovery +and other exceptional conditions. The primary responsibility of an +implementation is to call :c:func:`ata_do_eh` or :c:func:`ata_bmdma_drive_eh` +with a set of EH hooks as arguments: + +'prereset' hook (may be NULL) is called during an EH reset, before any +other actions are taken. + +'postreset' hook (may be NULL) is called after the EH reset is +performed. Based on existing conditions, severity of the problem, and +hardware capabilities, + +Either 'softreset' (may be NULL) or 'hardreset' (may be NULL) will be +called to perform the low-level EH reset. + +:: + + void (*post_internal_cmd) (struct ata_queued_cmd *qc); + + +Perform any hardware-specific actions necessary to finish processing +after executing a probe-time or EH-time command via +:c:func:`ata_exec_internal`. + +Hardware interrupt handling +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + irqreturn_t (*irq_handler)(int, void *, struct pt_regs *); + void (*irq_clear) (struct ata_port *); + + +``->irq_handler`` is the interrupt handling routine registered with the +system, by libata. ``->irq_clear`` is called during probe just before the +interrupt handler is registered, to be sure hardware is quiet. + +The second argument, dev_instance, should be cast to a pointer to +:c:type:`struct ata_host_set <ata_host_set>`. + +Most legacy IDE drivers use :c:func:`ata_sff_interrupt` for the irq_handler +hook, which scans all ports in the host_set, determines which queued +command was active (if any), and calls ata_sff_host_intr(ap,qc). + +Most legacy IDE drivers use :c:func:`ata_sff_irq_clear` for the +:c:func:`irq_clear` hook, which simply clears the interrupt and error flags +in the DMA status register. + +SATA phy read/write +~~~~~~~~~~~~~~~~~~~ + +:: + + int (*scr_read) (struct ata_port *ap, unsigned int sc_reg, + u32 *val); + int (*scr_write) (struct ata_port *ap, unsigned int sc_reg, + u32 val); + + +Read and write standard SATA phy registers. Currently only used if +``->phy_reset`` hook called the :c:func:`sata_phy_reset` helper function. +sc_reg is one of SCR_STATUS, SCR_CONTROL, SCR_ERROR, or SCR_ACTIVE. + +Init and shutdown +~~~~~~~~~~~~~~~~~ + +:: + + int (*port_start) (struct ata_port *ap); + void (*port_stop) (struct ata_port *ap); + void (*host_stop) (struct ata_host_set *host_set); + + +``->port_start()`` is called just after the data structures for each port +are initialized. Typically this is used to alloc per-port DMA buffers / +tables / rings, enable DMA engines, and similar tasks. Some drivers also +use this entry point as a chance to allocate driver-private memory for +``ap->private_data``. + +Many drivers use :c:func:`ata_port_start` as this hook or call it from their +own :c:func:`port_start` hooks. :c:func:`ata_port_start` allocates space for +a legacy IDE PRD table and returns. + +``->port_stop()`` is called after ``->host_stop()``. Its sole function is to +release DMA/memory resources, now that they are no longer actively being +used. Many drivers also free driver-private data from port at this time. + +``->host_stop()`` is called after all ``->port_stop()`` calls have completed. +The hook must finalize hardware shutdown, release DMA and other +resources, etc. This hook may be specified as NULL, in which case it is +not called. + +Error handling +============== + +This chapter describes how errors are handled under libata. Readers are +advised to read SCSI EH (Documentation/scsi/scsi_eh.txt) and ATA +exceptions doc first. + +Origins of commands +------------------- + +In libata, a command is represented with +:c:type:`struct ata_queued_cmd <ata_queued_cmd>` or qc. +qc's are preallocated during port initialization and repetitively used +for command executions. Currently only one qc is allocated per port but +yet-to-be-merged NCQ branch allocates one for each tag and maps each qc +to NCQ tag 1-to-1. + +libata commands can originate from two sources - libata itself and SCSI +midlayer. libata internal commands are used for initialization and error +handling. All normal blk requests and commands for SCSI emulation are +passed as SCSI commands through queuecommand callback of SCSI host +template. + +How commands are issued +----------------------- + +Internal commands + First, qc is allocated and initialized using :c:func:`ata_qc_new_init`. + Although :c:func:`ata_qc_new_init` doesn't implement any wait or retry + mechanism when qc is not available, internal commands are currently + issued only during initialization and error recovery, so no other + command is active and allocation is guaranteed to succeed. + + Once allocated qc's taskfile is initialized for the command to be + executed. qc currently has two mechanisms to notify completion. One + is via ``qc->complete_fn()`` callback and the other is completion + ``qc->waiting``. ``qc->complete_fn()`` callback is the asynchronous path + used by normal SCSI translated commands and ``qc->waiting`` is the + synchronous (issuer sleeps in process context) path used by internal + commands. + + Once initialization is complete, host_set lock is acquired and the + qc is issued. + +SCSI commands + All libata drivers use :c:func:`ata_scsi_queuecmd` as + ``hostt->queuecommand`` callback. scmds can either be simulated or + translated. No qc is involved in processing a simulated scmd. The + result is computed right away and the scmd is completed. + + For a translated scmd, :c:func:`ata_qc_new_init` is invoked to allocate a + qc and the scmd is translated into the qc. SCSI midlayer's + completion notification function pointer is stored into + ``qc->scsidone``. + + ``qc->complete_fn()`` callback is used for completion notification. ATA + commands use :c:func:`ata_scsi_qc_complete` while ATAPI commands use + :c:func:`atapi_qc_complete`. Both functions end up calling ``qc->scsidone`` + to notify upper layer when the qc is finished. After translation is + completed, the qc is issued with :c:func:`ata_qc_issue`. + + Note that SCSI midlayer invokes hostt->queuecommand while holding + host_set lock, so all above occur while holding host_set lock. + +How commands are processed +-------------------------- + +Depending on which protocol and which controller are used, commands are +processed differently. For the purpose of discussion, a controller which +uses taskfile interface and all standard callbacks is assumed. + +Currently 6 ATA command protocols are used. They can be sorted into the +following four categories according to how they are processed. + +ATA NO DATA or DMA + ATA_PROT_NODATA and ATA_PROT_DMA fall into this category. These + types of commands don't require any software intervention once + issued. Device will raise interrupt on completion. + +ATA PIO + ATA_PROT_PIO is in this category. libata currently implements PIO + with polling. ATA_NIEN bit is set to turn off interrupt and + pio_task on ata_wq performs polling and IO. + +ATAPI NODATA or DMA + ATA_PROT_ATAPI_NODATA and ATA_PROT_ATAPI_DMA are in this + category. packet_task is used to poll BSY bit after issuing PACKET + command. Once BSY is turned off by the device, packet_task + transfers CDB and hands off processing to interrupt handler. + +ATAPI PIO + ATA_PROT_ATAPI is in this category. ATA_NIEN bit is set and, as + in ATAPI NODATA or DMA, packet_task submits cdb. However, after + submitting cdb, further processing (data transfer) is handed off to + pio_task. + +How commands are completed +-------------------------- + +Once issued, all qc's are either completed with :c:func:`ata_qc_complete` or +time out. For commands which are handled by interrupts, +:c:func:`ata_host_intr` invokes :c:func:`ata_qc_complete`, and, for PIO tasks, +pio_task invokes :c:func:`ata_qc_complete`. In error cases, packet_task may +also complete commands. + +:c:func:`ata_qc_complete` does the following. + +1. DMA memory is unmapped. + +2. ATA_QCFLAG_ACTIVE is cleared from qc->flags. + +3. :c:func:`qc->complete_fn` callback is invoked. If the return value of the + callback is not zero. Completion is short circuited and + :c:func:`ata_qc_complete` returns. + +4. :c:func:`__ata_qc_complete` is called, which does + + 1. ``qc->flags`` is cleared to zero. + + 2. ``ap->active_tag`` and ``qc->tag`` are poisoned. + + 3. ``qc->waiting`` is cleared & completed (in that order). + + 4. qc is deallocated by clearing appropriate bit in ``ap->qactive``. + +So, it basically notifies upper layer and deallocates qc. One exception +is short-circuit path in #3 which is used by :c:func:`atapi_qc_complete`. + +For all non-ATAPI commands, whether it fails or not, almost the same +code path is taken and very little error handling takes place. A qc is +completed with success status if it succeeded, with failed status +otherwise. + +However, failed ATAPI commands require more handling as REQUEST SENSE is +needed to acquire sense data. If an ATAPI command fails, +:c:func:`ata_qc_complete` is invoked with error status, which in turn invokes +:c:func:`atapi_qc_complete` via ``qc->complete_fn()`` callback. + +This makes :c:func:`atapi_qc_complete` set ``scmd->result`` to +SAM_STAT_CHECK_CONDITION, complete the scmd and return 1. As the +sense data is empty but ``scmd->result`` is CHECK CONDITION, SCSI midlayer +will invoke EH for the scmd, and returning 1 makes :c:func:`ata_qc_complete` +to return without deallocating the qc. This leads us to +:c:func:`ata_scsi_error` with partially completed qc. + +:c:func:`ata_scsi_error` +------------------------ + +:c:func:`ata_scsi_error` is the current ``transportt->eh_strategy_handler()`` +for libata. As discussed above, this will be entered in two cases - +timeout and ATAPI error completion. This function calls low level libata +driver's :c:func:`eng_timeout` callback, the standard callback for which is +:c:func:`ata_eng_timeout`. It checks if a qc is active and calls +:c:func:`ata_qc_timeout` on the qc if so. Actual error handling occurs in +:c:func:`ata_qc_timeout`. + +If EH is invoked for timeout, :c:func:`ata_qc_timeout` stops BMDMA and +completes the qc. Note that as we're currently in EH, we cannot call +scsi_done. As described in SCSI EH doc, a recovered scmd should be +either retried with :c:func:`scsi_queue_insert` or finished with +:c:func:`scsi_finish_command`. Here, we override ``qc->scsidone`` with +:c:func:`scsi_finish_command` and calls :c:func:`ata_qc_complete`. + +If EH is invoked due to a failed ATAPI qc, the qc here is completed but +not deallocated. The purpose of this half-completion is to use the qc as +place holder to make EH code reach this place. This is a bit hackish, +but it works. + +Once control reaches here, the qc is deallocated by invoking +:c:func:`__ata_qc_complete` explicitly. Then, internal qc for REQUEST SENSE +is issued. Once sense data is acquired, scmd is finished by directly +invoking :c:func:`scsi_finish_command` on the scmd. Note that as we already +have completed and deallocated the qc which was associated with the +scmd, we don't need to/cannot call :c:func:`ata_qc_complete` again. + +Problems with the current EH +---------------------------- + +- Error representation is too crude. Currently any and all error + conditions are represented with ATA STATUS and ERROR registers. + Errors which aren't ATA device errors are treated as ATA device + errors by setting ATA_ERR bit. Better error descriptor which can + properly represent ATA and other errors/exceptions is needed. + +- When handling timeouts, no action is taken to make device forget + about the timed out command and ready for new commands. + +- EH handling via :c:func:`ata_scsi_error` is not properly protected from + usual command processing. On EH entrance, the device is not in + quiescent state. Timed out commands may succeed or fail any time. + pio_task and atapi_task may still be running. + +- Too weak error recovery. Devices / controllers causing HSM mismatch + errors and other errors quite often require reset to return to known + state. Also, advanced error handling is necessary to support features + like NCQ and hotplug. + +- ATA errors are directly handled in the interrupt handler and PIO + errors in pio_task. This is problematic for advanced error handling + for the following reasons. + + First, advanced error handling often requires context and internal qc + execution. + + Second, even a simple failure (say, CRC error) needs information + gathering and could trigger complex error handling (say, resetting & + reconfiguring). Having multiple code paths to gather information, + enter EH and trigger actions makes life painful. + + Third, scattered EH code makes implementing low level drivers + difficult. Low level drivers override libata callbacks. If EH is + scattered over several places, each affected callbacks should perform + its part of error handling. This can be error prone and painful. + +libata Library +============== + +.. kernel-doc:: drivers/ata/libata-core.c + :export: + +libata Core Internals +===================== + +.. kernel-doc:: drivers/ata/libata-core.c + :internal: + +.. kernel-doc:: drivers/ata/libata-eh.c + +libata SCSI translation/emulation +================================= + +.. kernel-doc:: drivers/ata/libata-scsi.c + :export: + +.. kernel-doc:: drivers/ata/libata-scsi.c + :internal: + +ATA errors and exceptions +========================= + +This chapter tries to identify what error/exception conditions exist for +ATA/ATAPI devices and describe how they should be handled in +implementation-neutral way. + +The term 'error' is used to describe conditions where either an explicit +error condition is reported from device or a command has timed out. + +The term 'exception' is either used to describe exceptional conditions +which are not errors (say, power or hotplug events), or to describe both +errors and non-error exceptional conditions. Where explicit distinction +between error and exception is necessary, the term 'non-error exception' +is used. + +Exception categories +-------------------- + +Exceptions are described primarily with respect to legacy taskfile + bus +master IDE interface. If a controller provides other better mechanism +for error reporting, mapping those into categories described below +shouldn't be difficult. + +In the following sections, two recovery actions - reset and +reconfiguring transport - are mentioned. These are described further in +`EH recovery actions <#exrec>`__. + +HSM violation +~~~~~~~~~~~~~ + +This error is indicated when STATUS value doesn't match HSM requirement +during issuing or execution any ATA/ATAPI command. + +- ATA_STATUS doesn't contain !BSY && DRDY && !DRQ while trying to + issue a command. + +- !BSY && !DRQ during PIO data transfer. + +- DRQ on command completion. + +- !BSY && ERR after CDB transfer starts but before the last byte of CDB + is transferred. ATA/ATAPI standard states that "The device shall not + terminate the PACKET command with an error before the last byte of + the command packet has been written" in the error outputs description + of PACKET command and the state diagram doesn't include such + transitions. + +In these cases, HSM is violated and not much information regarding the +error can be acquired from STATUS or ERROR register. IOW, this error can +be anything - driver bug, faulty device, controller and/or cable. + +As HSM is violated, reset is necessary to restore known state. +Reconfiguring transport for lower speed might be helpful too as +transmission errors sometimes cause this kind of errors. + +ATA/ATAPI device error (non-NCQ / non-CHECK CONDITION) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +These are errors detected and reported by ATA/ATAPI devices indicating +device problems. For this type of errors, STATUS and ERROR register +values are valid and describe error condition. Note that some of ATA bus +errors are detected by ATA/ATAPI devices and reported using the same +mechanism as device errors. Those cases are described later in this +section. + +For ATA commands, this type of errors are indicated by !BSY && ERR +during command execution and on completion. + +For ATAPI commands, + +- !BSY && ERR && ABRT right after issuing PACKET indicates that PACKET + command is not supported and falls in this category. + +- !BSY && ERR(==CHK) && !ABRT after the last byte of CDB is transferred + indicates CHECK CONDITION and doesn't fall in this category. + +- !BSY && ERR(==CHK) && ABRT after the last byte of CDB is transferred + \*probably\* indicates CHECK CONDITION and doesn't fall in this + category. + +Of errors detected as above, the following are not ATA/ATAPI device +errors but ATA bus errors and should be handled according to +`ATA bus error <#excatATAbusErr>`__. + +CRC error during data transfer + This is indicated by ICRC bit in the ERROR register and means that + corruption occurred during data transfer. Up to ATA/ATAPI-7, the + standard specifies that this bit is only applicable to UDMA + transfers but ATA/ATAPI-8 draft revision 1f says that the bit may be + applicable to multiword DMA and PIO. + +ABRT error during data transfer or on completion + Up to ATA/ATAPI-7, the standard specifies that ABRT could be set on + ICRC errors and on cases where a device is not able to complete a + command. Combined with the fact that MWDMA and PIO transfer errors + aren't allowed to use ICRC bit up to ATA/ATAPI-7, it seems to imply + that ABRT bit alone could indicate transfer errors. + + However, ATA/ATAPI-8 draft revision 1f removes the part that ICRC + errors can turn on ABRT. So, this is kind of gray area. Some + heuristics are needed here. + +ATA/ATAPI device errors can be further categorized as follows. + +Media errors + This is indicated by UNC bit in the ERROR register. ATA devices + reports UNC error only after certain number of retries cannot + recover the data, so there's nothing much else to do other than + notifying upper layer. + + READ and WRITE commands report CHS or LBA of the first failed sector + but ATA/ATAPI standard specifies that the amount of transferred data + on error completion is indeterminate, so we cannot assume that + sectors preceding the failed sector have been transferred and thus + cannot complete those sectors successfully as SCSI does. + +Media changed / media change requested error + <<TODO: fill here>> + +Address error + This is indicated by IDNF bit in the ERROR register. Report to upper + layer. + +Other errors + This can be invalid command or parameter indicated by ABRT ERROR bit + or some other error condition. Note that ABRT bit can indicate a lot + of things including ICRC and Address errors. Heuristics needed. + +Depending on commands, not all STATUS/ERROR bits are applicable. These +non-applicable bits are marked with "na" in the output descriptions but +up to ATA/ATAPI-7 no definition of "na" can be found. However, +ATA/ATAPI-8 draft revision 1f describes "N/A" as follows. + + 3.2.3.3a N/A + A keyword the indicates a field has no defined value in this + standard and should not be checked by the host or device. N/A + fields should be cleared to zero. + +So, it seems reasonable to assume that "na" bits are cleared to zero by +devices and thus need no explicit masking. + +ATAPI device CHECK CONDITION +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +ATAPI device CHECK CONDITION error is indicated by set CHK bit (ERR bit) +in the STATUS register after the last byte of CDB is transferred for a +PACKET command. For this kind of errors, sense data should be acquired +to gather information regarding the errors. REQUEST SENSE packet command +should be used to acquire sense data. + +Once sense data is acquired, this type of errors can be handled +similarly to other SCSI errors. Note that sense data may indicate ATA +bus error (e.g. Sense Key 04h HARDWARE ERROR && ASC/ASCQ 47h/00h SCSI +PARITY ERROR). In such cases, the error should be considered as an ATA +bus error and handled according to `ATA bus error <#excatATAbusErr>`__. + +ATA device error (NCQ) +~~~~~~~~~~~~~~~~~~~~~~ + +NCQ command error is indicated by cleared BSY and set ERR bit during NCQ +command phase (one or more NCQ commands outstanding). Although STATUS +and ERROR registers will contain valid values describing the error, READ +LOG EXT is required to clear the error condition, determine which +command has failed and acquire more information. + +READ LOG EXT Log Page 10h reports which tag has failed and taskfile +register values describing the error. With this information the failed +command can be handled as a normal ATA command error as in +`ATA/ATAPI device error (non-NCQ / non-CHECK CONDITION) <#excatDevErr>`__ +and all other in-flight commands must be retried. Note that this retry +should not be counted - it's likely that commands retried this way would +have completed normally if it were not for the failed command. + +Note that ATA bus errors can be reported as ATA device NCQ errors. This +should be handled as described in `ATA bus error <#excatATAbusErr>`__. + +If READ LOG EXT Log Page 10h fails or reports NQ, we're thoroughly +screwed. This condition should be treated according to +`HSM violation <#excatHSMviolation>`__. + +ATA bus error +~~~~~~~~~~~~~ + +ATA bus error means that data corruption occurred during transmission +over ATA bus (SATA or PATA). This type of errors can be indicated by + +- ICRC or ABRT error as described in + `ATA/ATAPI device error (non-NCQ / non-CHECK CONDITION) <#excatDevErr>`__. + +- Controller-specific error completion with error information + indicating transmission error. + +- On some controllers, command timeout. In this case, there may be a + mechanism to determine that the timeout is due to transmission error. + +- Unknown/random errors, timeouts and all sorts of weirdities. + +As described above, transmission errors can cause wide variety of +symptoms ranging from device ICRC error to random device lockup, and, +for many cases, there is no way to tell if an error condition is due to +transmission error or not; therefore, it's necessary to employ some kind +of heuristic when dealing with errors and timeouts. For example, +encountering repetitive ABRT errors for known supported command is +likely to indicate ATA bus error. + +Once it's determined that ATA bus errors have possibly occurred, +lowering ATA bus transmission speed is one of actions which may +alleviate the problem. See `Reconfigure transport <#exrecReconf>`__ for +more information. + +PCI bus error +~~~~~~~~~~~~~ + +Data corruption or other failures during transmission over PCI (or other +system bus). For standard BMDMA, this is indicated by Error bit in the +BMDMA Status register. This type of errors must be logged as it +indicates something is very wrong with the system. Resetting host +controller is recommended. + +Late completion +~~~~~~~~~~~~~~~ + +This occurs when timeout occurs and the timeout handler finds out that +the timed out command has completed successfully or with error. This is +usually caused by lost interrupts. This type of errors must be logged. +Resetting host controller is recommended. + +Unknown error (timeout) +~~~~~~~~~~~~~~~~~~~~~~~ + +This is when timeout occurs and the command is still processing or the +host and device are in unknown state. When this occurs, HSM could be in +any valid or invalid state. To bring the device to known state and make +it forget about the timed out command, resetting is necessary. The timed +out command may be retried. + +Timeouts can also be caused by transmission errors. Refer to +`ATA bus error <#excatATAbusErr>`__ for more details. + +Hotplug and power management exceptions +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +<<TODO: fill here>> + +EH recovery actions +------------------- + +This section discusses several important recovery actions. + +Clearing error condition +~~~~~~~~~~~~~~~~~~~~~~~~ + +Many controllers require its error registers to be cleared by error +handler. Different controllers may have different requirements. + +For SATA, it's strongly recommended to clear at least SError register +during error handling. + +Reset +~~~~~ + +During EH, resetting is necessary in the following cases. + +- HSM is in unknown or invalid state + +- HBA is in unknown or invalid state + +- EH needs to make HBA/device forget about in-flight commands + +- HBA/device behaves weirdly + +Resetting during EH might be a good idea regardless of error condition +to improve EH robustness. Whether to reset both or either one of HBA and +device depends on situation but the following scheme is recommended. + +- When it's known that HBA is in ready state but ATA/ATAPI device is in + unknown state, reset only device. + +- If HBA is in unknown state, reset both HBA and device. + +HBA resetting is implementation specific. For a controller complying to +taskfile/BMDMA PCI IDE, stopping active DMA transaction may be +sufficient iff BMDMA state is the only HBA context. But even mostly +taskfile/BMDMA PCI IDE complying controllers may have implementation +specific requirements and mechanism to reset themselves. This must be +addressed by specific drivers. + +OTOH, ATA/ATAPI standard describes in detail ways to reset ATA/ATAPI +devices. + +PATA hardware reset + This is hardware initiated device reset signalled with asserted PATA + RESET- signal. There is no standard way to initiate hardware reset + from software although some hardware provides registers that allow + driver to directly tweak the RESET- signal. + +Software reset + This is achieved by turning CONTROL SRST bit on for at least 5us. + Both PATA and SATA support it but, in case of SATA, this may require + controller-specific support as the second Register FIS to clear SRST + should be transmitted while BSY bit is still set. Note that on PATA, + this resets both master and slave devices on a channel. + +EXECUTE DEVICE DIAGNOSTIC command + Although ATA/ATAPI standard doesn't describe exactly, EDD implies + some level of resetting, possibly similar level with software reset. + Host-side EDD protocol can be handled with normal command processing + and most SATA controllers should be able to handle EDD's just like + other commands. As in software reset, EDD affects both devices on a + PATA bus. + + Although EDD does reset devices, this doesn't suit error handling as + EDD cannot be issued while BSY is set and it's unclear how it will + act when device is in unknown/weird state. + +ATAPI DEVICE RESET command + This is very similar to software reset except that reset can be + restricted to the selected device without affecting the other device + sharing the cable. + +SATA phy reset + This is the preferred way of resetting a SATA device. In effect, + it's identical to PATA hardware reset. Note that this can be done + with the standard SCR Control register. As such, it's usually easier + to implement than software reset. + +One more thing to consider when resetting devices is that resetting +clears certain configuration parameters and they need to be set to their +previous or newly adjusted values after reset. + +Parameters affected are. + +- CHS set up with INITIALIZE DEVICE PARAMETERS (seldom used) + +- Parameters set with SET FEATURES including transfer mode setting + +- Block count set with SET MULTIPLE MODE + +- Other parameters (SET MAX, MEDIA LOCK...) + +ATA/ATAPI standard specifies that some parameters must be maintained +across hardware or software reset, but doesn't strictly specify all of +them. Always reconfiguring needed parameters after reset is required for +robustness. Note that this also applies when resuming from deep sleep +(power-off). + +Also, ATA/ATAPI standard requires that IDENTIFY DEVICE / IDENTIFY PACKET +DEVICE is issued after any configuration parameter is updated or a +hardware reset and the result used for further operation. OS driver is +required to implement revalidation mechanism to support this. + +Reconfigure transport +~~~~~~~~~~~~~~~~~~~~~ + +For both PATA and SATA, a lot of corners are cut for cheap connectors, +cables or controllers and it's quite common to see high transmission +error rate. This can be mitigated by lowering transmission speed. + +The following is a possible scheme Jeff Garzik suggested. + + If more than $N (3?) transmission errors happen in 15 minutes, + + - if SATA, decrease SATA PHY speed. if speed cannot be decreased, + + - decrease UDMA xfer speed. if at UDMA0, switch to PIO4, + + - decrease PIO xfer speed. if at PIO3, complain, but continue + +ata_piix Internals +=================== + +.. kernel-doc:: drivers/ata/ata_piix.c + :internal: + +sata_sil Internals +=================== + +.. kernel-doc:: drivers/ata/sata_sil.c + :internal: + +Thanks +====== + +The bulk of the ATA knowledge comes thanks to long conversations with +Andre Hedrick (www.linux-ide.org), and long hours pondering the ATA and +SCSI specifications. + +Thanks to Alan Cox for pointing out similarities between SATA and SCSI, +and in general for motivation to hack on libata. + +libata's device detection method, ata_pio_devchk, and in general all +the early probing was based on extensive study of Hale Landis's +probe/reset code in his ATADRVR driver (www.ata-atapi.com). diff --git a/Documentation/driver-api/mtdnand.rst b/Documentation/driver-api/mtdnand.rst new file mode 100644 index 000000000000..e9afa586d15e --- /dev/null +++ b/Documentation/driver-api/mtdnand.rst @@ -0,0 +1,1007 @@ +===================================== +MTD NAND Driver Programming Interface +===================================== + +:Author: Thomas Gleixner + +Introduction +============ + +The generic NAND driver supports almost all NAND and AG-AND based chips +and connects them to the Memory Technology Devices (MTD) subsystem of +the Linux Kernel. + +This documentation is provided for developers who want to implement +board drivers or filesystem drivers suitable for NAND devices. + +Known Bugs And Assumptions +========================== + +None. + +Documentation hints +=================== + +The function and structure docs are autogenerated. Each function and +struct member has a short description which is marked with an [XXX] +identifier. The following chapters explain the meaning of those +identifiers. + +Function identifiers [XXX] +-------------------------- + +The functions are marked with [XXX] identifiers in the short comment. +The identifiers explain the usage and scope of the functions. Following +identifiers are used: + +- [MTD Interface] + + These functions provide the interface to the MTD kernel API. They are + not replaceable and provide functionality which is complete hardware + independent. + +- [NAND Interface] + + These functions are exported and provide the interface to the NAND + kernel API. + +- [GENERIC] + + Generic functions are not replaceable and provide functionality which + is complete hardware independent. + +- [DEFAULT] + + Default functions provide hardware related functionality which is + suitable for most of the implementations. These functions can be + replaced by the board driver if necessary. Those functions are called + via pointers in the NAND chip description structure. The board driver + can set the functions which should be replaced by board dependent + functions before calling nand_scan(). If the function pointer is + NULL on entry to nand_scan() then the pointer is set to the default + function which is suitable for the detected chip type. + +Struct member identifiers [XXX] +------------------------------- + +The struct members are marked with [XXX] identifiers in the comment. The +identifiers explain the usage and scope of the members. Following +identifiers are used: + +- [INTERN] + + These members are for NAND driver internal use only and must not be + modified. Most of these values are calculated from the chip geometry + information which is evaluated during nand_scan(). + +- [REPLACEABLE] + + Replaceable members hold hardware related functions which can be + provided by the board driver. The board driver can set the functions + which should be replaced by board dependent functions before calling + nand_scan(). If the function pointer is NULL on entry to + nand_scan() then the pointer is set to the default function which is + suitable for the detected chip type. + +- [BOARDSPECIFIC] + + Board specific members hold hardware related information which must + be provided by the board driver. The board driver must set the + function pointers and datafields before calling nand_scan(). + +- [OPTIONAL] + + Optional members can hold information relevant for the board driver. + The generic NAND driver code does not use this information. + +Basic board driver +================== + +For most boards it will be sufficient to provide just the basic +functions and fill out some really board dependent members in the nand +chip description structure. + +Basic defines +------------- + +At least you have to provide a nand_chip structure and a storage for +the ioremap'ed chip address. You can allocate the nand_chip structure +using kmalloc or you can allocate it statically. The NAND chip structure +embeds an mtd structure which will be registered to the MTD subsystem. +You can extract a pointer to the mtd structure from a nand_chip pointer +using the nand_to_mtd() helper. + +Kmalloc based example + +:: + + static struct mtd_info *board_mtd; + static void __iomem *baseaddr; + + +Static example + +:: + + static struct nand_chip board_chip; + static void __iomem *baseaddr; + + +Partition defines +----------------- + +If you want to divide your device into partitions, then define a +partitioning scheme suitable to your board. + +:: + + #define NUM_PARTITIONS 2 + static struct mtd_partition partition_info[] = { + { .name = "Flash partition 1", + .offset = 0, + .size = 8 * 1024 * 1024 }, + { .name = "Flash partition 2", + .offset = MTDPART_OFS_NEXT, + .size = MTDPART_SIZ_FULL }, + }; + + +Hardware control function +------------------------- + +The hardware control function provides access to the control pins of the +NAND chip(s). The access can be done by GPIO pins or by address lines. +If you use address lines, make sure that the timing requirements are +met. + +*GPIO based example* + +:: + + static void board_hwcontrol(struct mtd_info *mtd, int cmd) + { + switch(cmd){ + case NAND_CTL_SETCLE: /* Set CLE pin high */ break; + case NAND_CTL_CLRCLE: /* Set CLE pin low */ break; + case NAND_CTL_SETALE: /* Set ALE pin high */ break; + case NAND_CTL_CLRALE: /* Set ALE pin low */ break; + case NAND_CTL_SETNCE: /* Set nCE pin low */ break; + case NAND_CTL_CLRNCE: /* Set nCE pin high */ break; + } + } + + +*Address lines based example.* It's assumed that the nCE pin is driven +by a chip select decoder. + +:: + + static void board_hwcontrol(struct mtd_info *mtd, int cmd) + { + struct nand_chip *this = mtd_to_nand(mtd); + switch(cmd){ + case NAND_CTL_SETCLE: this->IO_ADDR_W |= CLE_ADRR_BIT; break; + case NAND_CTL_CLRCLE: this->IO_ADDR_W &= ~CLE_ADRR_BIT; break; + case NAND_CTL_SETALE: this->IO_ADDR_W |= ALE_ADRR_BIT; break; + case NAND_CTL_CLRALE: this->IO_ADDR_W &= ~ALE_ADRR_BIT; break; + } + } + + +Device ready function +--------------------- + +If the hardware interface has the ready busy pin of the NAND chip +connected to a GPIO or other accessible I/O pin, this function is used +to read back the state of the pin. The function has no arguments and +should return 0, if the device is busy (R/B pin is low) and 1, if the +device is ready (R/B pin is high). If the hardware interface does not +give access to the ready busy pin, then the function must not be defined +and the function pointer this->dev_ready is set to NULL. + +Init function +------------- + +The init function allocates memory and sets up all the board specific +parameters and function pointers. When everything is set up nand_scan() +is called. This function tries to detect and identify then chip. If a +chip is found all the internal data fields are initialized accordingly. +The structure(s) have to be zeroed out first and then filled with the +necessary information about the device. + +:: + + static int __init board_init (void) + { + struct nand_chip *this; + int err = 0; + + /* Allocate memory for MTD device structure and private data */ + this = kzalloc(sizeof(struct nand_chip), GFP_KERNEL); + if (!this) { + printk ("Unable to allocate NAND MTD device structure.\n"); + err = -ENOMEM; + goto out; + } + + board_mtd = nand_to_mtd(this); + + /* map physical address */ + baseaddr = ioremap(CHIP_PHYSICAL_ADDRESS, 1024); + if (!baseaddr) { + printk("Ioremap to access NAND chip failed\n"); + err = -EIO; + goto out_mtd; + } + + /* Set address of NAND IO lines */ + this->IO_ADDR_R = baseaddr; + this->IO_ADDR_W = baseaddr; + /* Reference hardware control function */ + this->hwcontrol = board_hwcontrol; + /* Set command delay time, see datasheet for correct value */ + this->chip_delay = CHIP_DEPENDEND_COMMAND_DELAY; + /* Assign the device ready function, if available */ + this->dev_ready = board_dev_ready; + this->eccmode = NAND_ECC_SOFT; + + /* Scan to find existence of the device */ + if (nand_scan (board_mtd, 1)) { + err = -ENXIO; + goto out_ior; + } + + add_mtd_partitions(board_mtd, partition_info, NUM_PARTITIONS); + goto out; + + out_ior: + iounmap(baseaddr); + out_mtd: + kfree (this); + out: + return err; + } + module_init(board_init); + + +Exit function +------------- + +The exit function is only necessary if the driver is compiled as a +module. It releases all resources which are held by the chip driver and +unregisters the partitions in the MTD layer. + +:: + + #ifdef MODULE + static void __exit board_cleanup (void) + { + /* Release resources, unregister device */ + nand_release (board_mtd); + + /* unmap physical address */ + iounmap(baseaddr); + + /* Free the MTD device structure */ + kfree (mtd_to_nand(board_mtd)); + } + module_exit(board_cleanup); + #endif + + +Advanced board driver functions +=============================== + +This chapter describes the advanced functionality of the NAND driver. +For a list of functions which can be overridden by the board driver see +the documentation of the nand_chip structure. + +Multiple chip control +--------------------- + +The nand driver can control chip arrays. Therefore the board driver must +provide an own select_chip function. This function must (de)select the +requested chip. The function pointer in the nand_chip structure must be +set before calling nand_scan(). The maxchip parameter of nand_scan() +defines the maximum number of chips to scan for. Make sure that the +select_chip function can handle the requested number of chips. + +The nand driver concatenates the chips to one virtual chip and provides +this virtual chip to the MTD layer. + +*Note: The driver can only handle linear chip arrays of equally sized +chips. There is no support for parallel arrays which extend the +buswidth.* + +*GPIO based example* + +:: + + static void board_select_chip (struct mtd_info *mtd, int chip) + { + /* Deselect all chips, set all nCE pins high */ + GPIO(BOARD_NAND_NCE) |= 0xff; + if (chip >= 0) + GPIO(BOARD_NAND_NCE) &= ~ (1 << chip); + } + + +*Address lines based example.* Its assumed that the nCE pins are +connected to an address decoder. + +:: + + static void board_select_chip (struct mtd_info *mtd, int chip) + { + struct nand_chip *this = mtd_to_nand(mtd); + + /* Deselect all chips */ + this->IO_ADDR_R &= ~BOARD_NAND_ADDR_MASK; + this->IO_ADDR_W &= ~BOARD_NAND_ADDR_MASK; + switch (chip) { + case 0: + this->IO_ADDR_R |= BOARD_NAND_ADDR_CHIP0; + this->IO_ADDR_W |= BOARD_NAND_ADDR_CHIP0; + break; + .... + case n: + this->IO_ADDR_R |= BOARD_NAND_ADDR_CHIPn; + this->IO_ADDR_W |= BOARD_NAND_ADDR_CHIPn; + break; + } + } + + +Hardware ECC support +-------------------- + +Functions and constants +~~~~~~~~~~~~~~~~~~~~~~~ + +The nand driver supports three different types of hardware ECC. + +- NAND_ECC_HW3_256 + + Hardware ECC generator providing 3 bytes ECC per 256 byte. + +- NAND_ECC_HW3_512 + + Hardware ECC generator providing 3 bytes ECC per 512 byte. + +- NAND_ECC_HW6_512 + + Hardware ECC generator providing 6 bytes ECC per 512 byte. + +- NAND_ECC_HW8_512 + + Hardware ECC generator providing 6 bytes ECC per 512 byte. + +If your hardware generator has a different functionality add it at the +appropriate place in nand_base.c + +The board driver must provide following functions: + +- enable_hwecc + + This function is called before reading / writing to the chip. Reset + or initialize the hardware generator in this function. The function + is called with an argument which let you distinguish between read and + write operations. + +- calculate_ecc + + This function is called after read / write from / to the chip. + Transfer the ECC from the hardware to the buffer. If the option + NAND_HWECC_SYNDROME is set then the function is only called on + write. See below. + +- correct_data + + In case of an ECC error this function is called for error detection + and correction. Return 1 respectively 2 in case the error can be + corrected. If the error is not correctable return -1. If your + hardware generator matches the default algorithm of the nand_ecc + software generator then use the correction function provided by + nand_ecc instead of implementing duplicated code. + +Hardware ECC with syndrome calculation +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Many hardware ECC implementations provide Reed-Solomon codes and +calculate an error syndrome on read. The syndrome must be converted to a +standard Reed-Solomon syndrome before calling the error correction code +in the generic Reed-Solomon library. + +The ECC bytes must be placed immediately after the data bytes in order +to make the syndrome generator work. This is contrary to the usual +layout used by software ECC. The separation of data and out of band area +is not longer possible. The nand driver code handles this layout and the +remaining free bytes in the oob area are managed by the autoplacement +code. Provide a matching oob-layout in this case. See rts_from4.c and +diskonchip.c for implementation reference. In those cases we must also +use bad block tables on FLASH, because the ECC layout is interfering +with the bad block marker positions. See bad block table support for +details. + +Bad block table support +----------------------- + +Most NAND chips mark the bad blocks at a defined position in the spare +area. Those blocks must not be erased under any circumstances as the bad +block information would be lost. It is possible to check the bad block +mark each time when the blocks are accessed by reading the spare area of +the first page in the block. This is time consuming so a bad block table +is used. + +The nand driver supports various types of bad block tables. + +- Per device + + The bad block table contains all bad block information of the device + which can consist of multiple chips. + +- Per chip + + A bad block table is used per chip and contains the bad block + information for this particular chip. + +- Fixed offset + + The bad block table is located at a fixed offset in the chip + (device). This applies to various DiskOnChip devices. + +- Automatic placed + + The bad block table is automatically placed and detected either at + the end or at the beginning of a chip (device) + +- Mirrored tables + + The bad block table is mirrored on the chip (device) to allow updates + of the bad block table without data loss. + +nand_scan() calls the function nand_default_bbt(). +nand_default_bbt() selects appropriate default bad block table +descriptors depending on the chip information which was retrieved by +nand_scan(). + +The standard policy is scanning the device for bad blocks and build a +ram based bad block table which allows faster access than always +checking the bad block information on the flash chip itself. + +Flash based tables +~~~~~~~~~~~~~~~~~~ + +It may be desired or necessary to keep a bad block table in FLASH. For +AG-AND chips this is mandatory, as they have no factory marked bad +blocks. They have factory marked good blocks. The marker pattern is +erased when the block is erased to be reused. So in case of powerloss +before writing the pattern back to the chip this block would be lost and +added to the bad blocks. Therefore we scan the chip(s) when we detect +them the first time for good blocks and store this information in a bad +block table before erasing any of the blocks. + +The blocks in which the tables are stored are protected against +accidental access by marking them bad in the memory bad block table. The +bad block table management functions are allowed to circumvent this +protection. + +The simplest way to activate the FLASH based bad block table support is +to set the option NAND_BBT_USE_FLASH in the bbt_option field of the +nand chip structure before calling nand_scan(). For AG-AND chips is +this done by default. This activates the default FLASH based bad block +table functionality of the NAND driver. The default bad block table +options are + +- Store bad block table per chip + +- Use 2 bits per block + +- Automatic placement at the end of the chip + +- Use mirrored tables with version numbers + +- Reserve 4 blocks at the end of the chip + +User defined tables +~~~~~~~~~~~~~~~~~~~ + +User defined tables are created by filling out a nand_bbt_descr +structure and storing the pointer in the nand_chip structure member +bbt_td before calling nand_scan(). If a mirror table is necessary a +second structure must be created and a pointer to this structure must be +stored in bbt_md inside the nand_chip structure. If the bbt_md member +is set to NULL then only the main table is used and no scan for the +mirrored table is performed. + +The most important field in the nand_bbt_descr structure is the +options field. The options define most of the table properties. Use the +predefined constants from nand.h to define the options. + +- Number of bits per block + + The supported number of bits is 1, 2, 4, 8. + +- Table per chip + + Setting the constant NAND_BBT_PERCHIP selects that a bad block + table is managed for each chip in a chip array. If this option is not + set then a per device bad block table is used. + +- Table location is absolute + + Use the option constant NAND_BBT_ABSPAGE and define the absolute + page number where the bad block table starts in the field pages. If + you have selected bad block tables per chip and you have a multi chip + array then the start page must be given for each chip in the chip + array. Note: there is no scan for a table ident pattern performed, so + the fields pattern, veroffs, offs, len can be left uninitialized + +- Table location is automatically detected + + The table can either be located in the first or the last good blocks + of the chip (device). Set NAND_BBT_LASTBLOCK to place the bad block + table at the end of the chip (device). The bad block tables are + marked and identified by a pattern which is stored in the spare area + of the first page in the block which holds the bad block table. Store + a pointer to the pattern in the pattern field. Further the length of + the pattern has to be stored in len and the offset in the spare area + must be given in the offs member of the nand_bbt_descr structure. + For mirrored bad block tables different patterns are mandatory. + +- Table creation + + Set the option NAND_BBT_CREATE to enable the table creation if no + table can be found during the scan. Usually this is done only once if + a new chip is found. + +- Table write support + + Set the option NAND_BBT_WRITE to enable the table write support. + This allows the update of the bad block table(s) in case a block has + to be marked bad due to wear. The MTD interface function + block_markbad is calling the update function of the bad block table. + If the write support is enabled then the table is updated on FLASH. + + Note: Write support should only be enabled for mirrored tables with + version control. + +- Table version control + + Set the option NAND_BBT_VERSION to enable the table version + control. It's highly recommended to enable this for mirrored tables + with write support. It makes sure that the risk of losing the bad + block table information is reduced to the loss of the information + about the one worn out block which should be marked bad. The version + is stored in 4 consecutive bytes in the spare area of the device. The + position of the version number is defined by the member veroffs in + the bad block table descriptor. + +- Save block contents on write + + In case that the block which holds the bad block table does contain + other useful information, set the option NAND_BBT_SAVECONTENT. When + the bad block table is written then the whole block is read the bad + block table is updated and the block is erased and everything is + written back. If this option is not set only the bad block table is + written and everything else in the block is ignored and erased. + +- Number of reserved blocks + + For automatic placement some blocks must be reserved for bad block + table storage. The number of reserved blocks is defined in the + maxblocks member of the bad block table description structure. + Reserving 4 blocks for mirrored tables should be a reasonable number. + This also limits the number of blocks which are scanned for the bad + block table ident pattern. + +Spare area (auto)placement +-------------------------- + +The nand driver implements different possibilities for placement of +filesystem data in the spare area, + +- Placement defined by fs driver + +- Automatic placement + +The default placement function is automatic placement. The nand driver +has built in default placement schemes for the various chiptypes. If due +to hardware ECC functionality the default placement does not fit then +the board driver can provide a own placement scheme. + +File system drivers can provide a own placement scheme which is used +instead of the default placement scheme. + +Placement schemes are defined by a nand_oobinfo structure + +:: + + struct nand_oobinfo { + int useecc; + int eccbytes; + int eccpos[24]; + int oobfree[8][2]; + }; + + +- useecc + + The useecc member controls the ecc and placement function. The header + file include/mtd/mtd-abi.h contains constants to select ecc and + placement. MTD_NANDECC_OFF switches off the ecc complete. This is + not recommended and available for testing and diagnosis only. + MTD_NANDECC_PLACE selects caller defined placement, + MTD_NANDECC_AUTOPLACE selects automatic placement. + +- eccbytes + + The eccbytes member defines the number of ecc bytes per page. + +- eccpos + + The eccpos array holds the byte offsets in the spare area where the + ecc codes are placed. + +- oobfree + + The oobfree array defines the areas in the spare area which can be + used for automatic placement. The information is given in the format + {offset, size}. offset defines the start of the usable area, size the + length in bytes. More than one area can be defined. The list is + terminated by an {0, 0} entry. + +Placement defined by fs driver +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The calling function provides a pointer to a nand_oobinfo structure +which defines the ecc placement. For writes the caller must provide a +spare area buffer along with the data buffer. The spare area buffer size +is (number of pages) \* (size of spare area). For reads the buffer size +is (number of pages) \* ((size of spare area) + (number of ecc steps per +page) \* sizeof (int)). The driver stores the result of the ecc check +for each tuple in the spare buffer. The storage sequence is:: + + <spare data page 0><ecc result 0>...<ecc result n> + + ... + + <spare data page n><ecc result 0>...<ecc result n> + +This is a legacy mode used by YAFFS1. + +If the spare area buffer is NULL then only the ECC placement is done +according to the given scheme in the nand_oobinfo structure. + +Automatic placement +~~~~~~~~~~~~~~~~~~~ + +Automatic placement uses the built in defaults to place the ecc bytes in +the spare area. If filesystem data have to be stored / read into the +spare area then the calling function must provide a buffer. The buffer +size per page is determined by the oobfree array in the nand_oobinfo +structure. + +If the spare area buffer is NULL then only the ECC placement is done +according to the default builtin scheme. + +Spare area autoplacement default schemes +---------------------------------------- + +256 byte pagesize +~~~~~~~~~~~~~~~~~ + +======== ================== =================================================== +Offset Content Comment +======== ================== =================================================== +0x00 ECC byte 0 Error correction code byte 0 +0x01 ECC byte 1 Error correction code byte 1 +0x02 ECC byte 2 Error correction code byte 2 +0x03 Autoplace 0 +0x04 Autoplace 1 +0x05 Bad block marker If any bit in this byte is zero, then this + block is bad. This applies only to the first + page in a block. In the remaining pages this + byte is reserved +0x06 Autoplace 2 +0x07 Autoplace 3 +======== ================== =================================================== + +512 byte pagesize +~~~~~~~~~~~~~~~~~ + + +============= ================== ============================================== +Offset Content Comment +============= ================== ============================================== +0x00 ECC byte 0 Error correction code byte 0 of the lower + 256 Byte data in this page +0x01 ECC byte 1 Error correction code byte 1 of the lower + 256 Bytes of data in this page +0x02 ECC byte 2 Error correction code byte 2 of the lower + 256 Bytes of data in this page +0x03 ECC byte 3 Error correction code byte 0 of the upper + 256 Bytes of data in this page +0x04 reserved reserved +0x05 Bad block marker If any bit in this byte is zero, then this + block is bad. This applies only to the first + page in a block. In the remaining pages this + byte is reserved +0x06 ECC byte 4 Error correction code byte 1 of the upper + 256 Bytes of data in this page +0x07 ECC byte 5 Error correction code byte 2 of the upper + 256 Bytes of data in this page +0x08 - 0x0F Autoplace 0 - 7 +============= ================== ============================================== + +2048 byte pagesize +~~~~~~~~~~~~~~~~~~ + +=========== ================== ================================================ +Offset Content Comment +=========== ================== ================================================ +0x00 Bad block marker If any bit in this byte is zero, then this block + is bad. This applies only to the first page in a + block. In the remaining pages this byte is + reserved +0x01 Reserved Reserved +0x02-0x27 Autoplace 0 - 37 +0x28 ECC byte 0 Error correction code byte 0 of the first + 256 Byte data in this page +0x29 ECC byte 1 Error correction code byte 1 of the first + 256 Bytes of data in this page +0x2A ECC byte 2 Error correction code byte 2 of the first + 256 Bytes data in this page +0x2B ECC byte 3 Error correction code byte 0 of the second + 256 Bytes of data in this page +0x2C ECC byte 4 Error correction code byte 1 of the second + 256 Bytes of data in this page +0x2D ECC byte 5 Error correction code byte 2 of the second + 256 Bytes of data in this page +0x2E ECC byte 6 Error correction code byte 0 of the third + 256 Bytes of data in this page +0x2F ECC byte 7 Error correction code byte 1 of the third + 256 Bytes of data in this page +0x30 ECC byte 8 Error correction code byte 2 of the third + 256 Bytes of data in this page +0x31 ECC byte 9 Error correction code byte 0 of the fourth + 256 Bytes of data in this page +0x32 ECC byte 10 Error correction code byte 1 of the fourth + 256 Bytes of data in this page +0x33 ECC byte 11 Error correction code byte 2 of the fourth + 256 Bytes of data in this page +0x34 ECC byte 12 Error correction code byte 0 of the fifth + 256 Bytes of data in this page +0x35 ECC byte 13 Error correction code byte 1 of the fifth + 256 Bytes of data in this page +0x36 ECC byte 14 Error correction code byte 2 of the fifth + 256 Bytes of data in this page +0x37 ECC byte 15 Error correction code byte 0 of the sixth + 256 Bytes of data in this page +0x38 ECC byte 16 Error correction code byte 1 of the sixth + 256 Bytes of data in this page +0x39 ECC byte 17 Error correction code byte 2 of the sixth + 256 Bytes of data in this page +0x3A ECC byte 18 Error correction code byte 0 of the seventh + 256 Bytes of data in this page +0x3B ECC byte 19 Error correction code byte 1 of the seventh + 256 Bytes of data in this page +0x3C ECC byte 20 Error correction code byte 2 of the seventh + 256 Bytes of data in this page +0x3D ECC byte 21 Error correction code byte 0 of the eighth + 256 Bytes of data in this page +0x3E ECC byte 22 Error correction code byte 1 of the eighth + 256 Bytes of data in this page +0x3F ECC byte 23 Error correction code byte 2 of the eighth + 256 Bytes of data in this page +=========== ================== ================================================ + +Filesystem support +================== + +The NAND driver provides all necessary functions for a filesystem via +the MTD interface. + +Filesystems must be aware of the NAND peculiarities and restrictions. +One major restrictions of NAND Flash is, that you cannot write as often +as you want to a page. The consecutive writes to a page, before erasing +it again, are restricted to 1-3 writes, depending on the manufacturers +specifications. This applies similar to the spare area. + +Therefore NAND aware filesystems must either write in page size chunks +or hold a writebuffer to collect smaller writes until they sum up to +pagesize. Available NAND aware filesystems: JFFS2, YAFFS. + +The spare area usage to store filesystem data is controlled by the spare +area placement functionality which is described in one of the earlier +chapters. + +Tools +===== + +The MTD project provides a couple of helpful tools to handle NAND Flash. + +- flasherase, flasheraseall: Erase and format FLASH partitions + +- nandwrite: write filesystem images to NAND FLASH + +- nanddump: dump the contents of a NAND FLASH partitions + +These tools are aware of the NAND restrictions. Please use those tools +instead of complaining about errors which are caused by non NAND aware +access methods. + +Constants +========= + +This chapter describes the constants which might be relevant for a +driver developer. + +Chip option constants +--------------------- + +Constants for chip id table +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +These constants are defined in nand.h. They are OR-ed together to +describe the chip functionality:: + + /* Buswitdh is 16 bit */ + #define NAND_BUSWIDTH_16 0x00000002 + /* Device supports partial programming without padding */ + #define NAND_NO_PADDING 0x00000004 + /* Chip has cache program function */ + #define NAND_CACHEPRG 0x00000008 + /* Chip has copy back function */ + #define NAND_COPYBACK 0x00000010 + /* AND Chip which has 4 banks and a confusing page / block + * assignment. See Renesas datasheet for further information */ + #define NAND_IS_AND 0x00000020 + /* Chip has a array of 4 pages which can be read without + * additional ready /busy waits */ + #define NAND_4PAGE_ARRAY 0x00000040 + + +Constants for runtime options +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +These constants are defined in nand.h. They are OR-ed together to +describe the functionality:: + + /* The hw ecc generator provides a syndrome instead a ecc value on read + * This can only work if we have the ecc bytes directly behind the + * data bytes. Applies for DOC and AG-AND Renesas HW Reed Solomon generators */ + #define NAND_HWECC_SYNDROME 0x00020000 + + +ECC selection constants +----------------------- + +Use these constants to select the ECC algorithm:: + + /* No ECC. Usage is not recommended ! */ + #define NAND_ECC_NONE 0 + /* Software ECC 3 byte ECC per 256 Byte data */ + #define NAND_ECC_SOFT 1 + /* Hardware ECC 3 byte ECC per 256 Byte data */ + #define NAND_ECC_HW3_256 2 + /* Hardware ECC 3 byte ECC per 512 Byte data */ + #define NAND_ECC_HW3_512 3 + /* Hardware ECC 6 byte ECC per 512 Byte data */ + #define NAND_ECC_HW6_512 4 + /* Hardware ECC 6 byte ECC per 512 Byte data */ + #define NAND_ECC_HW8_512 6 + + +Hardware control related constants +---------------------------------- + +These constants describe the requested hardware access function when the +boardspecific hardware control function is called:: + + /* Select the chip by setting nCE to low */ + #define NAND_CTL_SETNCE 1 + /* Deselect the chip by setting nCE to high */ + #define NAND_CTL_CLRNCE 2 + /* Select the command latch by setting CLE to high */ + #define NAND_CTL_SETCLE 3 + /* Deselect the command latch by setting CLE to low */ + #define NAND_CTL_CLRCLE 4 + /* Select the address latch by setting ALE to high */ + #define NAND_CTL_SETALE 5 + /* Deselect the address latch by setting ALE to low */ + #define NAND_CTL_CLRALE 6 + /* Set write protection by setting WP to high. Not used! */ + #define NAND_CTL_SETWP 7 + /* Clear write protection by setting WP to low. Not used! */ + #define NAND_CTL_CLRWP 8 + + +Bad block table related constants +--------------------------------- + +These constants describe the options used for bad block table +descriptors:: + + /* Options for the bad block table descriptors */ + + /* The number of bits used per block in the bbt on the device */ + #define NAND_BBT_NRBITS_MSK 0x0000000F + #define NAND_BBT_1BIT 0x00000001 + #define NAND_BBT_2BIT 0x00000002 + #define NAND_BBT_4BIT 0x00000004 + #define NAND_BBT_8BIT 0x00000008 + /* The bad block table is in the last good block of the device */ + #define NAND_BBT_LASTBLOCK 0x00000010 + /* The bbt is at the given page, else we must scan for the bbt */ + #define NAND_BBT_ABSPAGE 0x00000020 + /* bbt is stored per chip on multichip devices */ + #define NAND_BBT_PERCHIP 0x00000080 + /* bbt has a version counter at offset veroffs */ + #define NAND_BBT_VERSION 0x00000100 + /* Create a bbt if none axists */ + #define NAND_BBT_CREATE 0x00000200 + /* Write bbt if necessary */ + #define NAND_BBT_WRITE 0x00001000 + /* Read and write back block contents when writing bbt */ + #define NAND_BBT_SAVECONTENT 0x00002000 + + +Structures +========== + +This chapter contains the autogenerated documentation of the structures +which are used in the NAND driver and might be relevant for a driver +developer. Each struct member has a short description which is marked +with an [XXX] identifier. See the chapter "Documentation hints" for an +explanation. + +.. kernel-doc:: include/linux/mtd/nand.h + :internal: + +Public Functions Provided +========================= + +This chapter contains the autogenerated documentation of the NAND kernel +API functions which are exported. Each function has a short description +which is marked with an [XXX] identifier. See the chapter "Documentation +hints" for an explanation. + +.. kernel-doc:: drivers/mtd/nand/nand_base.c + :export: + +.. kernel-doc:: drivers/mtd/nand/nand_ecc.c + :export: + +Internal Functions Provided +=========================== + +This chapter contains the autogenerated documentation of the NAND driver +internal functions. Each function has a short description which is +marked with an [XXX] identifier. See the chapter "Documentation hints" +for an explanation. The functions marked with [DEFAULT] might be +relevant for a board driver developer. + +.. kernel-doc:: drivers/mtd/nand/nand_base.c + :internal: + +.. kernel-doc:: drivers/mtd/nand/nand_bbt.c + :internal: + +Credits +======= + +The following people have contributed to the NAND driver: + +1. Steven J. Hill\ sjhill@realitydiluted.com + +2. David Woodhouse\ dwmw2@infradead.org + +3. Thomas Gleixner\ tglx@linutronix.de + +A lot of users have provided bugfixes, improvements and helping hands +for testing. Thanks a lot. + +The following people have contributed to this document: + +1. Thomas Gleixner\ tglx@linutronix.de diff --git a/Documentation/pinctrl.txt b/Documentation/driver-api/pinctl.rst index f2af35f6d6b2..48f15b4f9d3e 100644 --- a/Documentation/pinctrl.txt +++ b/Documentation/driver-api/pinctl.rst @@ -1,4 +1,7 @@ +=============================== PINCTRL (PIN CONTROL) subsystem +=============================== + This document outlines the pin control subsystem in Linux This subsystem deals with: @@ -33,7 +36,7 @@ When a PIN CONTROLLER is instantiated, it will register a descriptor to the pin control framework, and this descriptor contains an array of pin descriptors describing the pins handled by this specific pin controller. -Here is an example of a PGA (Pin Grid Array) chip seen from underneath: +Here is an example of a PGA (Pin Grid Array) chip seen from underneath:: A B C D E F G H @@ -54,39 +57,40 @@ Here is an example of a PGA (Pin Grid Array) chip seen from underneath: 1 o o o o o o o o To register a pin controller and name all the pins on this package we can do -this in our driver: - -#include <linux/pinctrl/pinctrl.h> - -const struct pinctrl_pin_desc foo_pins[] = { - PINCTRL_PIN(0, "A8"), - PINCTRL_PIN(1, "B8"), - PINCTRL_PIN(2, "C8"), - ... - PINCTRL_PIN(61, "F1"), - PINCTRL_PIN(62, "G1"), - PINCTRL_PIN(63, "H1"), -}; +this in our driver:: -static struct pinctrl_desc foo_desc = { - .name = "foo", - .pins = foo_pins, - .npins = ARRAY_SIZE(foo_pins), - .owner = THIS_MODULE, -}; + #include <linux/pinctrl/pinctrl.h> -int __init foo_probe(void) -{ - int error; + const struct pinctrl_pin_desc foo_pins[] = { + PINCTRL_PIN(0, "A8"), + PINCTRL_PIN(1, "B8"), + PINCTRL_PIN(2, "C8"), + ... + PINCTRL_PIN(61, "F1"), + PINCTRL_PIN(62, "G1"), + PINCTRL_PIN(63, "H1"), + }; + + static struct pinctrl_desc foo_desc = { + .name = "foo", + .pins = foo_pins, + .npins = ARRAY_SIZE(foo_pins), + .owner = THIS_MODULE, + }; + + int __init foo_probe(void) + { + int error; - struct pinctrl_dev *pctl; + struct pinctrl_dev *pctl; - error = pinctrl_register_and_init(&foo_desc, <PARENT>, NULL, &pctl); - if (error) - return error; + error = pinctrl_register_and_init(&foo_desc, <PARENT>, + NULL, &pctl); + if (error) + return error; - return pinctrl_enable(pctl); -} + return pinctrl_enable(pctl); + } To enable the pinctrl subsystem and the subgroups for PINMUX and PINCONF and selected drivers, you need to select them from your machine's Kconfig entry, @@ -105,7 +109,7 @@ the pin controller. For a padring with 467 pads, as opposed to actual pins, I used an enumeration like this, walking around the edge of the chip, which seems to be industry -standard too (all these pads had names, too): +standard too (all these pads had names, too):: 0 ..... 104 @@ -128,64 +132,64 @@ on { 0, 8, 16, 24 }, and a group of pins dealing with an I2C interface on pins on { 24, 25 }. These two groups are presented to the pin control subsystem by implementing -some generic pinctrl_ops like this: - -#include <linux/pinctrl/pinctrl.h> - -struct foo_group { - const char *name; - const unsigned int *pins; - const unsigned num_pins; -}; - -static const unsigned int spi0_pins[] = { 0, 8, 16, 24 }; -static const unsigned int i2c0_pins[] = { 24, 25 }; - -static const struct foo_group foo_groups[] = { - { - .name = "spi0_grp", - .pins = spi0_pins, - .num_pins = ARRAY_SIZE(spi0_pins), - }, +some generic pinctrl_ops like this:: + + #include <linux/pinctrl/pinctrl.h> + + struct foo_group { + const char *name; + const unsigned int *pins; + const unsigned num_pins; + }; + + static const unsigned int spi0_pins[] = { 0, 8, 16, 24 }; + static const unsigned int i2c0_pins[] = { 24, 25 }; + + static const struct foo_group foo_groups[] = { + { + .name = "spi0_grp", + .pins = spi0_pins, + .num_pins = ARRAY_SIZE(spi0_pins), + }, + { + .name = "i2c0_grp", + .pins = i2c0_pins, + .num_pins = ARRAY_SIZE(i2c0_pins), + }, + }; + + + static int foo_get_groups_count(struct pinctrl_dev *pctldev) { - .name = "i2c0_grp", - .pins = i2c0_pins, - .num_pins = ARRAY_SIZE(i2c0_pins), - }, -}; - - -static int foo_get_groups_count(struct pinctrl_dev *pctldev) -{ - return ARRAY_SIZE(foo_groups); -} + return ARRAY_SIZE(foo_groups); + } -static const char *foo_get_group_name(struct pinctrl_dev *pctldev, - unsigned selector) -{ - return foo_groups[selector].name; -} + static const char *foo_get_group_name(struct pinctrl_dev *pctldev, + unsigned selector) + { + return foo_groups[selector].name; + } -static int foo_get_group_pins(struct pinctrl_dev *pctldev, unsigned selector, - const unsigned **pins, - unsigned *num_pins) -{ - *pins = (unsigned *) foo_groups[selector].pins; - *num_pins = foo_groups[selector].num_pins; - return 0; -} + static int foo_get_group_pins(struct pinctrl_dev *pctldev, unsigned selector, + const unsigned **pins, + unsigned *num_pins) + { + *pins = (unsigned *) foo_groups[selector].pins; + *num_pins = foo_groups[selector].num_pins; + return 0; + } -static struct pinctrl_ops foo_pctrl_ops = { - .get_groups_count = foo_get_groups_count, - .get_group_name = foo_get_group_name, - .get_group_pins = foo_get_group_pins, -}; + static struct pinctrl_ops foo_pctrl_ops = { + .get_groups_count = foo_get_groups_count, + .get_group_name = foo_get_group_name, + .get_group_pins = foo_get_group_pins, + }; -static struct pinctrl_desc foo_desc = { - ... - .pctlops = &foo_pctrl_ops, -}; + static struct pinctrl_desc foo_desc = { + ... + .pctlops = &foo_pctrl_ops, + }; The pin control subsystem will call the .get_groups_count() function to determine the total number of legal selectors, then it will call the other functions @@ -213,62 +217,62 @@ The format and meaning of the configuration parameter, PLATFORM_X_PULL_UP above, is entirely defined by the pin controller driver. The pin configuration driver implements callbacks for changing pin -configuration in the pin controller ops like this: +configuration in the pin controller ops like this:: -#include <linux/pinctrl/pinctrl.h> -#include <linux/pinctrl/pinconf.h> -#include "platform_x_pindefs.h" + #include <linux/pinctrl/pinctrl.h> + #include <linux/pinctrl/pinconf.h> + #include "platform_x_pindefs.h" -static int foo_pin_config_get(struct pinctrl_dev *pctldev, - unsigned offset, - unsigned long *config) -{ - struct my_conftype conf; + static int foo_pin_config_get(struct pinctrl_dev *pctldev, + unsigned offset, + unsigned long *config) + { + struct my_conftype conf; - ... Find setting for pin @ offset ... + ... Find setting for pin @ offset ... - *config = (unsigned long) conf; -} + *config = (unsigned long) conf; + } -static int foo_pin_config_set(struct pinctrl_dev *pctldev, - unsigned offset, - unsigned long config) -{ - struct my_conftype *conf = (struct my_conftype *) config; + static int foo_pin_config_set(struct pinctrl_dev *pctldev, + unsigned offset, + unsigned long config) + { + struct my_conftype *conf = (struct my_conftype *) config; - switch (conf) { - case PLATFORM_X_PULL_UP: - ... + switch (conf) { + case PLATFORM_X_PULL_UP: + ... + } } } -} -static int foo_pin_config_group_get (struct pinctrl_dev *pctldev, - unsigned selector, - unsigned long *config) -{ - ... -} + static int foo_pin_config_group_get (struct pinctrl_dev *pctldev, + unsigned selector, + unsigned long *config) + { + ... + } -static int foo_pin_config_group_set (struct pinctrl_dev *pctldev, - unsigned selector, - unsigned long config) -{ - ... -} + static int foo_pin_config_group_set (struct pinctrl_dev *pctldev, + unsigned selector, + unsigned long config) + { + ... + } -static struct pinconf_ops foo_pconf_ops = { - .pin_config_get = foo_pin_config_get, - .pin_config_set = foo_pin_config_set, - .pin_config_group_get = foo_pin_config_group_get, - .pin_config_group_set = foo_pin_config_group_set, -}; + static struct pinconf_ops foo_pconf_ops = { + .pin_config_get = foo_pin_config_get, + .pin_config_set = foo_pin_config_set, + .pin_config_group_get = foo_pin_config_group_get, + .pin_config_group_set = foo_pin_config_group_set, + }; -/* Pin config operations are handled by some pin controller */ -static struct pinctrl_desc foo_desc = { - ... - .confops = &foo_pconf_ops, -}; + /* Pin config operations are handled by some pin controller */ + static struct pinctrl_desc foo_desc = { + ... + .confops = &foo_pconf_ops, + }; Since some controllers have special logic for handling entire groups of pins they can exploit the special whole-group pin control function. The @@ -296,35 +300,35 @@ controller handles control of a certain GPIO pin. Since a single pin controller may be muxing several GPIO ranges (typically SoCs that have one set of pins, but internally several GPIO silicon blocks, each modelled as a struct gpio_chip) any number of GPIO ranges can be added to a pin controller instance -like this: - -struct gpio_chip chip_a; -struct gpio_chip chip_b; - -static struct pinctrl_gpio_range gpio_range_a = { - .name = "chip a", - .id = 0, - .base = 32, - .pin_base = 32, - .npins = 16, - .gc = &chip_a; -}; - -static struct pinctrl_gpio_range gpio_range_b = { - .name = "chip b", - .id = 0, - .base = 48, - .pin_base = 64, - .npins = 8, - .gc = &chip_b; -}; - -{ - struct pinctrl_dev *pctl; - ... - pinctrl_add_gpio_range(pctl, &gpio_range_a); - pinctrl_add_gpio_range(pctl, &gpio_range_b); -} +like this:: + + struct gpio_chip chip_a; + struct gpio_chip chip_b; + + static struct pinctrl_gpio_range gpio_range_a = { + .name = "chip a", + .id = 0, + .base = 32, + .pin_base = 32, + .npins = 16, + .gc = &chip_a; + }; + + static struct pinctrl_gpio_range gpio_range_b = { + .name = "chip b", + .id = 0, + .base = 48, + .pin_base = 64, + .npins = 8, + .gc = &chip_b; + }; + + { + struct pinctrl_dev *pctl; + ... + pinctrl_add_gpio_range(pctl, &gpio_range_a); + pinctrl_add_gpio_range(pctl, &gpio_range_b); + } So this complex system has one pin controller handling two different GPIO chips. "chip a" has 16 pins and "chip b" has 8 pins. The "chip a" and @@ -348,25 +352,26 @@ chip b: The above examples assume the mapping between the GPIOs and pins is linear. If the mapping is sparse or haphazard, an array of arbitrary pin -numbers can be encoded in the range like this: +numbers can be encoded in the range like this:: -static const unsigned range_pins[] = { 14, 1, 22, 17, 10, 8, 6, 2 }; + static const unsigned range_pins[] = { 14, 1, 22, 17, 10, 8, 6, 2 }; -static struct pinctrl_gpio_range gpio_range = { - .name = "chip", - .id = 0, - .base = 32, - .pins = &range_pins, - .npins = ARRAY_SIZE(range_pins), - .gc = &chip; -}; + static struct pinctrl_gpio_range gpio_range = { + .name = "chip", + .id = 0, + .base = 32, + .pins = &range_pins, + .npins = ARRAY_SIZE(range_pins), + .gc = &chip; + }; In this case the pin_base property will be ignored. If the name of a pin group is known, the pins and npins elements of the above structure can be initialised using the function pinctrl_get_group_pins(), e.g. for pin -group "foo": +group "foo":: -pinctrl_get_group_pins(pctl, "foo", &gpio_range.pins, &gpio_range.npins); + pinctrl_get_group_pins(pctl, "foo", &gpio_range.pins, + &gpio_range.npins); When GPIO-specific functions in the pin control subsystem are called, these ranges will be used to look up the appropriate pin controller by inspecting @@ -405,7 +410,7 @@ we usually mean a way of soldering or wiring the package into an electronic system, even though the framework makes it possible to also change the function at runtime. -Here is an example of a PGA (Pin Grid Array) chip seen from underneath: +Here is an example of a PGA (Pin Grid Array) chip seen from underneath:: A B C D E F G H +---+ @@ -519,12 +524,12 @@ Definitions: In the example case we can define that this particular machine shall use device spi0 with pinmux function fspi0 group gspi0 and i2c0 on function fi2c0 group gi2c0, on the primary pin controller, we get mappings - like these: + like these:: - { - {"map-spi0", spi0, pinctrl0, fspi0, gspi0}, - {"map-i2c0", i2c0, pinctrl0, fi2c0, gi2c0} - } + { + {"map-spi0", spi0, pinctrl0, fspi0, gspi0}, + {"map-i2c0", i2c0, pinctrl0, fi2c0, gi2c0} + } Every map must be assigned a state name, pin controller, device and function. The group is not compulsory - if it is omitted the first group @@ -578,155 +583,155 @@ some certain registers to activate a certain mux setting for a certain pin. A simple driver for the above example will work by setting bits 0, 1, 2, 3 or 4 into some register named MUX to select a certain function with a certain -group of pins would work something like this: - -#include <linux/pinctrl/pinctrl.h> -#include <linux/pinctrl/pinmux.h> - -struct foo_group { - const char *name; - const unsigned int *pins; - const unsigned num_pins; -}; - -static const unsigned spi0_0_pins[] = { 0, 8, 16, 24 }; -static const unsigned spi0_1_pins[] = { 38, 46, 54, 62 }; -static const unsigned i2c0_pins[] = { 24, 25 }; -static const unsigned mmc0_1_pins[] = { 56, 57 }; -static const unsigned mmc0_2_pins[] = { 58, 59 }; -static const unsigned mmc0_3_pins[] = { 60, 61, 62, 63 }; - -static const struct foo_group foo_groups[] = { - { - .name = "spi0_0_grp", - .pins = spi0_0_pins, - .num_pins = ARRAY_SIZE(spi0_0_pins), - }, +group of pins would work something like this:: + + #include <linux/pinctrl/pinctrl.h> + #include <linux/pinctrl/pinmux.h> + + struct foo_group { + const char *name; + const unsigned int *pins; + const unsigned num_pins; + }; + + static const unsigned spi0_0_pins[] = { 0, 8, 16, 24 }; + static const unsigned spi0_1_pins[] = { 38, 46, 54, 62 }; + static const unsigned i2c0_pins[] = { 24, 25 }; + static const unsigned mmc0_1_pins[] = { 56, 57 }; + static const unsigned mmc0_2_pins[] = { 58, 59 }; + static const unsigned mmc0_3_pins[] = { 60, 61, 62, 63 }; + + static const struct foo_group foo_groups[] = { + { + .name = "spi0_0_grp", + .pins = spi0_0_pins, + .num_pins = ARRAY_SIZE(spi0_0_pins), + }, + { + .name = "spi0_1_grp", + .pins = spi0_1_pins, + .num_pins = ARRAY_SIZE(spi0_1_pins), + }, + { + .name = "i2c0_grp", + .pins = i2c0_pins, + .num_pins = ARRAY_SIZE(i2c0_pins), + }, + { + .name = "mmc0_1_grp", + .pins = mmc0_1_pins, + .num_pins = ARRAY_SIZE(mmc0_1_pins), + }, + { + .name = "mmc0_2_grp", + .pins = mmc0_2_pins, + .num_pins = ARRAY_SIZE(mmc0_2_pins), + }, + { + .name = "mmc0_3_grp", + .pins = mmc0_3_pins, + .num_pins = ARRAY_SIZE(mmc0_3_pins), + }, + }; + + + static int foo_get_groups_count(struct pinctrl_dev *pctldev) { - .name = "spi0_1_grp", - .pins = spi0_1_pins, - .num_pins = ARRAY_SIZE(spi0_1_pins), - }, - { - .name = "i2c0_grp", - .pins = i2c0_pins, - .num_pins = ARRAY_SIZE(i2c0_pins), - }, + return ARRAY_SIZE(foo_groups); + } + + static const char *foo_get_group_name(struct pinctrl_dev *pctldev, + unsigned selector) { - .name = "mmc0_1_grp", - .pins = mmc0_1_pins, - .num_pins = ARRAY_SIZE(mmc0_1_pins), - }, + return foo_groups[selector].name; + } + + static int foo_get_group_pins(struct pinctrl_dev *pctldev, unsigned selector, + unsigned ** const pins, + unsigned * const num_pins) { - .name = "mmc0_2_grp", - .pins = mmc0_2_pins, - .num_pins = ARRAY_SIZE(mmc0_2_pins), - }, + *pins = (unsigned *) foo_groups[selector].pins; + *num_pins = foo_groups[selector].num_pins; + return 0; + } + + static struct pinctrl_ops foo_pctrl_ops = { + .get_groups_count = foo_get_groups_count, + .get_group_name = foo_get_group_name, + .get_group_pins = foo_get_group_pins, + }; + + struct foo_pmx_func { + const char *name; + const char * const *groups; + const unsigned num_groups; + }; + + static const char * const spi0_groups[] = { "spi0_0_grp", "spi0_1_grp" }; + static const char * const i2c0_groups[] = { "i2c0_grp" }; + static const char * const mmc0_groups[] = { "mmc0_1_grp", "mmc0_2_grp", + "mmc0_3_grp" }; + + static const struct foo_pmx_func foo_functions[] = { + { + .name = "spi0", + .groups = spi0_groups, + .num_groups = ARRAY_SIZE(spi0_groups), + }, + { + .name = "i2c0", + .groups = i2c0_groups, + .num_groups = ARRAY_SIZE(i2c0_groups), + }, + { + .name = "mmc0", + .groups = mmc0_groups, + .num_groups = ARRAY_SIZE(mmc0_groups), + }, + }; + + static int foo_get_functions_count(struct pinctrl_dev *pctldev) { - .name = "mmc0_3_grp", - .pins = mmc0_3_pins, - .num_pins = ARRAY_SIZE(mmc0_3_pins), - }, -}; - - -static int foo_get_groups_count(struct pinctrl_dev *pctldev) -{ - return ARRAY_SIZE(foo_groups); -} - -static const char *foo_get_group_name(struct pinctrl_dev *pctldev, - unsigned selector) -{ - return foo_groups[selector].name; -} - -static int foo_get_group_pins(struct pinctrl_dev *pctldev, unsigned selector, - unsigned ** const pins, - unsigned * const num_pins) -{ - *pins = (unsigned *) foo_groups[selector].pins; - *num_pins = foo_groups[selector].num_pins; - return 0; -} - -static struct pinctrl_ops foo_pctrl_ops = { - .get_groups_count = foo_get_groups_count, - .get_group_name = foo_get_group_name, - .get_group_pins = foo_get_group_pins, -}; - -struct foo_pmx_func { - const char *name; - const char * const *groups; - const unsigned num_groups; -}; - -static const char * const spi0_groups[] = { "spi0_0_grp", "spi0_1_grp" }; -static const char * const i2c0_groups[] = { "i2c0_grp" }; -static const char * const mmc0_groups[] = { "mmc0_1_grp", "mmc0_2_grp", - "mmc0_3_grp" }; - -static const struct foo_pmx_func foo_functions[] = { + return ARRAY_SIZE(foo_functions); + } + + static const char *foo_get_fname(struct pinctrl_dev *pctldev, unsigned selector) { - .name = "spi0", - .groups = spi0_groups, - .num_groups = ARRAY_SIZE(spi0_groups), - }, + return foo_functions[selector].name; + } + + static int foo_get_groups(struct pinctrl_dev *pctldev, unsigned selector, + const char * const **groups, + unsigned * const num_groups) { - .name = "i2c0", - .groups = i2c0_groups, - .num_groups = ARRAY_SIZE(i2c0_groups), - }, + *groups = foo_functions[selector].groups; + *num_groups = foo_functions[selector].num_groups; + return 0; + } + + static int foo_set_mux(struct pinctrl_dev *pctldev, unsigned selector, + unsigned group) { - .name = "mmc0", - .groups = mmc0_groups, - .num_groups = ARRAY_SIZE(mmc0_groups), - }, -}; - -static int foo_get_functions_count(struct pinctrl_dev *pctldev) -{ - return ARRAY_SIZE(foo_functions); -} - -static const char *foo_get_fname(struct pinctrl_dev *pctldev, unsigned selector) -{ - return foo_functions[selector].name; -} - -static int foo_get_groups(struct pinctrl_dev *pctldev, unsigned selector, - const char * const **groups, - unsigned * const num_groups) -{ - *groups = foo_functions[selector].groups; - *num_groups = foo_functions[selector].num_groups; - return 0; -} - -static int foo_set_mux(struct pinctrl_dev *pctldev, unsigned selector, - unsigned group) -{ - u8 regbit = (1 << selector + group); - - writeb((readb(MUX)|regbit), MUX) - return 0; -} - -static struct pinmux_ops foo_pmxops = { - .get_functions_count = foo_get_functions_count, - .get_function_name = foo_get_fname, - .get_function_groups = foo_get_groups, - .set_mux = foo_set_mux, - .strict = true, -}; - -/* Pinmux operations are handled by some pin controller */ -static struct pinctrl_desc foo_desc = { - ... - .pctlops = &foo_pctrl_ops, - .pmxops = &foo_pmxops, -}; + u8 regbit = (1 << selector + group); + + writeb((readb(MUX)|regbit), MUX) + return 0; + } + + static struct pinmux_ops foo_pmxops = { + .get_functions_count = foo_get_functions_count, + .get_function_name = foo_get_fname, + .get_function_groups = foo_get_groups, + .set_mux = foo_set_mux, + .strict = true, + }; + + /* Pinmux operations are handled by some pin controller */ + static struct pinctrl_desc foo_desc = { + ... + .pctlops = &foo_pctrl_ops, + .pmxops = &foo_pmxops, + }; In the example activating muxing 0 and 1 at the same time setting bits 0 and 1, uses one pin in common so they would collide. @@ -809,9 +814,9 @@ for a device. The GPIO portions of a pin and its relation to a certain pin controller configuration and muxing logic can be constructed in several ways. Here -are two examples: +are two examples:: -(A) + (A) pin config logic regs | +- SPI @@ -840,7 +845,9 @@ simultaneous access to the same pin from GPIO and pin multiplexing consumers on hardware of this type. The pinctrl driver should set this flag accordingly. -(B) +:: + + (B) pin config logic regs @@ -911,52 +918,55 @@ has to be handled by the <linux/gpio.h> interface. Instead view this as a certain pin config setting. Look in e.g. <linux/pinctrl/pinconf-generic.h> and you find this in the documentation: - PIN_CONFIG_OUTPUT: this will configure the pin in output, use argument + PIN_CONFIG_OUTPUT: + this will configure the pin in output, use argument 1 to indicate high level, argument 0 to indicate low level. So it is perfectly possible to push a pin into "GPIO mode" and drive the line low as part of the usual pin control map. So for example your UART -driver may look like this: +driver may look like this:: -#include <linux/pinctrl/consumer.h> + #include <linux/pinctrl/consumer.h> -struct pinctrl *pinctrl; -struct pinctrl_state *pins_default; -struct pinctrl_state *pins_sleep; + struct pinctrl *pinctrl; + struct pinctrl_state *pins_default; + struct pinctrl_state *pins_sleep; -pins_default = pinctrl_lookup_state(uap->pinctrl, PINCTRL_STATE_DEFAULT); -pins_sleep = pinctrl_lookup_state(uap->pinctrl, PINCTRL_STATE_SLEEP); + pins_default = pinctrl_lookup_state(uap->pinctrl, PINCTRL_STATE_DEFAULT); + pins_sleep = pinctrl_lookup_state(uap->pinctrl, PINCTRL_STATE_SLEEP); -/* Normal mode */ -retval = pinctrl_select_state(pinctrl, pins_default); -/* Sleep mode */ -retval = pinctrl_select_state(pinctrl, pins_sleep); + /* Normal mode */ + retval = pinctrl_select_state(pinctrl, pins_default); + /* Sleep mode */ + retval = pinctrl_select_state(pinctrl, pins_sleep); And your machine configuration may look like this: -------------------------------------------------- -static unsigned long uart_default_mode[] = { - PIN_CONF_PACKED(PIN_CONFIG_DRIVE_PUSH_PULL, 0), -}; - -static unsigned long uart_sleep_mode[] = { - PIN_CONF_PACKED(PIN_CONFIG_OUTPUT, 0), -}; - -static struct pinctrl_map pinmap[] __initdata = { - PIN_MAP_MUX_GROUP("uart", PINCTRL_STATE_DEFAULT, "pinctrl-foo", - "u0_group", "u0"), - PIN_MAP_CONFIGS_PIN("uart", PINCTRL_STATE_DEFAULT, "pinctrl-foo", - "UART_TX_PIN", uart_default_mode), - PIN_MAP_MUX_GROUP("uart", PINCTRL_STATE_SLEEP, "pinctrl-foo", - "u0_group", "gpio-mode"), - PIN_MAP_CONFIGS_PIN("uart", PINCTRL_STATE_SLEEP, "pinctrl-foo", - "UART_TX_PIN", uart_sleep_mode), -}; - -foo_init(void) { - pinctrl_register_mappings(pinmap, ARRAY_SIZE(pinmap)); -} +:: + + static unsigned long uart_default_mode[] = { + PIN_CONF_PACKED(PIN_CONFIG_DRIVE_PUSH_PULL, 0), + }; + + static unsigned long uart_sleep_mode[] = { + PIN_CONF_PACKED(PIN_CONFIG_OUTPUT, 0), + }; + + static struct pinctrl_map pinmap[] __initdata = { + PIN_MAP_MUX_GROUP("uart", PINCTRL_STATE_DEFAULT, "pinctrl-foo", + "u0_group", "u0"), + PIN_MAP_CONFIGS_PIN("uart", PINCTRL_STATE_DEFAULT, "pinctrl-foo", + "UART_TX_PIN", uart_default_mode), + PIN_MAP_MUX_GROUP("uart", PINCTRL_STATE_SLEEP, "pinctrl-foo", + "u0_group", "gpio-mode"), + PIN_MAP_CONFIGS_PIN("uart", PINCTRL_STATE_SLEEP, "pinctrl-foo", + "UART_TX_PIN", uart_sleep_mode), + }; + + foo_init(void) { + pinctrl_register_mappings(pinmap, ARRAY_SIZE(pinmap)); + } Here the pins we want to control are in the "u0_group" and there is some function called "u0" that can be enabled on this group of pins, and then @@ -985,7 +995,7 @@ API. Board/machine configuration -================================== +=========================== Boards and machines define how a certain complete running system is put together, including how GPIOs and devices are muxed, how regulators are @@ -994,33 +1004,33 @@ part of this. A pin controller configuration for a machine looks pretty much like a simple regulator configuration, so for the example array above we want to enable i2c -and spi on the second function mapping: - -#include <linux/pinctrl/machine.h> - -static const struct pinctrl_map mapping[] __initconst = { - { - .dev_name = "foo-spi.0", - .name = PINCTRL_STATE_DEFAULT, - .type = PIN_MAP_TYPE_MUX_GROUP, - .ctrl_dev_name = "pinctrl-foo", - .data.mux.function = "spi0", - }, - { - .dev_name = "foo-i2c.0", - .name = PINCTRL_STATE_DEFAULT, - .type = PIN_MAP_TYPE_MUX_GROUP, - .ctrl_dev_name = "pinctrl-foo", - .data.mux.function = "i2c0", - }, - { - .dev_name = "foo-mmc.0", - .name = PINCTRL_STATE_DEFAULT, - .type = PIN_MAP_TYPE_MUX_GROUP, - .ctrl_dev_name = "pinctrl-foo", - .data.mux.function = "mmc0", - }, -}; +and spi on the second function mapping:: + + #include <linux/pinctrl/machine.h> + + static const struct pinctrl_map mapping[] __initconst = { + { + .dev_name = "foo-spi.0", + .name = PINCTRL_STATE_DEFAULT, + .type = PIN_MAP_TYPE_MUX_GROUP, + .ctrl_dev_name = "pinctrl-foo", + .data.mux.function = "spi0", + }, + { + .dev_name = "foo-i2c.0", + .name = PINCTRL_STATE_DEFAULT, + .type = PIN_MAP_TYPE_MUX_GROUP, + .ctrl_dev_name = "pinctrl-foo", + .data.mux.function = "i2c0", + }, + { + .dev_name = "foo-mmc.0", + .name = PINCTRL_STATE_DEFAULT, + .type = PIN_MAP_TYPE_MUX_GROUP, + .ctrl_dev_name = "pinctrl-foo", + .data.mux.function = "mmc0", + }, + }; The dev_name here matches to the unique device name that can be used to look up the device struct (just like with clockdev or regulators). The function name @@ -1029,76 +1039,81 @@ must match a function provided by the pinmux driver handling this pin range. As you can see we may have several pin controllers on the system and thus we need to specify which one of them contains the functions we wish to map. -You register this pinmux mapping to the pinmux subsystem by simply: +You register this pinmux mapping to the pinmux subsystem by simply:: ret = pinctrl_register_mappings(mapping, ARRAY_SIZE(mapping)); Since the above construct is pretty common there is a helper macro to make it even more compact which assumes you want to use pinctrl-foo and position -0 for mapping, for example: +0 for mapping, for example:: -static struct pinctrl_map mapping[] __initdata = { - PIN_MAP_MUX_GROUP("foo-i2c.o", PINCTRL_STATE_DEFAULT, "pinctrl-foo", NULL, "i2c0"), -}; + static struct pinctrl_map mapping[] __initdata = { + PIN_MAP_MUX_GROUP("foo-i2c.o", PINCTRL_STATE_DEFAULT, + "pinctrl-foo", NULL, "i2c0"), + }; The mapping table may also contain pin configuration entries. It's common for each pin/group to have a number of configuration entries that affect it, so the table entries for configuration reference an array of config parameters -and values. An example using the convenience macros is shown below: - -static unsigned long i2c_grp_configs[] = { - FOO_PIN_DRIVEN, - FOO_PIN_PULLUP, -}; - -static unsigned long i2c_pin_configs[] = { - FOO_OPEN_COLLECTOR, - FOO_SLEW_RATE_SLOW, -}; - -static struct pinctrl_map mapping[] __initdata = { - PIN_MAP_MUX_GROUP("foo-i2c.0", PINCTRL_STATE_DEFAULT, "pinctrl-foo", "i2c0", "i2c0"), - PIN_MAP_CONFIGS_GROUP("foo-i2c.0", PINCTRL_STATE_DEFAULT, "pinctrl-foo", "i2c0", i2c_grp_configs), - PIN_MAP_CONFIGS_PIN("foo-i2c.0", PINCTRL_STATE_DEFAULT, "pinctrl-foo", "i2c0scl", i2c_pin_configs), - PIN_MAP_CONFIGS_PIN("foo-i2c.0", PINCTRL_STATE_DEFAULT, "pinctrl-foo", "i2c0sda", i2c_pin_configs), -}; +and values. An example using the convenience macros is shown below:: + + static unsigned long i2c_grp_configs[] = { + FOO_PIN_DRIVEN, + FOO_PIN_PULLUP, + }; + + static unsigned long i2c_pin_configs[] = { + FOO_OPEN_COLLECTOR, + FOO_SLEW_RATE_SLOW, + }; + + static struct pinctrl_map mapping[] __initdata = { + PIN_MAP_MUX_GROUP("foo-i2c.0", PINCTRL_STATE_DEFAULT, + "pinctrl-foo", "i2c0", "i2c0"), + PIN_MAP_CONFIGS_GROUP("foo-i2c.0", PINCTRL_STATE_DEFAULT, + "pinctrl-foo", "i2c0", i2c_grp_configs), + PIN_MAP_CONFIGS_PIN("foo-i2c.0", PINCTRL_STATE_DEFAULT, + "pinctrl-foo", "i2c0scl", i2c_pin_configs), + PIN_MAP_CONFIGS_PIN("foo-i2c.0", PINCTRL_STATE_DEFAULT, + "pinctrl-foo", "i2c0sda", i2c_pin_configs), + }; Finally, some devices expect the mapping table to contain certain specific named states. When running on hardware that doesn't need any pin controller configuration, the mapping table must still contain those named states, in order to explicitly indicate that the states were provided and intended to be empty. Table entry macro PIN_MAP_DUMMY_STATE serves the purpose of defining -a named state without causing any pin controller to be programmed: +a named state without causing any pin controller to be programmed:: -static struct pinctrl_map mapping[] __initdata = { - PIN_MAP_DUMMY_STATE("foo-i2c.0", PINCTRL_STATE_DEFAULT), -}; + static struct pinctrl_map mapping[] __initdata = { + PIN_MAP_DUMMY_STATE("foo-i2c.0", PINCTRL_STATE_DEFAULT), + }; Complex mappings ================ As it is possible to map a function to different groups of pins an optional -.group can be specified like this: - -... -{ - .dev_name = "foo-spi.0", - .name = "spi0-pos-A", - .type = PIN_MAP_TYPE_MUX_GROUP, - .ctrl_dev_name = "pinctrl-foo", - .function = "spi0", - .group = "spi0_0_grp", -}, -{ - .dev_name = "foo-spi.0", - .name = "spi0-pos-B", - .type = PIN_MAP_TYPE_MUX_GROUP, - .ctrl_dev_name = "pinctrl-foo", - .function = "spi0", - .group = "spi0_1_grp", -}, -... +.group can be specified like this:: + + ... + { + .dev_name = "foo-spi.0", + .name = "spi0-pos-A", + .type = PIN_MAP_TYPE_MUX_GROUP, + .ctrl_dev_name = "pinctrl-foo", + .function = "spi0", + .group = "spi0_0_grp", + }, + { + .dev_name = "foo-spi.0", + .name = "spi0-pos-B", + .type = PIN_MAP_TYPE_MUX_GROUP, + .ctrl_dev_name = "pinctrl-foo", + .function = "spi0", + .group = "spi0_1_grp", + }, + ... This example mapping is used to switch between two positions for spi0 at runtime, as described further below under the heading "Runtime pinmuxing". @@ -1107,67 +1122,67 @@ Further it is possible for one named state to affect the muxing of several groups of pins, say for example in the mmc0 example above, where you can additively expand the mmc0 bus from 2 to 4 to 8 pins. If we want to use all three groups for a total of 2+2+4 = 8 pins (for an 8-bit MMC bus as is the -case), we define a mapping like this: - -... -{ - .dev_name = "foo-mmc.0", - .name = "2bit" - .type = PIN_MAP_TYPE_MUX_GROUP, - .ctrl_dev_name = "pinctrl-foo", - .function = "mmc0", - .group = "mmc0_1_grp", -}, -{ - .dev_name = "foo-mmc.0", - .name = "4bit" - .type = PIN_MAP_TYPE_MUX_GROUP, - .ctrl_dev_name = "pinctrl-foo", - .function = "mmc0", - .group = "mmc0_1_grp", -}, -{ - .dev_name = "foo-mmc.0", - .name = "4bit" - .type = PIN_MAP_TYPE_MUX_GROUP, - .ctrl_dev_name = "pinctrl-foo", - .function = "mmc0", - .group = "mmc0_2_grp", -}, -{ - .dev_name = "foo-mmc.0", - .name = "8bit" - .type = PIN_MAP_TYPE_MUX_GROUP, - .ctrl_dev_name = "pinctrl-foo", - .function = "mmc0", - .group = "mmc0_1_grp", -}, -{ - .dev_name = "foo-mmc.0", - .name = "8bit" - .type = PIN_MAP_TYPE_MUX_GROUP, - .ctrl_dev_name = "pinctrl-foo", - .function = "mmc0", - .group = "mmc0_2_grp", -}, -{ - .dev_name = "foo-mmc.0", - .name = "8bit" - .type = PIN_MAP_TYPE_MUX_GROUP, - .ctrl_dev_name = "pinctrl-foo", - .function = "mmc0", - .group = "mmc0_3_grp", -}, -... +case), we define a mapping like this:: + + ... + { + .dev_name = "foo-mmc.0", + .name = "2bit" + .type = PIN_MAP_TYPE_MUX_GROUP, + .ctrl_dev_name = "pinctrl-foo", + .function = "mmc0", + .group = "mmc0_1_grp", + }, + { + .dev_name = "foo-mmc.0", + .name = "4bit" + .type = PIN_MAP_TYPE_MUX_GROUP, + .ctrl_dev_name = "pinctrl-foo", + .function = "mmc0", + .group = "mmc0_1_grp", + }, + { + .dev_name = "foo-mmc.0", + .name = "4bit" + .type = PIN_MAP_TYPE_MUX_GROUP, + .ctrl_dev_name = "pinctrl-foo", + .function = "mmc0", + .group = "mmc0_2_grp", + }, + { + .dev_name = "foo-mmc.0", + .name = "8bit" + .type = PIN_MAP_TYPE_MUX_GROUP, + .ctrl_dev_name = "pinctrl-foo", + .function = "mmc0", + .group = "mmc0_1_grp", + }, + { + .dev_name = "foo-mmc.0", + .name = "8bit" + .type = PIN_MAP_TYPE_MUX_GROUP, + .ctrl_dev_name = "pinctrl-foo", + .function = "mmc0", + .group = "mmc0_2_grp", + }, + { + .dev_name = "foo-mmc.0", + .name = "8bit" + .type = PIN_MAP_TYPE_MUX_GROUP, + .ctrl_dev_name = "pinctrl-foo", + .function = "mmc0", + .group = "mmc0_3_grp", + }, + ... The result of grabbing this mapping from the device with something like -this (see next paragraph): +this (see next paragraph):: p = devm_pinctrl_get(dev); s = pinctrl_lookup_state(p, "8bit"); ret = pinctrl_select_state(p, s); -or more simply: +or more simply:: p = devm_pinctrl_get_select(dev, "8bit"); @@ -1205,39 +1220,39 @@ PINCTRL_STATE_SLEEP at runtime, re-biasing or even re-muxing pins to save current in sleep mode. A driver may request a certain control state to be activated, usually just the -default state like this: +default state like this:: -#include <linux/pinctrl/consumer.h> + #include <linux/pinctrl/consumer.h> -struct foo_state { - struct pinctrl *p; - struct pinctrl_state *s; - ... -}; + struct foo_state { + struct pinctrl *p; + struct pinctrl_state *s; + ... + }; -foo_probe() -{ - /* Allocate a state holder named "foo" etc */ - struct foo_state *foo = ...; + foo_probe() + { + /* Allocate a state holder named "foo" etc */ + struct foo_state *foo = ...; - foo->p = devm_pinctrl_get(&device); - if (IS_ERR(foo->p)) { - /* FIXME: clean up "foo" here */ - return PTR_ERR(foo->p); - } + foo->p = devm_pinctrl_get(&device); + if (IS_ERR(foo->p)) { + /* FIXME: clean up "foo" here */ + return PTR_ERR(foo->p); + } - foo->s = pinctrl_lookup_state(foo->p, PINCTRL_STATE_DEFAULT); - if (IS_ERR(foo->s)) { - /* FIXME: clean up "foo" here */ - return PTR_ERR(s); - } + foo->s = pinctrl_lookup_state(foo->p, PINCTRL_STATE_DEFAULT); + if (IS_ERR(foo->s)) { + /* FIXME: clean up "foo" here */ + return PTR_ERR(s); + } - ret = pinctrl_select_state(foo->s); - if (ret < 0) { - /* FIXME: clean up "foo" here */ - return ret; + ret = pinctrl_select_state(foo->s); + if (ret < 0) { + /* FIXME: clean up "foo" here */ + return ret; + } } -} This get/lookup/select/put sequence can just as well be handled by bus drivers if you don't want each and every driver to handle it and you know the @@ -1299,16 +1314,16 @@ Drivers needing both pin control and GPIOs Again, it is discouraged to let drivers lookup and select pin control states themselves, but again sometimes this is unavoidable. -So say that your driver is fetching its resources like this: +So say that your driver is fetching its resources like this:: -#include <linux/pinctrl/consumer.h> -#include <linux/gpio.h> + #include <linux/pinctrl/consumer.h> + #include <linux/gpio.h> -struct pinctrl *pinctrl; -int gpio; + struct pinctrl *pinctrl; + int gpio; -pinctrl = devm_pinctrl_get_select_default(&dev); -gpio = devm_gpio_request(&dev, 14, "foo"); + pinctrl = devm_pinctrl_get_select_default(&dev); + gpio = devm_gpio_request(&dev, 14, "foo"); Here we first request a certain pin state and then request GPIO 14 to be used. If you're using the subsystems orthogonally like this, you should @@ -1347,21 +1362,22 @@ lookup_state() and select_state() on it immediately after the pin control device has been registered. This occurs for mapping table entries where the client device name is equal -to the pin controller device name, and the state name is PINCTRL_STATE_DEFAULT. +to the pin controller device name, and the state name is PINCTRL_STATE_DEFAULT:: -{ - .dev_name = "pinctrl-foo", - .name = PINCTRL_STATE_DEFAULT, - .type = PIN_MAP_TYPE_MUX_GROUP, - .ctrl_dev_name = "pinctrl-foo", - .function = "power_func", -}, + { + .dev_name = "pinctrl-foo", + .name = PINCTRL_STATE_DEFAULT, + .type = PIN_MAP_TYPE_MUX_GROUP, + .ctrl_dev_name = "pinctrl-foo", + .function = "power_func", + }, Since it may be common to request the core to hog a few always-applicable mux settings on the primary pin controller, there is a convenience macro for -this: +this:: -PIN_MAP_MUX_GROUP_HOG_DEFAULT("pinctrl-foo", NULL /* group */, "power_func") + PIN_MAP_MUX_GROUP_HOG_DEFAULT("pinctrl-foo", NULL /* group */, + "power_func") This gives the exact same result as the above construction. @@ -1378,45 +1394,45 @@ function, but with different named in the mapping as described under This snippet first initializes a state object for both groups (in foo_probe()), then muxes the function in the pins defined by group A, and finally muxes it in -on the pins defined by group B: +on the pins defined by group B:: -#include <linux/pinctrl/consumer.h> + #include <linux/pinctrl/consumer.h> -struct pinctrl *p; -struct pinctrl_state *s1, *s2; + struct pinctrl *p; + struct pinctrl_state *s1, *s2; -foo_probe() -{ - /* Setup */ - p = devm_pinctrl_get(&device); - if (IS_ERR(p)) - ... + foo_probe() + { + /* Setup */ + p = devm_pinctrl_get(&device); + if (IS_ERR(p)) + ... + + s1 = pinctrl_lookup_state(foo->p, "pos-A"); + if (IS_ERR(s1)) + ... + + s2 = pinctrl_lookup_state(foo->p, "pos-B"); + if (IS_ERR(s2)) + ... + } - s1 = pinctrl_lookup_state(foo->p, "pos-A"); - if (IS_ERR(s1)) + foo_switch() + { + /* Enable on position A */ + ret = pinctrl_select_state(s1); + if (ret < 0) ... - s2 = pinctrl_lookup_state(foo->p, "pos-B"); - if (IS_ERR(s2)) ... -} - -foo_switch() -{ - /* Enable on position A */ - ret = pinctrl_select_state(s1); - if (ret < 0) - ... - - ... - /* Enable on position B */ - ret = pinctrl_select_state(s2); - if (ret < 0) - ... + /* Enable on position B */ + ret = pinctrl_select_state(s2); + if (ret < 0) + ... - ... -} + ... + } The above has to be done from process context. The reservation of the pins will be done when the state is activated, so in effect one specific pin diff --git a/Documentation/driver-api/rapidio.rst b/Documentation/driver-api/rapidio.rst new file mode 100644 index 000000000000..71ff658ab78e --- /dev/null +++ b/Documentation/driver-api/rapidio.rst @@ -0,0 +1,107 @@ +======================= +RapidIO Subsystem Guide +======================= + +:Author: Matt Porter + +Introduction +============ + +RapidIO is a high speed switched fabric interconnect with features aimed +at the embedded market. RapidIO provides support for memory-mapped I/O +as well as message-based transactions over the switched fabric network. +RapidIO has a standardized discovery mechanism not unlike the PCI bus +standard that allows simple detection of devices in a network. + +This documentation is provided for developers intending to support +RapidIO on new architectures, write new drivers, or to understand the +subsystem internals. + +Known Bugs and Limitations +========================== + +Bugs +---- + +None. ;) + +Limitations +----------- + +1. Access/management of RapidIO memory regions is not supported + +2. Multiple host enumeration is not supported + +RapidIO driver interface +======================== + +Drivers are provided a set of calls in order to interface with the +subsystem to gather info on devices, request/map memory region +resources, and manage mailboxes/doorbells. + +Functions +--------- + +.. kernel-doc:: include/linux/rio_drv.h + :internal: + +.. kernel-doc:: drivers/rapidio/rio-driver.c + :export: + +.. kernel-doc:: drivers/rapidio/rio.c + :export: + +Internals +========= + +This chapter contains the autogenerated documentation of the RapidIO +subsystem. + +Structures +---------- + +.. kernel-doc:: include/linux/rio.h + :internal: + +Enumeration and Discovery +------------------------- + +.. kernel-doc:: drivers/rapidio/rio-scan.c + :internal: + +Driver functionality +-------------------- + +.. kernel-doc:: drivers/rapidio/rio.c + :internal: + +.. kernel-doc:: drivers/rapidio/rio-access.c + :internal: + +Device model support +-------------------- + +.. kernel-doc:: drivers/rapidio/rio-driver.c + :internal: + +PPC32 support +------------- + +.. kernel-doc:: arch/powerpc/sysdev/fsl_rio.c + :internal: + +Credits +======= + +The following people have contributed to the RapidIO subsystem directly +or indirectly: + +1. Matt Porter\ mporter@kernel.crashing.org + +2. Randy Vinson\ rvinson@mvista.com + +3. Dan Malek\ dan@embeddedalley.com + +The following people have contributed to this document: + +1. Matt Porter\ mporter@kernel.crashing.org diff --git a/Documentation/driver-api/s390-drivers.rst b/Documentation/driver-api/s390-drivers.rst new file mode 100644 index 000000000000..7060da136095 --- /dev/null +++ b/Documentation/driver-api/s390-drivers.rst @@ -0,0 +1,111 @@ +=================================== +Writing s390 channel device drivers +=================================== + +:Author: Cornelia Huck + +Introduction +============ + +This document describes the interfaces available for device drivers that +drive s390 based channel attached I/O devices. This includes interfaces +for interaction with the hardware and interfaces for interacting with +the common driver core. Those interfaces are provided by the s390 common +I/O layer. + +The document assumes a familarity with the technical terms associated +with the s390 channel I/O architecture. For a description of this +architecture, please refer to the "z/Architecture: Principles of +Operation", IBM publication no. SA22-7832. + +While most I/O devices on a s390 system are typically driven through the +channel I/O mechanism described here, there are various other methods +(like the diag interface). These are out of the scope of this document. + +Some additional information can also be found in the kernel source under +Documentation/s390/driver-model.txt. + +The ccw bus +=========== + +The ccw bus typically contains the majority of devices available to a +s390 system. Named after the channel command word (ccw), the basic +command structure used to address its devices, the ccw bus contains +so-called channel attached devices. They are addressed via I/O +subchannels, visible on the css bus. A device driver for +channel-attached devices, however, will never interact with the +subchannel directly, but only via the I/O device on the ccw bus, the ccw +device. + +I/O functions for channel-attached devices +------------------------------------------ + +Some hardware structures have been translated into C structures for use +by the common I/O layer and device drivers. For more information on the +hardware structures represented here, please consult the Principles of +Operation. + +.. kernel-doc:: arch/s390/include/asm/cio.h + :internal: + +ccw devices +----------- + +Devices that want to initiate channel I/O need to attach to the ccw bus. +Interaction with the driver core is done via the common I/O layer, which +provides the abstractions of ccw devices and ccw device drivers. + +The functions that initiate or terminate channel I/O all act upon a ccw +device structure. Device drivers must not bypass those functions or +strange side effects may happen. + +.. kernel-doc:: arch/s390/include/asm/ccwdev.h + :internal: + +.. kernel-doc:: drivers/s390/cio/device.c + :export: + +.. kernel-doc:: drivers/s390/cio/device_ops.c + :export: + +The channel-measurement facility +-------------------------------- + +The channel-measurement facility provides a means to collect measurement +data which is made available by the channel subsystem for each channel +attached device. + +.. kernel-doc:: arch/s390/include/asm/cmb.h + :internal: + +.. kernel-doc:: drivers/s390/cio/cmf.c + :export: + +The ccwgroup bus +================ + +The ccwgroup bus only contains artificial devices, created by the user. +Many networking devices (e.g. qeth) are in fact composed of several ccw +devices (like read, write and data channel for qeth). The ccwgroup bus +provides a mechanism to create a meta-device which contains those ccw +devices as slave devices and can be associated with the netdevice. + +ccw group devices +----------------- + +.. kernel-doc:: arch/s390/include/asm/ccwgroup.h + :internal: + +.. kernel-doc:: drivers/s390/cio/ccwgroup.c + :export: + +Generic interfaces +================== + +Some interfaces are available to other drivers that do not necessarily +have anything to do with the busses described above, but still are +indirectly using basic infrastructure in the common I/O layer. One +example is the support for adapter interrupts. + +.. kernel-doc:: drivers/s390/cio/airq.c + :export: diff --git a/Documentation/driver-api/scsi.rst b/Documentation/driver-api/scsi.rst new file mode 100644 index 000000000000..859fb672319f --- /dev/null +++ b/Documentation/driver-api/scsi.rst @@ -0,0 +1,344 @@ +===================== +SCSI Interfaces Guide +===================== + +:Author: James Bottomley +:Author: Rob Landley + +Introduction +============ + +Protocol vs bus +--------------- + +Once upon a time, the Small Computer Systems Interface defined both a +parallel I/O bus and a data protocol to connect a wide variety of +peripherals (disk drives, tape drives, modems, printers, scanners, +optical drives, test equipment, and medical devices) to a host computer. + +Although the old parallel (fast/wide/ultra) SCSI bus has largely fallen +out of use, the SCSI command set is more widely used than ever to +communicate with devices over a number of different busses. + +The `SCSI protocol <http://www.t10.org/scsi-3.htm>`__ is a big-endian +peer-to-peer packet based protocol. SCSI commands are 6, 10, 12, or 16 +bytes long, often followed by an associated data payload. + +SCSI commands can be transported over just about any kind of bus, and +are the default protocol for storage devices attached to USB, SATA, SAS, +Fibre Channel, FireWire, and ATAPI devices. SCSI packets are also +commonly exchanged over Infiniband, +`I20 <http://i2o.shadowconnect.com/faq.php>`__, TCP/IP +(`iSCSI <https://en.wikipedia.org/wiki/ISCSI>`__), even `Parallel +ports <http://cyberelk.net/tim/parport/parscsi.html>`__. + +Design of the Linux SCSI subsystem +---------------------------------- + +The SCSI subsystem uses a three layer design, with upper, mid, and low +layers. Every operation involving the SCSI subsystem (such as reading a +sector from a disk) uses one driver at each of the 3 levels: one upper +layer driver, one lower layer driver, and the SCSI midlayer. + +The SCSI upper layer provides the interface between userspace and the +kernel, in the form of block and char device nodes for I/O and ioctl(). +The SCSI lower layer contains drivers for specific hardware devices. + +In between is the SCSI mid-layer, analogous to a network routing layer +such as the IPv4 stack. The SCSI mid-layer routes a packet based data +protocol between the upper layer's /dev nodes and the corresponding +devices in the lower layer. It manages command queues, provides error +handling and power management functions, and responds to ioctl() +requests. + +SCSI upper layer +================ + +The upper layer supports the user-kernel interface by providing device +nodes. + +sd (SCSI Disk) +-------------- + +sd (sd_mod.o) + +sr (SCSI CD-ROM) +---------------- + +sr (sr_mod.o) + +st (SCSI Tape) +-------------- + +st (st.o) + +sg (SCSI Generic) +----------------- + +sg (sg.o) + +ch (SCSI Media Changer) +----------------------- + +ch (ch.c) + +SCSI mid layer +============== + +SCSI midlayer implementation +---------------------------- + +include/scsi/scsi_device.h +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. kernel-doc:: include/scsi/scsi_device.h + :internal: + +drivers/scsi/scsi.c +~~~~~~~~~~~~~~~~~~~ + +Main file for the SCSI midlayer. + +.. kernel-doc:: drivers/scsi/scsi.c + :export: + +drivers/scsi/scsicam.c +~~~~~~~~~~~~~~~~~~~~~~ + +`SCSI Common Access +Method <http://www.t10.org/ftp/t10/drafts/cam/cam-r12b.pdf>`__ support +functions, for use with HDIO_GETGEO, etc. + +.. kernel-doc:: drivers/scsi/scsicam.c + :export: + +drivers/scsi/scsi_error.c +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Common SCSI error/timeout handling routines. + +.. kernel-doc:: drivers/scsi/scsi_error.c + :export: + +drivers/scsi/scsi_devinfo.c +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Manage scsi_dev_info_list, which tracks blacklisted and whitelisted +devices. + +.. kernel-doc:: drivers/scsi/scsi_devinfo.c + :internal: + +drivers/scsi/scsi_ioctl.c +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Handle ioctl() calls for SCSI devices. + +.. kernel-doc:: drivers/scsi/scsi_ioctl.c + :export: + +drivers/scsi/scsi_lib.c +~~~~~~~~~~~~~~~~~~~~~~~~ + +SCSI queuing library. + +.. kernel-doc:: drivers/scsi/scsi_lib.c + :export: + +drivers/scsi/scsi_lib_dma.c +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +SCSI library functions depending on DMA (map and unmap scatter-gather +lists). + +.. kernel-doc:: drivers/scsi/scsi_lib_dma.c + :export: + +drivers/scsi/scsi_module.c +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The file drivers/scsi/scsi_module.c contains legacy support for +old-style host templates. It should never be used by any new driver. + +drivers/scsi/scsi_proc.c +~~~~~~~~~~~~~~~~~~~~~~~~~ + +The functions in this file provide an interface between the PROC file +system and the SCSI device drivers It is mainly used for debugging, +statistics and to pass information directly to the lowlevel driver. I.E. +plumbing to manage /proc/scsi/\* + +.. kernel-doc:: drivers/scsi/scsi_proc.c + :internal: + +drivers/scsi/scsi_netlink.c +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Infrastructure to provide async events from transports to userspace via +netlink, using a single NETLINK_SCSITRANSPORT protocol for all +transports. See `the original patch +submission <http://marc.info/?l=linux-scsi&m=115507374832500&w=2>`__ for +more details. + +.. kernel-doc:: drivers/scsi/scsi_netlink.c + :internal: + +drivers/scsi/scsi_scan.c +~~~~~~~~~~~~~~~~~~~~~~~~~ + +Scan a host to determine which (if any) devices are attached. The +general scanning/probing algorithm is as follows, exceptions are made to +it depending on device specific flags, compilation options, and global +variable (boot or module load time) settings. A specific LUN is scanned +via an INQUIRY command; if the LUN has a device attached, a scsi_device +is allocated and setup for it. For every id of every channel on the +given host, start by scanning LUN 0. Skip hosts that don't respond at +all to a scan of LUN 0. Otherwise, if LUN 0 has a device attached, +allocate and setup a scsi_device for it. If target is SCSI-3 or up, +issue a REPORT LUN, and scan all of the LUNs returned by the REPORT LUN; +else, sequentially scan LUNs up until some maximum is reached, or a LUN +is seen that cannot have a device attached to it. + +.. kernel-doc:: drivers/scsi/scsi_scan.c + :internal: + +drivers/scsi/scsi_sysctl.c +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Set up the sysctl entry: "/dev/scsi/logging_level" +(DEV_SCSI_LOGGING_LEVEL) which sets/returns scsi_logging_level. + +drivers/scsi/scsi_sysfs.c +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +SCSI sysfs interface routines. + +.. kernel-doc:: drivers/scsi/scsi_sysfs.c + :export: + +drivers/scsi/hosts.c +~~~~~~~~~~~~~~~~~~~~ + +mid to lowlevel SCSI driver interface + +.. kernel-doc:: drivers/scsi/hosts.c + :export: + +drivers/scsi/constants.c +~~~~~~~~~~~~~~~~~~~~~~~~ + +mid to lowlevel SCSI driver interface + +.. kernel-doc:: drivers/scsi/constants.c + :export: + +Transport classes +----------------- + +Transport classes are service libraries for drivers in the SCSI lower +layer, which expose transport attributes in sysfs. + +Fibre Channel transport +~~~~~~~~~~~~~~~~~~~~~~~ + +The file drivers/scsi/scsi_transport_fc.c defines transport attributes +for Fibre Channel. + +.. kernel-doc:: drivers/scsi/scsi_transport_fc.c + :export: + +iSCSI transport class +~~~~~~~~~~~~~~~~~~~~~ + +The file drivers/scsi/scsi_transport_iscsi.c defines transport +attributes for the iSCSI class, which sends SCSI packets over TCP/IP +connections. + +.. kernel-doc:: drivers/scsi/scsi_transport_iscsi.c + :export: + +Serial Attached SCSI (SAS) transport class +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The file drivers/scsi/scsi_transport_sas.c defines transport +attributes for Serial Attached SCSI, a variant of SATA aimed at large +high-end systems. + +The SAS transport class contains common code to deal with SAS HBAs, an +aproximated representation of SAS topologies in the driver model, and +various sysfs attributes to expose these topologies and management +interfaces to userspace. + +In addition to the basic SCSI core objects this transport class +introduces two additional intermediate objects: The SAS PHY as +represented by struct sas_phy defines an "outgoing" PHY on a SAS HBA or +Expander, and the SAS remote PHY represented by struct sas_rphy defines +an "incoming" PHY on a SAS Expander or end device. Note that this is +purely a software concept, the underlying hardware for a PHY and a +remote PHY is the exactly the same. + +There is no concept of a SAS port in this code, users can see what PHYs +form a wide port based on the port_identifier attribute, which is the +same for all PHYs in a port. + +.. kernel-doc:: drivers/scsi/scsi_transport_sas.c + :export: + +SATA transport class +~~~~~~~~~~~~~~~~~~~~ + +The SATA transport is handled by libata, which has its own book of +documentation in this directory. + +Parallel SCSI (SPI) transport class +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The file drivers/scsi/scsi_transport_spi.c defines transport +attributes for traditional (fast/wide/ultra) SCSI busses. + +.. kernel-doc:: drivers/scsi/scsi_transport_spi.c + :export: + +SCSI RDMA (SRP) transport class +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The file drivers/scsi/scsi_transport_srp.c defines transport +attributes for SCSI over Remote Direct Memory Access. + +.. kernel-doc:: drivers/scsi/scsi_transport_srp.c + :export: + +SCSI lower layer +================ + +Host Bus Adapter transport types +-------------------------------- + +Many modern device controllers use the SCSI command set as a protocol to +communicate with their devices through many different types of physical +connections. + +In SCSI language a bus capable of carrying SCSI commands is called a +"transport", and a controller connecting to such a bus is called a "host +bus adapter" (HBA). + +Debug transport +~~~~~~~~~~~~~~~ + +The file drivers/scsi/scsi_debug.c simulates a host adapter with a +variable number of disks (or disk like devices) attached, sharing a +common amount of RAM. Does a lot of checking to make sure that we are +not getting blocks mixed up, and panics the kernel if anything out of +the ordinary is seen. + +To be more realistic, the simulated devices have the transport +attributes of SAS disks. + +For documentation see http://sg.danny.cz/sg/sdebug26.html + +todo +~~~~ + +Parallel (fast/wide/ultra) SCSI, USB, SATA, SAS, Fibre Channel, +FireWire, ATAPI devices, Infiniband, I20, iSCSI, Parallel ports, +netlink... diff --git a/Documentation/driver-api/usb/dwc3.rst b/Documentation/driver-api/usb/dwc3.rst new file mode 100644 index 000000000000..c3dc84a50ce5 --- /dev/null +++ b/Documentation/driver-api/usb/dwc3.rst @@ -0,0 +1,712 @@ +=============================================================== +Synopsys DesignWare Core SuperSpeed USB 3.0 Controller +=============================================================== + +:Author: Felipe Balbi <felipe.balbi@linux.intel.com> +:Date: April 2017 + +Introduction +============ + +The *Synopsys DesignWare Core SuperSpeed USB 3.0 Controller* +(hereinafter referred to as *DWC3*) is a USB SuperSpeed compliant +controller which can be configured in one of 4 ways: + + 1. Peripheral-only configuration + 2. Host-only configuration + 3. Dual-Role configuration + 4. Hub configuration + +Linux currently supports several versions of this controller. In all +likelyhood, the version in your SoC is already supported. At the time +of this writing, known tested versions range from 2.02a to 3.10a. As a +rule of thumb, anything above 2.02a should work reliably well. + +Currently, we have many known users for this driver. In alphabetical +order: + + 1. Cavium + 2. Intel Corporation + 3. Qualcomm + 4. Rockchip + 5. ST + 6. Samsung + 7. Texas Instruments + 8. Xilinx + +Summary of Features +====================== + +For details about features supported by your version of DWC3, consult +your IP team and/or *Synopsys DesignWare Core SuperSpeed USB 3.0 +Controller Databook*. Following is a list of features supported by the +driver at the time of this writing: + + 1. Up to 16 bidirectional endpoints (including the control + pipe - ep0) + 2. Flexible endpoint configuration + 3. Simultaneous IN and OUT transfer support + 4. Scatter-list support + 5. Up to 256 TRBs [#trb]_ per endpoint + 6. Support for all transfer types (*Control*, *Bulk*, + *Interrupt*, and *Isochronous*) + 7. SuperSpeed Bulk Streams + 8. Link Power Management + 9. Trace Events for debugging + 10. DebugFS [#debugfs]_ interface + +These features have all been exercised with many of the **in-tree** +gadget drivers. We have verified both *ConfigFS* [#configfs]_ and +legacy gadget drivers. + +Driver Design +============== + +The DWC3 driver sits on the *drivers/usb/dwc3/* directory. All files +related to this driver are in this one directory. This makes it easy +for new-comers to read the code and understand how it behaves. + +Because of DWC3's configuration flexibility, the driver is a little +complex in some places but it should be rather straightforward to +understand. + +The biggest part of the driver refers to the Gadget API. + +Known Limitations +=================== + +Like any other HW, DWC3 has its own set of limitations. To avoid +constant questions about such problems, we decided to document them +here and have a single location to where we could point users. + +OUT Transfer Size Requirements +--------------------------------- + +According to Synopsys Databook, all OUT transfer TRBs [#trb]_ must +have their *size* field set to a value which is integer divisible by +the endpoint's *wMaxPacketSize*. This means that *e.g.* in order to +receive a Mass Storage *CBW* [#cbw]_, req->length must either be set +to a value that's divisible by *wMaxPacketSize* (1024 on SuperSpeed, +512 on HighSpeed, etc), or DWC3 driver must add a Chained TRB pointing +to a throw-away buffer for the remaining length. Without this, OUT +transfers will **NOT** start. + +Note that as of this writing, this won't be a problem because DWC3 is +fully capable of appending a chained TRB for the remaining length and +completely hide this detail from the gadget driver. It's still worth +mentioning because this seems to be the largest source of queries +about DWC3 and *non-working transfers*. + +TRB Ring Size Limitation +------------------------- + +We, currently, have a hard limit of 256 TRBs [#trb]_ per endpoint, +with the last TRB being a Link TRB [#link_trb]_ pointing back to the +first. This limit is arbitrary but it has the benefit of adding up to +exactly 4096 bytes, or 1 Page. + +DWC3 driver will try its best to cope with more than 255 requests and, +for the most part, it should work normally. However this is not +something that has been exercised very frequently. If you experience +any problems, see section **Reporting Bugs** below. + +Reporting Bugs +================ + +Whenever you encounter a problem with DWC3, first and foremost you +should make sure that: + + 1. You're running latest tag from `Linus' tree`_ + 2. You can reproduce the error without any out-of-tree changes + to DWC3 + 3. You have checked that it's not a fault on the host machine + +After all these are verified, then here's how to capture enough +information so we can be of any help to you. + +Required Information +--------------------- + +DWC3 relies exclusively on Trace Events for debugging. Everything is +exposed there, with some extra bits being exposed to DebugFS +[#debugfs]_. + +In order to capture DWC3's Trace Events you should run the following +commands **before** plugging the USB cable to a host machine: + +.. code-block:: sh + + # mkdir -p /d + # mkdir -p /t + # mount -t debugfs none /d + # mount -t tracefs none /t + # echo 81920 > /t/buffer_size_kb + # echo 1 > /t/events/dwc3/enable + +After this is done, you can connect your USB cable and reproduce the +problem. As soon as the fault is reproduced, make a copy of files +``trace`` and ``regdump``, like so: + +.. code-block:: sh + + # cp /t/trace /root/trace.txt + # cat /d/*dwc3*/regdump > /root/regdump.txt + +Make sure to compress ``trace.txt`` and ``regdump.txt`` in a tarball +and email it to `me`_ with `linux-usb`_ in Cc. If you want to be extra +sure that I'll help you, write your subject line in the following +format: + + **[BUG REPORT] usb: dwc3: Bug while doing XYZ** + +On the email body, make sure to detail what you doing, which gadget +driver you were using, how to reproduce the problem, what SoC you're +using, which OS (and its version) was running on the Host machine. + +With all this information, we should be able to understand what's +going on and be helpful to you. + +Debugging +=========== + +First and foremost a disclaimer:: + + DISCLAIMER: The information available on DebugFS and/or TraceFS can + change at any time at any Major Linux Kernel Release. If writing + scripts, do **NOT** assume information to be available in the + current format. + +With that out of the way, let's carry on. + +If you're willing to debug your own problem, you deserve a round of +applause :-) + +Anyway, there isn't much to say here other than Trace Events will be +really helpful in figuring out issues with DWC3. Also, access to +Synopsys Databook will be **really** valuable in this case. + +A USB Sniffer can be helpful at times but it's not entirely required, +there's a lot that can be understood without looking at the wire. + +Feel free to email `me`_ and Cc `linux-usb`_ if you need any help. + +``DebugFS`` +------------- + +``DebugFS`` is very good for gathering snapshots of what's going on +with DWC3 and/or any endpoint. + +On DWC3's ``DebugFS`` directory, you will find the following files and +directories: + +``ep[0..15]{in,out}/`` +``link_state`` +``regdump`` +``testmode`` + +``link_state`` +`````````````` + +When read, ``link_state`` will print out one of ``U0``, ``U1``, +``U2``, ``U3``, ``SS.Disabled``, ``RX.Detect``, ``SS.Inactive``, +``Polling``, ``Recovery``, ``Hot Reset``, ``Compliance``, +``Loopback``, ``Reset``, ``Resume`` or ``UNKNOWN link state``. + +This file can also be written to in order to force link to one of the +states above. + +``regdump`` +````````````` + +File name is self-explanatory. When read, ``regdump`` will print out a +register dump of DWC3. Note that this file can be grepped to find the +information you want. + +``testmode`` +`````````````` + +When read, ``testmode`` will print out a name of one of the specified +USB 2.0 Testmodes (``test_j``, ``test_k``, ``test_se0_nak``, +``test_packet``, ``test_force_enable``) or the string ``no test`` in +case no tests are currently being executed. + +In order to start any of these test modes, the same strings can be +written to the file and DWC3 will enter the requested test mode. + + +``ep[0..15]{in,out}`` +`````````````````````` + +For each endpoint we expose one directory following the naming +convention ``ep$num$dir`` *(ep0in, ep0out, ep1in, ...)*. Inside each +of these directories you will find the following files: + +``descriptor_fetch_queue`` +``event_queue`` +``rx_fifo_queue`` +``rx_info_queue`` +``rx_request_queue`` +``transfer_type`` +``trb_ring`` +``tx_fifo_queue`` +``tx_request_queue`` + +With access to Synopsys Databook, you can decode the information on +them. + +``transfer_type`` +~~~~~~~~~~~~~~~~~~ + +When read, ``transfer_type`` will print out one of ``control``, +``bulk``, ``interrupt`` or ``isochronous`` depending on what the +endpoint descriptor says. If the endpoint hasn't been enabled yet, it +will print ``--``. + +``trb_ring`` +~~~~~~~~~~~~~ + +When read, ``trb_ring`` will print out details about all TRBs on the +ring. It will also tell you where our enqueue and dequeue pointers are +located in the ring: + +.. code-block:: sh + + buffer_addr,size,type,ioc,isp_imi,csp,chn,lst,hwo + 000000002c754000,481,normal,1,0,1,0,0,0 + 000000002c75c000,481,normal,1,0,1,0,0,0 + 000000002c780000,481,normal,1,0,1,0,0,0 + 000000002c788000,481,normal,1,0,1,0,0,0 + 000000002c78c000,481,normal,1,0,1,0,0,0 + 000000002c754000,481,normal,1,0,1,0,0,0 + 000000002c75c000,481,normal,1,0,1,0,0,0 + 000000002c784000,481,normal,1,0,1,0,0,0 + 000000002c788000,481,normal,1,0,1,0,0,0 + 000000002c78c000,481,normal,1,0,1,0,0,0 + 000000002c790000,481,normal,1,0,1,0,0,0 + 000000002c758000,481,normal,1,0,1,0,0,0 + 000000002c780000,481,normal,1,0,1,0,0,0 + 000000002c788000,481,normal,1,0,1,0,0,0 + 000000002c790000,481,normal,1,0,1,0,0,0 + 000000002c758000,481,normal,1,0,1,0,0,0 + 000000002c780000,481,normal,1,0,1,0,0,0 + 000000002c784000,481,normal,1,0,1,0,0,0 + 000000002c788000,481,normal,1,0,1,0,0,0 + 000000002c78c000,481,normal,1,0,1,0,0,0 + 000000002c754000,481,normal,1,0,1,0,0,0 + 000000002c758000,481,normal,1,0,1,0,0,0 + 000000002c780000,481,normal,1,0,1,0,0,0 + 000000002c784000,481,normal,1,0,1,0,0,0 + 000000002c78c000,481,normal,1,0,1,0,0,0 + 000000002c790000,481,normal,1,0,1,0,0,0 + 000000002c758000,481,normal,1,0,1,0,0,0 + 000000002c780000,481,normal,1,0,1,0,0,0 + 000000002c788000,481,normal,1,0,1,0,0,0 + 000000002c790000,481,normal,1,0,1,0,0,0 + 000000002c758000,481,normal,1,0,1,0,0,0 + 000000002c780000,481,normal,1,0,1,0,0,0 + 000000002c788000,481,normal,1,0,1,0,0,0 + 000000002c790000,481,normal,1,0,1,0,0,0 + 000000002c758000,481,normal,1,0,1,0,0,0 + 000000002c780000,481,normal,1,0,1,0,0,0 + 000000002c788000,481,normal,1,0,1,0,0,0 + 000000002c790000,481,normal,1,0,1,0,0,0 + 000000002c758000,481,normal,1,0,1,0,0,0 + 000000002c780000,481,normal,1,0,1,0,0,0 + 000000002c788000,481,normal,1,0,1,0,0,0 + 000000002c790000,481,normal,1,0,1,0,0,0 + 000000002c758000,481,normal,1,0,1,0,0,0 + 000000002c780000,481,normal,1,0,1,0,0,0 + 000000002c788000,481,normal,1,0,1,0,0,0 + 000000002c790000,481,normal,1,0,1,0,0,0 + 000000002c758000,481,normal,1,0,1,0,0,0 + 000000002c780000,481,normal,1,0,1,0,0,0 + 000000002c788000,481,normal,1,0,1,0,0,0 + 000000002c790000,481,normal,1,0,1,0,0,0 + 000000002c758000,481,normal,1,0,1,0,0,0 + 000000002c780000,481,normal,1,0,1,0,0,0 + 000000002c788000,481,normal,1,0,1,0,0,0 + 000000002c790000,481,normal,1,0,1,0,0,0 + 000000002c758000,481,normal,1,0,1,0,0,0 + 000000002c780000,481,normal,1,0,1,0,0,0 + 000000002c78c000,481,normal,1,0,1,0,0,0 + 000000002c784000,481,normal,1,0,1,0,0,0 + 000000002c788000,481,normal,1,0,1,0,0,0 + 000000002c78c000,481,normal,1,0,1,0,0,0 + 000000002c754000,481,normal,1,0,1,0,0,0 + 000000002c758000,481,normal,1,0,1,0,0,0 + 000000002c780000,481,normal,1,0,1,0,0,0 + 000000002c788000,481,normal,1,0,1,0,0,0 + 000000002c790000,481,normal,1,0,1,0,0,0 + 000000002c758000,481,normal,1,0,1,0,0,0 + 000000002c780000,481,normal,1,0,1,0,0,0 + 000000002c758000,481,normal,1,0,1,0,0,0 + 000000002c780000,481,normal,1,0,1,0,0,0 + 000000002c78c000,481,normal,1,0,1,0,0,0 + 000000002c75c000,481,normal,1,0,1,0,0,0 + 000000002c78c000,481,normal,1,0,1,0,0,0 + 000000002c780000,481,normal,1,0,1,0,0,0 + 000000002c754000,481,normal,1,0,1,0,0,0 + 000000002c788000,481,normal,1,0,1,0,0,0 + 000000002c754000,481,normal,1,0,1,0,0,0 + 000000002c780000,481,normal,1,0,1,0,0,0 + 000000002c788000,481,normal,1,0,1,0,0,0 + 000000002c78c000,481,normal,1,0,1,0,0,0 + 000000002c790000,481,normal,1,0,1,0,0,0 + 000000002c754000,481,normal,1,0,1,0,0,0 + 000000002c758000,481,normal,1,0,1,0,0,0 + 000000002c75c000,481,normal,1,0,1,0,0,0 + 000000002c780000,481,normal,1,0,1,0,0,0 + 000000002c784000,481,normal,1,0,1,0,0,0 + 000000002c788000,481,normal,1,0,1,0,0,0 + 000000002c78c000,481,normal,1,0,1,0,0,0 + 000000002c790000,481,normal,1,0,1,0,0,0 + 000000002c754000,481,normal,1,0,1,0,0,0 + 000000002c758000,481,normal,1,0,1,0,0,0 + 000000002c75c000,512,normal,1,0,1,0,0,1 D + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 E + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 0000000000000000,0,UNKNOWN,0,0,0,0,0,0 + 00000000381ab000,0,link,0,0,0,0,0,1 + + +Trace Events +------------- + +DWC3 also provides several trace events which help us gathering +information about the behavior of the driver during runtime. + +In order to use these events, you must enable ``CONFIG_FTRACE`` in +your kernel config. + +For details about how enable DWC3 events, see section **Reporting +Bugs**. + +The following subsections will give details about each Event Class and +each Event defined by DWC3. + +MMIO +``````` + +It is sometimes useful to look at every MMIO access when looking for +bugs. Because of that, DWC3 offers two Trace Events (one for +dwc3_readl() and one for dwc3_writel()). ``TP_printk`` follows:: + + TP_printk("addr %p value %08x", __entry->base + __entry->offset, + __entry->value) + +Interrupt Events +```````````````` + +Every IRQ event can be logged and decoded into a human readable +string. Because every event will be different, we don't give an +example other than the ``TP_printk`` format used:: + + TP_printk("event (%08x): %s", __entry->event, + dwc3_decode_event(__entry->event, __entry->ep0state)) + +Control Request +````````````````` + +Every USB Control Request can be logged to the trace buffer. The +output format is:: + + TP_printk("%s", dwc3_decode_ctrl(__entry->bRequestType, + __entry->bRequest, __entry->wValue, + __entry->wIndex, __entry->wLength) + ) + +Note that Standard Control Requests will be decoded into +human-readable strings with their respective arguments. Class and +Vendor requests will be printed out a sequence of 8 bytes in hex +format. + +Lifetime of a ``struct usb_request`` +``````````````````````````````````````` + +The entire lifetime of a ``struct usb_request`` can be tracked on the +trace buffer. We have one event for each of allocation, free, +queueing, dequeueing, and giveback. Output format is:: + + TP_printk("%s: req %p length %u/%u %s%s%s ==> %d", + __get_str(name), __entry->req, __entry->actual, __entry->length, + __entry->zero ? "Z" : "z", + __entry->short_not_ok ? "S" : "s", + __entry->no_interrupt ? "i" : "I", + __entry->status + ) + +Generic Commands +```````````````````` + +We can log and decode every Generic Command with its completion +code. Format is:: + + TP_printk("cmd '%s' [%x] param %08x --> status: %s", + dwc3_gadget_generic_cmd_string(__entry->cmd), + __entry->cmd, __entry->param, + dwc3_gadget_generic_cmd_status_string(__entry->status) + ) + +Endpoint Commands +```````````````````` + +Endpoints commands can also be logged together with completion +code. Format is:: + + TP_printk("%s: cmd '%s' [%d] params %08x %08x %08x --> status: %s", + __get_str(name), dwc3_gadget_ep_cmd_string(__entry->cmd), + __entry->cmd, __entry->param0, + __entry->param1, __entry->param2, + dwc3_ep_cmd_status_string(__entry->cmd_status) + ) + +Lifetime of a ``TRB`` +`````````````````````` + +A ``TRB`` Lifetime is simple. We are either preparing a ``TRB`` or +completing it. With these two events, we can see how a ``TRB`` changes +over time. Format is:: + + TP_printk("%s: %d/%d trb %p buf %08x%08x size %s%d ctrl %08x (%c%c%c%c:%c%c:%s)", + __get_str(name), __entry->queued, __entry->allocated, + __entry->trb, __entry->bph, __entry->bpl, + ({char *s; + int pcm = ((__entry->size >> 24) & 3) + 1; + switch (__entry->type) { + case USB_ENDPOINT_XFER_INT: + case USB_ENDPOINT_XFER_ISOC: + switch (pcm) { + case 1: + s = "1x "; + break; + case 2: + s = "2x "; + break; + case 3: + s = "3x "; + break; + } + default: + s = ""; + } s; }), + DWC3_TRB_SIZE_LENGTH(__entry->size), __entry->ctrl, + __entry->ctrl & DWC3_TRB_CTRL_HWO ? 'H' : 'h', + __entry->ctrl & DWC3_TRB_CTRL_LST ? 'L' : 'l', + __entry->ctrl & DWC3_TRB_CTRL_CHN ? 'C' : 'c', + __entry->ctrl & DWC3_TRB_CTRL_CSP ? 'S' : 's', + __entry->ctrl & DWC3_TRB_CTRL_ISP_IMI ? 'S' : 's', + __entry->ctrl & DWC3_TRB_CTRL_IOC ? 'C' : 'c', + dwc3_trb_type_string(DWC3_TRBCTL_TYPE(__entry->ctrl)) + ) + +Lifetime of an Endpoint +``````````````````````` + +And endpoint's lifetime is summarized with enable and disable +operations, both of which can be traced. Format is:: + + TP_printk("%s: mps %d/%d streams %d burst %d ring %d/%d flags %c:%c%c%c%c%c:%c:%c", + __get_str(name), __entry->maxpacket, + __entry->maxpacket_limit, __entry->max_streams, + __entry->maxburst, __entry->trb_enqueue, + __entry->trb_dequeue, + __entry->flags & DWC3_EP_ENABLED ? 'E' : 'e', + __entry->flags & DWC3_EP_STALL ? 'S' : 's', + __entry->flags & DWC3_EP_WEDGE ? 'W' : 'w', + __entry->flags & DWC3_EP_BUSY ? 'B' : 'b', + __entry->flags & DWC3_EP_PENDING_REQUEST ? 'P' : 'p', + __entry->flags & DWC3_EP_MISSED_ISOC ? 'M' : 'm', + __entry->flags & DWC3_EP_END_TRANSFER_PENDING ? 'E' : 'e', + __entry->direction ? '<' : '>' + ) + + +Structures, Methods and Definitions +==================================== + +.. kernel-doc:: drivers/usb/dwc3/core.h + :doc: main data structures + :internal: + +.. kernel-doc:: drivers/usb/dwc3/gadget.h + :doc: gadget-only helpers + :internal: + +.. kernel-doc:: drivers/usb/dwc3/gadget.c + :doc: gadget-side implementation + :internal: + +.. kernel-doc:: drivers/usb/dwc3/core.c + :doc: core driver (probe, PM, etc) + :internal: + +.. [#trb] Transfer Request Block +.. [#link_trb] Transfer Request Block pointing to another Transfer + Request Block. +.. [#debugfs] The Debug File System +.. [#configfs] The Config File System +.. [#cbw] Command Block Wrapper +.. _Linus' tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/ +.. _me: felipe.balbi@linux.intel.com +.. _linux-usb: linux-usb@vger.kernel.org diff --git a/Documentation/driver-api/usb/index.rst b/Documentation/driver-api/usb/index.rst index 1bf64edc8c8a..8fe995a1ec94 100644 --- a/Documentation/driver-api/usb/index.rst +++ b/Documentation/driver-api/usb/index.rst @@ -16,7 +16,10 @@ Linux USB API persist error-codes writing_usb_driver + dwc3 writing_musb_glue_layer + typec + usb3-debug-port .. only:: subproject and html diff --git a/Documentation/usb/typec.rst b/Documentation/driver-api/usb/typec.rst index 8a7249f2ff04..8a7249f2ff04 100644 --- a/Documentation/usb/typec.rst +++ b/Documentation/driver-api/usb/typec.rst diff --git a/Documentation/usb/usb3-debug-port.rst b/Documentation/driver-api/usb/usb3-debug-port.rst index feb1a36a65b7..feb1a36a65b7 100644 --- a/Documentation/usb/usb3-debug-port.rst +++ b/Documentation/driver-api/usb/usb3-debug-port.rst diff --git a/Documentation/driver-api/w1.rst b/Documentation/driver-api/w1.rst new file mode 100644 index 000000000000..9963cca788a1 --- /dev/null +++ b/Documentation/driver-api/w1.rst @@ -0,0 +1,70 @@ +====================== +W1: Dallas' 1-wire bus +====================== + +:Author: David Fries + +W1 API internal to the kernel +============================= + +W1 API internal to the kernel +----------------------------- + +include/linux/w1.h +~~~~~~~~~~~~~~~~~~ + +W1 kernel API functions. + +.. kernel-doc:: include/linux/w1.h + :internal: + +drivers/w1/w1.c +~~~~~~~~~~~~~~~ + +W1 core functions. + +.. kernel-doc:: drivers/w1/w1.c + :internal: + +drivers/w1/w1_family.c +~~~~~~~~~~~~~~~~~~~~~~~ + +Allows registering device family operations. + +.. kernel-doc:: drivers/w1/w1_family.c + :export: + +drivers/w1/w1_internal.h +~~~~~~~~~~~~~~~~~~~~~~~~ + +W1 internal initialization for master devices. + +.. kernel-doc:: drivers/w1/w1_internal.h + :internal: + +drivers/w1/w1_int.c +~~~~~~~~~~~~~~~~~~~~ + +W1 internal initialization for master devices. + +.. kernel-doc:: drivers/w1/w1_int.c + :export: + +drivers/w1/w1_netlink.h +~~~~~~~~~~~~~~~~~~~~~~~~ + +W1 external netlink API structures and commands. + +.. kernel-doc:: drivers/w1/w1_netlink.h + :internal: + +drivers/w1/w1_io.c +~~~~~~~~~~~~~~~~~~~ + +W1 input/output. + +.. kernel-doc:: drivers/w1/w1_io.c + :export: + +.. kernel-doc:: drivers/w1/w1_io.c + :internal: diff --git a/Documentation/driver-model/devres.txt b/Documentation/driver-model/devres.txt index e72587fe477d..30e04f7a690d 100644 --- a/Documentation/driver-model/devres.txt +++ b/Documentation/driver-model/devres.txt @@ -240,10 +240,9 @@ CLOCK DMA dmam_alloc_coherent() - dmam_alloc_noncoherent() + dmam_alloc_attrs() dmam_declare_coherent_memory() dmam_free_coherent() - dmam_free_noncoherent() dmam_pool_create() dmam_pool_destroy() @@ -311,6 +310,8 @@ IRQ devm_irq_alloc_desc_at() devm_irq_alloc_desc_from() devm_irq_alloc_descs_from() + devm_irq_alloc_generic_chip() + devm_irq_setup_generic_chip() LED devm_led_classdev_register() @@ -335,13 +336,19 @@ MEM devm_kzalloc() MFD - devm_mfd_add_devices() + devm_mfd_add_devices() + +MUX + devm_mux_chip_alloc() + devm_mux_chip_register() + devm_mux_control_get() PER-CPU MEM devm_alloc_percpu() devm_free_percpu() PCI + devm_pci_alloc_host_bridge() : managed PCI host bridge allocation devm_pci_remap_cfgspace() : ioremap PCI configuration space devm_pci_remap_cfg_resource() : ioremap PCI configuration space resource pcim_enable_device() : after success, all PCI ops become managed diff --git a/Documentation/efi-stub.txt b/Documentation/efi-stub.txt index e15746988261..41df801f9a50 100644 --- a/Documentation/efi-stub.txt +++ b/Documentation/efi-stub.txt @@ -1,5 +1,6 @@ - The EFI Boot Stub - --------------------------- +================= +The EFI Boot Stub +================= On the x86 and ARM platforms, a kernel zImage/bzImage can masquerade as a PE/COFF image, thereby convincing EFI firmware loaders to load @@ -25,7 +26,8 @@ a certain sense it *IS* the boot loader. The EFI boot stub is enabled with the CONFIG_EFI_STUB kernel option. -**** How to install bzImage.efi +How to install bzImage.efi +-------------------------- The bzImage located in arch/x86/boot/bzImage must be copied to the EFI System Partition (ESP) and renamed with the extension ".efi". Without @@ -37,14 +39,16 @@ may not need to be renamed. Similarly for arm64, arch/arm64/boot/Image should be copied but not necessarily renamed. -**** Passing kernel parameters from the EFI shell +Passing kernel parameters from the EFI shell +-------------------------------------------- -Arguments to the kernel can be passed after bzImage.efi, e.g. +Arguments to the kernel can be passed after bzImage.efi, e.g.:: fs0:> bzImage.efi console=ttyS0 root=/dev/sda4 -**** The "initrd=" option +The "initrd=" option +-------------------- Like most boot loaders, the EFI stub allows the user to specify multiple initrd files using the "initrd=" option. This is the only EFI @@ -54,9 +58,9 @@ kernel when it boots. The path to the initrd file must be an absolute path from the beginning of the ESP, relative path names do not work. Also, the path is an EFI-style path and directory elements must be separated with -backslashes (\). For example, given the following directory layout, +backslashes (\). For example, given the following directory layout:: -fs0:> + fs0:> Kernels\ bzImage.efi initrd-large.img @@ -66,7 +70,7 @@ fs0:> initrd-medium.img to boot with the initrd-large.img file if the current working -directory is fs0:\Kernels, the following command must be used, +directory is fs0:\Kernels, the following command must be used:: fs0:\Kernels> bzImage.efi initrd=\Kernels\initrd-large.img @@ -76,7 +80,8 @@ which understands relative paths, whereas the rest of the command line is passed to bzImage.efi. -**** The "dtb=" option +The "dtb=" option +----------------- For the ARM and arm64 architectures, we also need to be able to provide a device tree to the kernel. This is done with the "dtb=" command line option, diff --git a/Documentation/eisa.txt b/Documentation/eisa.txt index a55e4910924e..2806e5544e43 100644 --- a/Documentation/eisa.txt +++ b/Documentation/eisa.txt @@ -1,4 +1,8 @@ -EISA bus support (Marc Zyngier <maz@wild-wind.fr.eu.org>) +================ +EISA bus support +================ + +:Author: Marc Zyngier <maz@wild-wind.fr.eu.org> This document groups random notes about porting EISA drivers to the new EISA/sysfs API. @@ -14,168 +18,189 @@ detection code is generally also used to probe ISA cards). Moreover, most EISA drivers are among the oldest Linux drivers so, as you can imagine, some dust has settled here over the years. -The EISA infrastructure is made up of three parts : +The EISA infrastructure is made up of three parts: - The bus code implements most of the generic code. It is shared - among all the architectures that the EISA code runs on. It - implements bus probing (detecting EISA cards available on the bus), - allocates I/O resources, allows fancy naming through sysfs, and - offers interfaces for driver to register. + among all the architectures that the EISA code runs on. It + implements bus probing (detecting EISA cards available on the bus), + allocates I/O resources, allows fancy naming through sysfs, and + offers interfaces for driver to register. - The bus root driver implements the glue between the bus hardware - and the generic bus code. It is responsible for discovering the - device implementing the bus, and setting it up to be latter probed - by the bus code. This can go from something as simple as reserving - an I/O region on x86, to the rather more complex, like the hppa - EISA code. This is the part to implement in order to have EISA - running on an "new" platform. + and the generic bus code. It is responsible for discovering the + device implementing the bus, and setting it up to be latter probed + by the bus code. This can go from something as simple as reserving + an I/O region on x86, to the rather more complex, like the hppa + EISA code. This is the part to implement in order to have EISA + running on an "new" platform. - The driver offers the bus a list of devices that it manages, and - implements the necessary callbacks to probe and release devices - whenever told to. + implements the necessary callbacks to probe and release devices + whenever told to. Every function/structure below lives in <linux/eisa.h>, which depends heavily on <linux/device.h>. -** Bus root driver : +Bus root driver +=============== + +:: -int eisa_root_register (struct eisa_root_device *root); + int eisa_root_register (struct eisa_root_device *root); The eisa_root_register function is used to declare a device as the root of an EISA bus. The eisa_root_device structure holds a reference -to this device, as well as some parameters for probing purposes. - -struct eisa_root_device { - struct device *dev; /* Pointer to bridge device */ - struct resource *res; - unsigned long bus_base_addr; - int slots; /* Max slot number */ - int force_probe; /* Probe even when no slot 0 */ - u64 dma_mask; /* from bridge device */ - int bus_nr; /* Set by eisa_root_register */ - struct resource eisa_root_res; /* ditto */ -}; - -node : used for eisa_root_register internal purpose -dev : pointer to the root device -res : root device I/O resource -bus_base_addr : slot 0 address on this bus -slots : max slot number to probe -force_probe : Probe even when slot 0 is empty (no EISA mainboard) -dma_mask : Default DMA mask. Usually the bridge device dma_mask. -bus_nr : unique bus id, set by eisa_root_register - -** Driver : - -int eisa_driver_register (struct eisa_driver *edrv); -void eisa_driver_unregister (struct eisa_driver *edrv); +to this device, as well as some parameters for probing purposes:: + + struct eisa_root_device { + struct device *dev; /* Pointer to bridge device */ + struct resource *res; + unsigned long bus_base_addr; + int slots; /* Max slot number */ + int force_probe; /* Probe even when no slot 0 */ + u64 dma_mask; /* from bridge device */ + int bus_nr; /* Set by eisa_root_register */ + struct resource eisa_root_res; /* ditto */ + }; + +============= ====================================================== +node used for eisa_root_register internal purpose +dev pointer to the root device +res root device I/O resource +bus_base_addr slot 0 address on this bus +slots max slot number to probe +force_probe Probe even when slot 0 is empty (no EISA mainboard) +dma_mask Default DMA mask. Usually the bridge device dma_mask. +bus_nr unique bus id, set by eisa_root_register +============= ====================================================== + +Driver +====== + +:: + + int eisa_driver_register (struct eisa_driver *edrv); + void eisa_driver_unregister (struct eisa_driver *edrv); Clear enough ? -struct eisa_device_id { - char sig[EISA_SIG_LEN]; - unsigned long driver_data; -}; - -struct eisa_driver { - const struct eisa_device_id *id_table; - struct device_driver driver; -}; - -id_table : an array of NULL terminated EISA id strings, - followed by an empty string. Each string can - optionally be paired with a driver-dependent value - (driver_data). - -driver : a generic driver, such as described in - Documentation/driver-model/driver.txt. Only .name, - .probe and .remove members are mandatory. - -An example is the 3c59x driver : - -static struct eisa_device_id vortex_eisa_ids[] = { - { "TCM5920", EISA_3C592_OFFSET }, - { "TCM5970", EISA_3C597_OFFSET }, - { "" } -}; - -static struct eisa_driver vortex_eisa_driver = { - .id_table = vortex_eisa_ids, - .driver = { - .name = "3c59x", - .probe = vortex_eisa_probe, - .remove = vortex_eisa_remove - } -}; - -** Device : +:: + + struct eisa_device_id { + char sig[EISA_SIG_LEN]; + unsigned long driver_data; + }; + + struct eisa_driver { + const struct eisa_device_id *id_table; + struct device_driver driver; + }; + +=============== ==================================================== +id_table an array of NULL terminated EISA id strings, + followed by an empty string. Each string can + optionally be paired with a driver-dependent value + (driver_data). + +driver a generic driver, such as described in + Documentation/driver-model/driver.txt. Only .name, + .probe and .remove members are mandatory. +=============== ==================================================== + +An example is the 3c59x driver:: + + static struct eisa_device_id vortex_eisa_ids[] = { + { "TCM5920", EISA_3C592_OFFSET }, + { "TCM5970", EISA_3C597_OFFSET }, + { "" } + }; + + static struct eisa_driver vortex_eisa_driver = { + .id_table = vortex_eisa_ids, + .driver = { + .name = "3c59x", + .probe = vortex_eisa_probe, + .remove = vortex_eisa_remove + } + }; + +Device +====== The sysfs framework calls .probe and .remove functions upon device discovery and removal (note that the .remove function is only called when driver is built as a module). Both functions are passed a pointer to a 'struct device', which is -encapsulated in a 'struct eisa_device' described as follows : - -struct eisa_device { - struct eisa_device_id id; - int slot; - int state; - unsigned long base_addr; - struct resource res[EISA_MAX_RESOURCES]; - u64 dma_mask; - struct device dev; /* generic device */ -}; - -id : EISA id, as read from device. id.driver_data is set from the - matching driver EISA id. -slot : slot number which the device was detected on -state : set of flags indicating the state of the device. Current - flags are EISA_CONFIG_ENABLED and EISA_CONFIG_FORCED. -res : set of four 256 bytes I/O regions allocated to this device -dma_mask: DMA mask set from the parent device. -dev : generic device (see Documentation/driver-model/device.txt) +encapsulated in a 'struct eisa_device' described as follows:: + + struct eisa_device { + struct eisa_device_id id; + int slot; + int state; + unsigned long base_addr; + struct resource res[EISA_MAX_RESOURCES]; + u64 dma_mask; + struct device dev; /* generic device */ + }; + +======== ============================================================ +id EISA id, as read from device. id.driver_data is set from the + matching driver EISA id. +slot slot number which the device was detected on +state set of flags indicating the state of the device. Current + flags are EISA_CONFIG_ENABLED and EISA_CONFIG_FORCED. +res set of four 256 bytes I/O regions allocated to this device +dma_mask DMA mask set from the parent device. +dev generic device (see Documentation/driver-model/device.txt) +======== ============================================================ You can get the 'struct eisa_device' from 'struct device' using the 'to_eisa_device' macro. -** Misc stuff : +Misc stuff +========== + +:: -void eisa_set_drvdata (struct eisa_device *edev, void *data); + void eisa_set_drvdata (struct eisa_device *edev, void *data); Stores data into the device's driver_data area. -void *eisa_get_drvdata (struct eisa_device *edev): +:: + + void *eisa_get_drvdata (struct eisa_device *edev): Gets the pointer previously stored into the device's driver_data area. -int eisa_get_region_index (void *addr); +:: + + int eisa_get_region_index (void *addr); Returns the region number (0 <= x < EISA_MAX_RESOURCES) of a given address. -** Kernel parameters : +Kernel parameters +================= -eisa_bus.enable_dev : +eisa_bus.enable_dev + A comma-separated list of slots to be enabled, even if the firmware + set the card as disabled. The driver must be able to properly + initialize the device in such conditions. -A comma-separated list of slots to be enabled, even if the firmware -set the card as disabled. The driver must be able to properly -initialize the device in such conditions. +eisa_bus.disable_dev + A comma-separated list of slots to be enabled, even if the firmware + set the card as enabled. The driver won't be called to handle this + device. -eisa_bus.disable_dev : +virtual_root.force_probe + Force the probing code to probe EISA slots even when it cannot find an + EISA compliant mainboard (nothing appears on slot 0). Defaults to 0 + (don't force), and set to 1 (force probing) when either + CONFIG_ALPHA_JENSEN or CONFIG_EISA_VLB_PRIMING are set. -A comma-separated list of slots to be enabled, even if the firmware -set the card as enabled. The driver won't be called to handle this -device. - -virtual_root.force_probe : - -Force the probing code to probe EISA slots even when it cannot find an -EISA compliant mainboard (nothing appears on slot 0). Defaults to 0 -(don't force), and set to 1 (force probing) when either -CONFIG_ALPHA_JENSEN or CONFIG_EISA_VLB_PRIMING are set. - -** Random notes : +Random notes +============ Converting an EISA driver to the new API mostly involves *deleting* code (since probing is now in the core EISA code). Unfortunately, most @@ -194,9 +219,11 @@ routine. For example, switching your favorite EISA SCSI card to the "hotplug" model is "the right thing"(tm). -** Thanks : +Thanks +====== + +I'd like to thank the following people for their help: -I'd like to thank the following people for their help : - Xavier Benigni for lending me a wonderful Alpha Jensen, - James Bottomley, Jeff Garzik for getting this stuff into the kernel, - Andries Brouwer for contributing numerous EISA ids, diff --git a/Documentation/fault-injection/fault-injection.txt b/Documentation/fault-injection/fault-injection.txt index 415484f3d59a..918972babcd8 100644 --- a/Documentation/fault-injection/fault-injection.txt +++ b/Documentation/fault-injection/fault-injection.txt @@ -134,6 +134,23 @@ use the boot option: fail_futex= mmc_core.fail_request=<interval>,<probability>,<space>,<times> +o proc entries + +- /proc/<pid>/fail-nth: +- /proc/self/task/<tid>/fail-nth: + + Write to this file of integer N makes N-th call in the task fail. + Read from this file returns a integer value. A value of '0' indicates + that the fault setup with a previous write to this file was injected. + A positive integer N indicates that the fault wasn't yet injected. + Note that this file enables all types of faults (slab, futex, etc). + This setting takes precedence over all other generic debugfs settings + like probability, interval, times, etc. But per-capability settings + (e.g. fail_futex/ignore-private) take precedence over it. + + This feature is intended for systematic testing of faults in a single + system call. See an example below. + How to add new fault injection capability ----------------------------------------- @@ -278,3 +295,65 @@ allocation failure. # env FAILCMD_TYPE=fail_page_alloc \ ./tools/testing/fault-injection/failcmd.sh --times=100 \ -- make -C tools/testing/selftests/ run_tests + +Systematic faults using fail-nth +--------------------------------- + +The following code systematically faults 0-th, 1-st, 2-nd and so on +capabilities in the socketpair() system call. + +#include <sys/types.h> +#include <sys/stat.h> +#include <sys/socket.h> +#include <sys/syscall.h> +#include <fcntl.h> +#include <unistd.h> +#include <string.h> +#include <stdlib.h> +#include <stdio.h> +#include <errno.h> + +int main() +{ + int i, err, res, fail_nth, fds[2]; + char buf[128]; + + system("echo N > /sys/kernel/debug/failslab/ignore-gfp-wait"); + sprintf(buf, "/proc/self/task/%ld/fail-nth", syscall(SYS_gettid)); + fail_nth = open(buf, O_RDWR); + for (i = 1;; i++) { + sprintf(buf, "%d", i); + write(fail_nth, buf, strlen(buf)); + res = socketpair(AF_LOCAL, SOCK_STREAM, 0, fds); + err = errno; + pread(fail_nth, buf, sizeof(buf), 0); + if (res == 0) { + close(fds[0]); + close(fds[1]); + } + printf("%d-th fault %c: res=%d/%d\n", i, atoi(buf) ? 'N' : 'Y', + res, err); + if (atoi(buf)) + break; + } + return 0; +} + +An example output: + +1-th fault Y: res=-1/23 +2-th fault Y: res=-1/23 +3-th fault Y: res=-1/12 +4-th fault Y: res=-1/12 +5-th fault Y: res=-1/23 +6-th fault Y: res=-1/23 +7-th fault Y: res=-1/23 +8-th fault Y: res=-1/12 +9-th fault Y: res=-1/12 +10-th fault Y: res=-1/12 +11-th fault Y: res=-1/12 +12-th fault Y: res=-1/12 +13-th fault Y: res=-1/12 +14-th fault Y: res=-1/12 +15-th fault Y: res=-1/12 +16-th fault N: res=0/12 diff --git a/Documentation/fb/api.txt b/Documentation/fb/api.txt index d4ff7de85700..d52cf1e3b975 100644 --- a/Documentation/fb/api.txt +++ b/Documentation/fb/api.txt @@ -289,12 +289,12 @@ the FB_CAP_FOURCC bit in the fb_fix_screeninfo capabilities field. FOURCC definitions are located in the linux/videodev2.h header. However, and despite starting with the V4L2_PIX_FMT_prefix, they are not restricted to V4L2 and don't require usage of the V4L2 subsystem. FOURCC documentation is -available in Documentation/DocBook/v4l/pixfmt.xml. +available in Documentation/media/uapi/v4l/pixfmt.rst. To select a format, applications set the grayscale field to the desired FOURCC. For YUV formats, they should also select the appropriate colorspace by setting the colorspace field to one of the colorspaces listed in linux/videodev2.h and -documented in Documentation/DocBook/v4l/colorspaces.xml. +documented in Documentation/media/uapi/v4l/colorspaces.rst. The red, green, blue and transp fields are not used with the FOURCC-based API. For forward compatibility reasons applications must zero those fields, and diff --git a/Documentation/filesystems/conf.py b/Documentation/filesystems/conf.py new file mode 100644 index 000000000000..ea44172af5c4 --- /dev/null +++ b/Documentation/filesystems/conf.py @@ -0,0 +1,10 @@ +# -*- coding: utf-8; mode: python -*- + +project = "Linux Filesystems API" + +tags.add("subproject") + +latex_documents = [ + ('index', 'filesystems.tex', project, + 'The kernel development community', 'manual'), +] diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt index 4f6531a4701b..273ccb26885e 100644 --- a/Documentation/filesystems/f2fs.txt +++ b/Documentation/filesystems/f2fs.txt @@ -155,11 +155,15 @@ noinline_data Disable the inline data feature, inline data feature is enabled by default. data_flush Enable data flushing before checkpoint in order to persist data of regular and symlink. +fault_injection=%d Enable fault injection in all supported types with + specified injection rate. mode=%s Control block allocation mode which supports "adaptive" and "lfs". In "lfs" mode, there should be no random writes towards main area. io_bits=%u Set the bit size of write IO requests. It should be set with "mode=lfs". +usrquota Enable plain user disk quota accounting. +grpquota Enable plain group disk quota accounting. ================================================================================ DEBUGFS ENTRIES diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst new file mode 100644 index 000000000000..256e10eedba4 --- /dev/null +++ b/Documentation/filesystems/index.rst @@ -0,0 +1,317 @@ +===================== +Linux Filesystems API +===================== + +The Linux VFS +============= + +The Filesystem types +-------------------- + +.. kernel-doc:: include/linux/fs.h + :internal: + +The Directory Cache +------------------- + +.. kernel-doc:: fs/dcache.c + :export: + +.. kernel-doc:: include/linux/dcache.h + :internal: + +Inode Handling +-------------- + +.. kernel-doc:: fs/inode.c + :export: + +.. kernel-doc:: fs/bad_inode.c + :export: + +Registration and Superblocks +---------------------------- + +.. kernel-doc:: fs/super.c + :export: + +File Locks +---------- + +.. kernel-doc:: fs/locks.c + :export: + +.. kernel-doc:: fs/locks.c + :internal: + +Other Functions +--------------- + +.. kernel-doc:: fs/mpage.c + :export: + +.. kernel-doc:: fs/namei.c + :export: + +.. kernel-doc:: fs/buffer.c + :export: + +.. kernel-doc:: block/bio.c + :export: + +.. kernel-doc:: fs/seq_file.c + :export: + +.. kernel-doc:: fs/filesystems.c + :export: + +.. kernel-doc:: fs/fs-writeback.c + :export: + +.. kernel-doc:: fs/block_dev.c + :export: + +The proc filesystem +=================== + +sysctl interface +---------------- + +.. kernel-doc:: kernel/sysctl.c + :export: + +proc filesystem interface +------------------------- + +.. kernel-doc:: fs/proc/base.c + :internal: + +Events based on file descriptors +================================ + +.. kernel-doc:: fs/eventfd.c + :export: + +The Filesystem for Exporting Kernel Objects +=========================================== + +.. kernel-doc:: fs/sysfs/file.c + :export: + +.. kernel-doc:: fs/sysfs/symlink.c + :export: + +The debugfs filesystem +====================== + +debugfs interface +----------------- + +.. kernel-doc:: fs/debugfs/inode.c + :export: + +.. kernel-doc:: fs/debugfs/file.c + :export: + +The Linux Journalling API +========================= + +Overview +-------- + +Details +~~~~~~~ + +The journalling layer is easy to use. You need to first of all create a +journal_t data structure. There are two calls to do this dependent on +how you decide to allocate the physical media on which the journal +resides. The :c:func:`jbd2_journal_init_inode` call is for journals stored in +filesystem inodes, or the :c:func:`jbd2_journal_init_dev` call can be used +for journal stored on a raw device (in a continuous range of blocks). A +journal_t is a typedef for a struct pointer, so when you are finally +finished make sure you call :c:func:`jbd2_journal_destroy` on it to free up +any used kernel memory. + +Once you have got your journal_t object you need to 'mount' or load the +journal file. The journalling layer expects the space for the journal +was already allocated and initialized properly by the userspace tools. +When loading the journal you must call :c:func:`jbd2_journal_load` to process +journal contents. If the client file system detects the journal contents +does not need to be processed (or even need not have valid contents), it +may call :c:func:`jbd2_journal_wipe` to clear the journal contents before +calling :c:func:`jbd2_journal_load`. + +Note that jbd2_journal_wipe(..,0) calls +:c:func:`jbd2_journal_skip_recovery` for you if it detects any outstanding +transactions in the journal and similarly :c:func:`jbd2_journal_load` will +call :c:func:`jbd2_journal_recover` if necessary. I would advise reading +:c:func:`ext4_load_journal` in fs/ext4/super.c for examples on this stage. + +Now you can go ahead and start modifying the underlying filesystem. +Almost. + +You still need to actually journal your filesystem changes, this is done +by wrapping them into transactions. Additionally you also need to wrap +the modification of each of the buffers with calls to the journal layer, +so it knows what the modifications you are actually making are. To do +this use :c:func:`jbd2_journal_start` which returns a transaction handle. + +:c:func:`jbd2_journal_start` and its counterpart :c:func:`jbd2_journal_stop`, +which indicates the end of a transaction are nestable calls, so you can +reenter a transaction if necessary, but remember you must call +:c:func:`jbd2_journal_stop` the same number of times as +:c:func:`jbd2_journal_start` before the transaction is completed (or more +accurately leaves the update phase). Ext4/VFS makes use of this feature to +simplify handling of inode dirtying, quota support, etc. + +Inside each transaction you need to wrap the modifications to the +individual buffers (blocks). Before you start to modify a buffer you +need to call :c:func:`jbd2_journal_get_create_access()` / +:c:func:`jbd2_journal_get_write_access()` / +:c:func:`jbd2_journal_get_undo_access()` as appropriate, this allows the +journalling layer to copy the unmodified +data if it needs to. After all the buffer may be part of a previously +uncommitted transaction. At this point you are at last ready to modify a +buffer, and once you are have done so you need to call +:c:func:`jbd2_journal_dirty_metadata`. Or if you've asked for access to a +buffer you now know is now longer required to be pushed back on the +device you can call :c:func:`jbd2_journal_forget` in much the same way as you +might have used :c:func:`bforget` in the past. + +A :c:func:`jbd2_journal_flush` may be called at any time to commit and +checkpoint all your transactions. + +Then at umount time , in your :c:func:`put_super` you can then call +:c:func:`jbd2_journal_destroy` to clean up your in-core journal object. + +Unfortunately there a couple of ways the journal layer can cause a +deadlock. The first thing to note is that each task can only have a +single outstanding transaction at any one time, remember nothing commits +until the outermost :c:func:`jbd2_journal_stop`. This means you must complete +the transaction at the end of each file/inode/address etc. operation you +perform, so that the journalling system isn't re-entered on another +journal. Since transactions can't be nested/batched across differing +journals, and another filesystem other than yours (say ext4) may be +modified in a later syscall. + +The second case to bear in mind is that :c:func:`jbd2_journal_start` can block +if there isn't enough space in the journal for your transaction (based +on the passed nblocks param) - when it blocks it merely(!) needs to wait +for transactions to complete and be committed from other tasks, so +essentially we are waiting for :c:func:`jbd2_journal_stop`. So to avoid +deadlocks you must treat :c:func:`jbd2_journal_start` / +:c:func:`jbd2_journal_stop` as if they were semaphores and include them in +your semaphore ordering rules to prevent +deadlocks. Note that :c:func:`jbd2_journal_extend` has similar blocking +behaviour to :c:func:`jbd2_journal_start` so you can deadlock here just as +easily as on :c:func:`jbd2_journal_start`. + +Try to reserve the right number of blocks the first time. ;-). This will +be the maximum number of blocks you are going to touch in this +transaction. I advise having a look at at least ext4_jbd.h to see the +basis on which ext4 uses to make these decisions. + +Another wriggle to watch out for is your on-disk block allocation +strategy. Why? Because, if you do a delete, you need to ensure you +haven't reused any of the freed blocks until the transaction freeing +these blocks commits. If you reused these blocks and crash happens, +there is no way to restore the contents of the reallocated blocks at the +end of the last fully committed transaction. One simple way of doing +this is to mark blocks as free in internal in-memory block allocation +structures only after the transaction freeing them commits. Ext4 uses +journal commit callback for this purpose. + +With journal commit callbacks you can ask the journalling layer to call +a callback function when the transaction is finally committed to disk, +so that you can do some of your own management. You ask the journalling +layer for calling the callback by simply setting +``journal->j_commit_callback`` function pointer and that function is +called after each transaction commit. You can also use +``transaction->t_private_list`` for attaching entries to a transaction +that need processing when the transaction commits. + +JBD2 also provides a way to block all transaction updates via +:c:func:`jbd2_journal_lock_updates()` / +:c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a +window with a clean and stable fs for a moment. E.g. + +:: + + + jbd2_journal_lock_updates() //stop new stuff happening.. + jbd2_journal_flush() // checkpoint everything. + ..do stuff on stable fs + jbd2_journal_unlock_updates() // carry on with filesystem use. + +The opportunities for abuse and DOS attacks with this should be obvious, +if you allow unprivileged userspace to trigger codepaths containing +these calls. + +Summary +~~~~~~~ + +Using the journal is a matter of wrapping the different context changes, +being each mount, each modification (transaction) and each changed +buffer to tell the journalling layer about them. + +Data Types +---------- + +The journalling layer uses typedefs to 'hide' the concrete definitions +of the structures used. As a client of the JBD2 layer you can just rely +on the using the pointer as a magic cookie of some sort. Obviously the +hiding is not enforced as this is 'C'. + +Structures +~~~~~~~~~~ + +.. kernel-doc:: include/linux/jbd2.h + :internal: + +Functions +--------- + +The functions here are split into two groups those that affect a journal +as a whole, and those which are used to manage transactions + +Journal Level +~~~~~~~~~~~~~ + +.. kernel-doc:: fs/jbd2/journal.c + :export: + +.. kernel-doc:: fs/jbd2/recovery.c + :internal: + +Transasction Level +~~~~~~~~~~~~~~~~~~ + +.. kernel-doc:: fs/jbd2/transaction.c + +See also +-------- + +`Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen +Tweedie <http://kernel.org/pub/linux/kernel/people/sct/ext3/journal-design.ps.gz>`__ + +`Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen +Tweedie <http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html>`__ + +splice API +========== + +splice is a method for moving blocks of data around inside the kernel, +without continually transferring them between the kernel and user space. + +.. kernel-doc:: fs/splice.c + +pipes API +========= + +Pipe interfaces are all for in-kernel (builtin image) use. They are not +exported for use by modules. + +.. kernel-doc:: include/linux/pipe_fs_i.h + :internal: + +.. kernel-doc:: fs/pipe.c diff --git a/Documentation/filesystems/nfs/idmapper.txt b/Documentation/filesystems/nfs/idmapper.txt index fe03d10bb79a..b86831acd583 100644 --- a/Documentation/filesystems/nfs/idmapper.txt +++ b/Documentation/filesystems/nfs/idmapper.txt @@ -55,7 +55,7 @@ request-key will find the first matching line and corresponding program. In this case, /some/other/program will handle all uid lookups and /usr/sbin/nfs.idmap will handle gid, user, and group lookups. -See <file:Documentation/security/keys-request-key.txt> for more information +See <file:Documentation/security/keys/request-key.rst> for more information about the request-key function. diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt index c9e884b52698..36f528a7fdd6 100644 --- a/Documentation/filesystems/overlayfs.txt +++ b/Documentation/filesystems/overlayfs.txt @@ -201,6 +201,40 @@ rightmost one and going left. In the above example lower1 will be the top, lower2 the middle and lower3 the bottom layer. +Sharing and copying layers +-------------------------- + +Lower layers may be shared among several overlay mounts and that is indeed +a very common practice. An overlay mount may use the same lower layer +path as another overlay mount and it may use a lower layer path that is +beneath or above the path of another overlay lower layer path. + +Using an upper layer path and/or a workdir path that are already used by +another overlay mount is not allowed and will fail with EBUSY. Using +partially overlapping paths is not allowed but will not fail with EBUSY. + +Mounting an overlay using an upper layer path, where the upper layer path +was previously used by another mounted overlay in combination with a +different lower layer path, is allowed, unless the "inodes index" feature +is enabled. + +With the "inodes index" feature, on the first time mount, an NFS file +handle of the lower layer root directory, along with the UUID of the lower +filesystem, are encoded and stored in the "trusted.overlay.origin" extended +attribute on the upper layer root directory. On subsequent mount attempts, +the lower root directory file handle and lower filesystem UUID are compared +to the stored origin in upper root directory. On failure to verify the +lower root origin, mount will fail with ESTALE. An overlayfs mount with +"inodes index" enabled will fail with EOPNOTSUPP if the lower filesystem +does not support NFS export, lower filesystem does not have a valid UUID or +if the upper filesystem does not support extended attributes. + +It is quite a common practice to copy overlay layers to a different +directory tree on the same or different underlying filesystem, and even +to a different machine. With the "inodes index" feature, trying to mount +the copied layers will fail the verification of the lower root file handle. + + Non-standard behavior --------------------- diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 4cddbce85ac9..adba21b5ada7 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -1786,12 +1786,16 @@ pair provide additional information particular to the objects they represent. pos: 0 flags: 02 mnt_id: 9 - tfd: 5 events: 1d data: ffffffffffffffff + tfd: 5 events: 1d data: ffffffffffffffff pos:0 ino:61af sdev:7 where 'tfd' is a target file descriptor number in decimal form, 'events' is events mask being watched and the 'data' is data associated with a target [see epoll(7) for more details]. + The 'pos' is current offset of the target file in decimal form + [see lseek(2)], 'ino' and 'sdev' are inode and device numbers + where target file resides, all in hex format. + Fsnotify files ~~~~~~~~~~~~~~ For inotify files the format is the following diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index f42b90687d40..73e7d91f03dc 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -576,7 +576,43 @@ should clear PG_Dirty and set PG_Writeback. It can be actually written at any point after PG_Dirty is clear. Once it is known to be safe, PG_Writeback is cleared. -Writeback makes use of a writeback_control structure... +Writeback makes use of a writeback_control structure to direct the +operations. This gives the the writepage and writepages operations some +information about the nature of and reason for the writeback request, +and the constraints under which it is being done. It is also used to +return information back to the caller about the result of a writepage or +writepages request. + +Handling errors during writeback +-------------------------------- +Most applications that do buffered I/O will periodically call a file +synchronization call (fsync, fdatasync, msync or sync_file_range) to +ensure that data written has made it to the backing store. When there +is an error during writeback, they expect that error to be reported when +a file sync request is made. After an error has been reported on one +request, subsequent requests on the same file descriptor should return +0, unless further writeback errors have occurred since the previous file +syncronization. + +Ideally, the kernel would report errors only on file descriptions on +which writes were done that subsequently failed to be written back. The +generic pagecache infrastructure does not track the file descriptions +that have dirtied each individual page however, so determining which +file descriptors should get back an error is not possible. + +Instead, the generic writeback error tracking infrastructure in the +kernel settles for reporting errors to fsync on all file descriptions +that were open at the time that the error occurred. In a situation with +multiple writers, all of them will get back an error on a subsequent fsync, +even if all of the writes done through that particular file descriptor +succeeded (or even if there were no writes on that file descriptor at all). + +Filesystems that wish to use this infrastructure should call +mapping_set_error to record the error in the address_space when it +occurs. Then, after writing back data from the pagecache in their +file->fsync operation, they should call file_check_and_advance_wb_err to +ensure that the struct file's error cursor has advanced to the correct +point in the stream of errors emitted by the backing device(s). struct address_space_operations ------------------------------- @@ -804,7 +840,8 @@ struct address_space_operations { The File Object =============== -A file object represents a file opened by a process. +A file object represents a file opened by a process. This is also known +as an "open file description" in POSIX parlance. struct file_operations @@ -887,7 +924,8 @@ otherwise noted. release: called when the last reference to an open file is closed - fsync: called by the fsync(2) system call + fsync: called by the fsync(2) system call. Also see the section above + entitled "Handling errors during writeback". fasync: called by the fcntl(2) system call when asynchronous (non-blocking) mode is enabled for a file @@ -1187,12 +1225,6 @@ The underlying reason for the above rules is to make sure, that a mount can be accurately replicated (e.g. umounting and mounting again) based on the information found in /proc/mounts. -A simple method of saving options at mount/remount time and showing -them is provided with the save_mount_options() and -generic_show_options() helper functions. Please note, that using -these may have drawbacks. For more info see header comments for these -functions in fs/namespace.c. - Resources ========= diff --git a/Documentation/flexible-arrays.txt b/Documentation/flexible-arrays.txt index df904aec9904..a0f2989dd804 100644 --- a/Documentation/flexible-arrays.txt +++ b/Documentation/flexible-arrays.txt @@ -1,6 +1,9 @@ +=================================== Using flexible arrays in the kernel -Last updated for 2.6.32 -Jonathan Corbet <corbet@lwn.net> +=================================== + +:Updated: Last updated for 2.6.32 +:Author: Jonathan Corbet <corbet@lwn.net> Large contiguous memory allocations can be unreliable in the Linux kernel. Kernel programmers will sometimes respond to this problem by allocating @@ -26,7 +29,7 @@ operation. It's also worth noting that flexible arrays do no internal locking at all; if concurrent access to an array is possible, then the caller must arrange for appropriate mutual exclusion. -The creation of a flexible array is done with: +The creation of a flexible array is done with:: #include <linux/flex_array.h> @@ -40,14 +43,14 @@ argument is passed directly to the internal memory allocation calls. With the current code, using flags to ask for high memory is likely to lead to notably unpleasant side effects. -It is also possible to define flexible arrays at compile time with: +It is also possible to define flexible arrays at compile time with:: DEFINE_FLEX_ARRAY(name, element_size, total); This macro will result in a definition of an array with the given name; the element size and total will be checked for validity at compile time. -Storing data into a flexible array is accomplished with a call to: +Storing data into a flexible array is accomplished with a call to:: int flex_array_put(struct flex_array *array, unsigned int element_nr, void *src, gfp_t flags); @@ -63,7 +66,7 @@ running in some sort of atomic context; in this situation, sleeping in the memory allocator would be a bad thing. That can be avoided by using GFP_ATOMIC for the flags value, but, often, there is a better way. The trick is to ensure that any needed memory allocations are done before -entering atomic context, using: +entering atomic context, using:: int flex_array_prealloc(struct flex_array *array, unsigned int start, unsigned int nr_elements, gfp_t flags); @@ -73,7 +76,7 @@ defined by start and nr_elements has been allocated. Thereafter, a flex_array_put() call on an element in that range is guaranteed not to block. -Getting data back out of the array is done with: +Getting data back out of the array is done with:: void *flex_array_get(struct flex_array *fa, unsigned int element_nr); @@ -89,7 +92,7 @@ involving that number probably result from use of unstored array entries. Note that, if array elements are allocated with __GFP_ZERO, they will be initialized to zero and this poisoning will not happen. -Individual elements in the array can be cleared with: +Individual elements in the array can be cleared with:: int flex_array_clear(struct flex_array *array, unsigned int element_nr); @@ -97,7 +100,7 @@ This function will set the given element to FLEX_ARRAY_FREE and return zero. If storage for the indicated element is not allocated for the array, flex_array_clear() will return -EINVAL instead. Note that clearing an element does not release the storage associated with it; to reduce the -allocated size of an array, call: +allocated size of an array, call:: int flex_array_shrink(struct flex_array *array); @@ -106,12 +109,12 @@ This function works by scanning the array for pages containing nothing but FLEX_ARRAY_FREE bytes, so (1) it can be expensive, and (2) it will not work if the array's pages are allocated with __GFP_ZERO. -It is possible to remove all elements of an array with a call to: +It is possible to remove all elements of an array with a call to:: void flex_array_free_parts(struct flex_array *array); This call frees all elements, but leaves the array itself in place. -Freeing the entire array is done with: +Freeing the entire array is done with:: void flex_array_free(struct flex_array *array); diff --git a/Documentation/futex-requeue-pi.txt b/Documentation/futex-requeue-pi.txt index 77b36f59d16b..14ab5787b9a7 100644 --- a/Documentation/futex-requeue-pi.txt +++ b/Documentation/futex-requeue-pi.txt @@ -1,5 +1,6 @@ +================ Futex Requeue PI ----------------- +================ Requeueing of tasks from a non-PI futex to a PI futex requires special handling in order to ensure the underlying rt_mutex is never @@ -20,28 +21,28 @@ implementation would wake the highest-priority waiter, and leave the rest to the natural wakeup inherent in unlocking the mutex associated with the condvar. -Consider the simplified glibc calls: - -/* caller must lock mutex */ -pthread_cond_wait(cond, mutex) -{ - lock(cond->__data.__lock); - unlock(mutex); - do { - unlock(cond->__data.__lock); - futex_wait(cond->__data.__futex); - lock(cond->__data.__lock); - } while(...) - unlock(cond->__data.__lock); - lock(mutex); -} - -pthread_cond_broadcast(cond) -{ - lock(cond->__data.__lock); - unlock(cond->__data.__lock); - futex_requeue(cond->data.__futex, cond->mutex); -} +Consider the simplified glibc calls:: + + /* caller must lock mutex */ + pthread_cond_wait(cond, mutex) + { + lock(cond->__data.__lock); + unlock(mutex); + do { + unlock(cond->__data.__lock); + futex_wait(cond->__data.__futex); + lock(cond->__data.__lock); + } while(...) + unlock(cond->__data.__lock); + lock(mutex); + } + + pthread_cond_broadcast(cond) + { + lock(cond->__data.__lock); + unlock(cond->__data.__lock); + futex_requeue(cond->data.__futex, cond->mutex); + } Once pthread_cond_broadcast() requeues the tasks, the cond->mutex has waiters. Note that pthread_cond_wait() attempts to lock the @@ -53,29 +54,29 @@ In order to support PI-aware pthread_condvar's, the kernel needs to be able to requeue tasks to PI futexes. This support implies that upon a successful futex_wait system call, the caller would return to user space already holding the PI futex. The glibc implementation -would be modified as follows: - - -/* caller must lock mutex */ -pthread_cond_wait_pi(cond, mutex) -{ - lock(cond->__data.__lock); - unlock(mutex); - do { - unlock(cond->__data.__lock); - futex_wait_requeue_pi(cond->__data.__futex); - lock(cond->__data.__lock); - } while(...) - unlock(cond->__data.__lock); - /* the kernel acquired the mutex for us */ -} - -pthread_cond_broadcast_pi(cond) -{ - lock(cond->__data.__lock); - unlock(cond->__data.__lock); - futex_requeue_pi(cond->data.__futex, cond->mutex); -} +would be modified as follows:: + + + /* caller must lock mutex */ + pthread_cond_wait_pi(cond, mutex) + { + lock(cond->__data.__lock); + unlock(mutex); + do { + unlock(cond->__data.__lock); + futex_wait_requeue_pi(cond->__data.__futex); + lock(cond->__data.__lock); + } while(...) + unlock(cond->__data.__lock); + /* the kernel acquired the mutex for us */ + } + + pthread_cond_broadcast_pi(cond) + { + lock(cond->__data.__lock); + unlock(cond->__data.__lock); + futex_requeue_pi(cond->data.__futex, cond->mutex); + } The actual glibc implementation will likely test for PI and make the necessary changes inside the existing calls rather than creating new diff --git a/Documentation/gcc-plugins.txt b/Documentation/gcc-plugins.txt index 433eaefb4aa1..8502f24396fb 100644 --- a/Documentation/gcc-plugins.txt +++ b/Documentation/gcc-plugins.txt @@ -1,14 +1,15 @@ +========================= GCC plugin infrastructure ========================= -1. Introduction -=============== +Introduction +============ GCC plugins are loadable modules that provide extra features to the -compiler [1]. They are useful for runtime instrumentation and static analysis. +compiler [1]_. They are useful for runtime instrumentation and static analysis. We can analyse, change and add further code during compilation via -callbacks [2], GIMPLE [3], IPA [4] and RTL passes [5]. +callbacks [2]_, GIMPLE [3]_, IPA [4]_ and RTL passes [5]_. The GCC plugin infrastructure of the kernel supports all gcc versions from 4.5 to 6.0, building out-of-tree modules, cross-compilation and building in a @@ -21,56 +22,61 @@ and versions 4.8+ can only be compiled by a C++ compiler. Currently the GCC plugin infrastructure supports only the x86, arm, arm64 and powerpc architectures. -This infrastructure was ported from grsecurity [6] and PaX [7]. +This infrastructure was ported from grsecurity [6]_ and PaX [7]_. -- -[1] https://gcc.gnu.org/onlinedocs/gccint/Plugins.html -[2] https://gcc.gnu.org/onlinedocs/gccint/Plugin-API.html#Plugin-API -[3] https://gcc.gnu.org/onlinedocs/gccint/GIMPLE.html -[4] https://gcc.gnu.org/onlinedocs/gccint/IPA.html -[5] https://gcc.gnu.org/onlinedocs/gccint/RTL.html -[6] https://grsecurity.net/ -[7] https://pax.grsecurity.net/ +.. [1] https://gcc.gnu.org/onlinedocs/gccint/Plugins.html +.. [2] https://gcc.gnu.org/onlinedocs/gccint/Plugin-API.html#Plugin-API +.. [3] https://gcc.gnu.org/onlinedocs/gccint/GIMPLE.html +.. [4] https://gcc.gnu.org/onlinedocs/gccint/IPA.html +.. [5] https://gcc.gnu.org/onlinedocs/gccint/RTL.html +.. [6] https://grsecurity.net/ +.. [7] https://pax.grsecurity.net/ + + +Files +===== -2. Files -======== +**$(src)/scripts/gcc-plugins** -$(src)/scripts/gcc-plugins This is the directory of the GCC plugins. -$(src)/scripts/gcc-plugins/gcc-common.h +**$(src)/scripts/gcc-plugins/gcc-common.h** + This is a compatibility header for GCC plugins. It should be always included instead of individual gcc headers. -$(src)/scripts/gcc-plugin.sh +**$(src)/scripts/gcc-plugin.sh** + This script checks the availability of the included headers in gcc-common.h and chooses the proper host compiler to build the plugins (gcc-4.7 can be built by either gcc or g++). -$(src)/scripts/gcc-plugins/gcc-generate-gimple-pass.h -$(src)/scripts/gcc-plugins/gcc-generate-ipa-pass.h -$(src)/scripts/gcc-plugins/gcc-generate-simple_ipa-pass.h -$(src)/scripts/gcc-plugins/gcc-generate-rtl-pass.h +**$(src)/scripts/gcc-plugins/gcc-generate-gimple-pass.h, +$(src)/scripts/gcc-plugins/gcc-generate-ipa-pass.h, +$(src)/scripts/gcc-plugins/gcc-generate-simple_ipa-pass.h, +$(src)/scripts/gcc-plugins/gcc-generate-rtl-pass.h** + These headers automatically generate the registration structures for GIMPLE, SIMPLE_IPA, IPA and RTL passes. They support all gcc versions from 4.5 to 6.0. They should be preferred to creating the structures by hand. -3. Usage -======== +Usage +===== You must install the gcc plugin headers for your gcc version, -e.g., on Ubuntu for gcc-4.9: +e.g., on Ubuntu for gcc-4.9:: apt-get install gcc-4.9-plugin-dev -Enable a GCC plugin based feature in the kernel config: +Enable a GCC plugin based feature in the kernel config:: CONFIG_GCC_PLUGIN_CYC_COMPLEXITY = y -To compile only the plugin(s): +To compile only the plugin(s):: make gcc-plugins diff --git a/Documentation/gpu/drm-internals.rst b/Documentation/gpu/drm-internals.rst index babfb6143bd9..0d936c67bf7d 100644 --- a/Documentation/gpu/drm-internals.rst +++ b/Documentation/gpu/drm-internals.rst @@ -98,6 +98,9 @@ DRIVER_ATOMIC implement appropriate obj->atomic_get_property() vfuncs for any modeset objects with driver specific properties. +DRIVER_SYNCOBJ + Driver support drm sync objects. + Major, Minor and Patchlevel ~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -149,60 +152,15 @@ Device Instance and Driver Handling Driver Load ----------- -IRQ Registration -~~~~~~~~~~~~~~~~ - -The DRM core tries to facilitate IRQ handler registration and -unregistration by providing :c:func:`drm_irq_install()` and -:c:func:`drm_irq_uninstall()` functions. Those functions only -support a single interrupt per device, devices that use more than one -IRQs need to be handled manually. - -Managed IRQ Registration -'''''''''''''''''''''''' - -:c:func:`drm_irq_install()` starts by calling the irq_preinstall -driver operation. The operation is optional and must make sure that the -interrupt will not get fired by clearing all pending interrupt flags or -disabling the interrupt. - -The passed-in IRQ will then be requested by a call to -:c:func:`request_irq()`. If the DRIVER_IRQ_SHARED driver feature -flag is set, a shared (IRQF_SHARED) IRQ handler will be requested. - -The IRQ handler function must be provided as the mandatory irq_handler -driver operation. It will get passed directly to -:c:func:`request_irq()` and thus has the same prototype as all IRQ -handlers. It will get called with a pointer to the DRM device as the -second argument. - -Finally the function calls the optional irq_postinstall driver -operation. The operation usually enables interrupts (excluding the -vblank interrupt, which is enabled separately), but drivers may choose -to enable/disable interrupts at a different time. - -:c:func:`drm_irq_uninstall()` is similarly used to uninstall an -IRQ handler. It starts by waking up all processes waiting on a vblank -interrupt to make sure they don't hang, and then calls the optional -irq_uninstall driver operation. The operation must disable all hardware -interrupts. Finally the function frees the IRQ by calling -:c:func:`free_irq()`. - -Manual IRQ Registration -''''''''''''''''''''''' - -Drivers that require multiple interrupt handlers can't use the managed -IRQ registration functions. In that case IRQs must be registered and -unregistered manually (usually with the :c:func:`request_irq()` and -:c:func:`free_irq()` functions, or their :c:func:`devm_request_irq()` and -:c:func:`devm_free_irq()` equivalents). - -When manually registering IRQs, drivers must not set the -DRIVER_HAVE_IRQ driver feature flag, and must not provide the -irq_handler driver operation. They must set the :c:type:`struct -drm_device <drm_device>` irq_enabled field to 1 upon -registration of the IRQs, and clear it to 0 after unregistering the -IRQs. + +IRQ Helper Library +~~~~~~~~~~~~~~~~~~ + +.. kernel-doc:: drivers/gpu/drm/drm_irq.c + :doc: irq helpers + +.. kernel-doc:: drivers/gpu/drm/drm_irq.c + :export: Memory Manager Initialization ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/Documentation/gpu/drm-kms-helpers.rst b/Documentation/gpu/drm-kms-helpers.rst index c075aadd7078..7c5e2549a58a 100644 --- a/Documentation/gpu/drm-kms-helpers.rst +++ b/Documentation/gpu/drm-kms-helpers.rst @@ -143,6 +143,12 @@ Bridge Helper Reference .. kernel-doc:: drivers/gpu/drm/drm_bridge.c :export: +Panel-Bridge Helper Reference +----------------------------- + +.. kernel-doc:: drivers/gpu/drm/bridge/panel.c + :export: + .. _drm_panel_helper: Panel Helper Reference diff --git a/Documentation/gpu/drm-kms.rst b/Documentation/gpu/drm-kms.rst index bfecd21a8cdf..2d77c9580164 100644 --- a/Documentation/gpu/drm-kms.rst +++ b/Documentation/gpu/drm-kms.rst @@ -612,8 +612,8 @@ operation handler. Vertical Blanking and Interrupt Handling Functions Reference ------------------------------------------------------------ -.. kernel-doc:: include/drm/drm_irq.h +.. kernel-doc:: include/drm/drm_vblank.h :internal: -.. kernel-doc:: drivers/gpu/drm/drm_irq.c +.. kernel-doc:: drivers/gpu/drm/drm_vblank.c :export: diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst index 96b9c34c21e4..9412798645c1 100644 --- a/Documentation/gpu/drm-mm.rst +++ b/Documentation/gpu/drm-mm.rst @@ -484,3 +484,15 @@ DRM Cache Handling .. kernel-doc:: drivers/gpu/drm/drm_cache.c :export: + +DRM Sync Objects +=========================== + +.. kernel-doc:: drivers/gpu/drm/drm_syncobj.c + :doc: Overview + +.. kernel-doc:: include/drm/drm_syncobj.h + :export: + +.. kernel-doc:: drivers/gpu/drm/drm_syncobj.c + :export: diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst index c572f092739e..35d673bf9b56 100644 --- a/Documentation/gpu/index.rst +++ b/Documentation/gpu/index.rst @@ -12,6 +12,8 @@ Linux GPU Driver Developer's Guide drm-uapi i915 meson + pl111 + tegra tinydrm vc4 vga-switcheroo diff --git a/Documentation/gpu/pl111.rst b/Documentation/gpu/pl111.rst new file mode 100644 index 000000000000..9b03736d33dd --- /dev/null +++ b/Documentation/gpu/pl111.rst @@ -0,0 +1,6 @@ +========================================== + drm/pl111 ARM PrimeCell PL111 CLCD Driver +========================================== + +.. kernel-doc:: drivers/gpu/drm/pl111/pl111_drv.c + :doc: ARM PrimeCell PL111 CLCD Driver diff --git a/Documentation/gpu/tegra.rst b/Documentation/gpu/tegra.rst new file mode 100644 index 000000000000..d2ed8938ca43 --- /dev/null +++ b/Documentation/gpu/tegra.rst @@ -0,0 +1,178 @@ +=============================================== + drm/tegra NVIDIA Tegra GPU and display driver +=============================================== + +NVIDIA Tegra SoCs support a set of display, graphics and video functions via +the host1x controller. host1x supplies command streams, gathered from a push +buffer provided directly by the CPU, to its clients via channels. Software, +or blocks amongst themselves, can use syncpoints for synchronization. + +Up until, but not including, Tegra124 (aka Tegra K1) the drm/tegra driver +supports the built-in GPU, comprised of the gr2d and gr3d engines. Starting +with Tegra124 the GPU is based on the NVIDIA desktop GPU architecture and +supported by the drm/nouveau driver. + +The drm/tegra driver supports NVIDIA Tegra SoC generations since Tegra20. It +has three parts: + + - A host1x driver that provides infrastructure and access to the host1x + services. + + - A KMS driver that supports the display controllers as well as a number of + outputs, such as RGB, HDMI, DSI, and DisplayPort. + + - A set of custom userspace IOCTLs that can be used to submit jobs to the + GPU and video engines via host1x. + +Driver Infrastructure +===================== + +The various host1x clients need to be bound together into a logical device in +order to expose their functionality to users. The infrastructure that supports +this is implemented in the host1x driver. When a driver is registered with the +infrastructure it provides a list of compatible strings specifying the devices +that it needs. The infrastructure creates a logical device and scan the device +tree for matching device nodes, adding the required clients to a list. Drivers +for individual clients register with the infrastructure as well and are added +to the logical host1x device. + +Once all clients are available, the infrastructure will initialize the logical +device using a driver-provided function which will set up the bits specific to +the subsystem and in turn initialize each of its clients. + +Similarly, when one of the clients is unregistered, the infrastructure will +destroy the logical device by calling back into the driver, which ensures that +the subsystem specific bits are torn down and the clients destroyed in turn. + +Host1x Infrastructure Reference +------------------------------- + +.. kernel-doc:: include/linux/host1x.h + +.. kernel-doc:: drivers/gpu/host1x/bus.c + :export: + +Host1x Syncpoint Reference +-------------------------- + +.. kernel-doc:: drivers/gpu/host1x/syncpt.c + :export: + +KMS driver +========== + +The display hardware has remained mostly backwards compatible over the various +Tegra SoC generations, up until Tegra186 which introduces several changes that +make it difficult to support with a parameterized driver. + +Display Controllers +------------------- + +Tegra SoCs have two display controllers, each of which can be associated with +zero or more outputs. Outputs can also share a single display controller, but +only if they run with compatible display timings. Two display controllers can +also share a single framebuffer, allowing cloned configurations even if modes +on two outputs don't match. A display controller is modelled as a CRTC in KMS +terms. + +On Tegra186, the number of display controllers has been increased to three. A +display controller can no longer drive all of the outputs. While two of these +controllers can drive both DSI outputs and both SOR outputs, the third cannot +drive any DSI. + +Windows +~~~~~~~ + +A display controller controls a set of windows that can be used to composite +multiple buffers onto the screen. While it is possible to assign arbitrary Z +ordering to individual windows (by programming the corresponding blending +registers), this is currently not supported by the driver. Instead, it will +assume a fixed Z ordering of the windows (window A is the root window, that +is, the lowest, while windows B and C are overlaid on top of window A). The +overlay windows support multiple pixel formats and can automatically convert +from YUV to RGB at scanout time. This makes them useful for displaying video +content. In KMS, each window is modelled as a plane. Each display controller +has a hardware cursor that is exposed as a cursor plane. + +Outputs +------- + +The type and number of supported outputs varies between Tegra SoC generations. +All generations support at least HDMI. While earlier generations supported the +very simple RGB interfaces (one per display controller), recent generations no +longer do and instead provide standard interfaces such as DSI and eDP/DP. + +Outputs are modelled as a composite encoder/connector pair. + +RGB/LVDS +~~~~~~~~ + +This interface is no longer available since Tegra124. It has been replaced by +the more standard DSI and eDP interfaces. + +HDMI +~~~~ + +HDMI is supported on all Tegra SoCs. Starting with Tegra210, HDMI is provided +by the versatile SOR output, which supports eDP, DP and HDMI. The SOR is able +to support HDMI 2.0, though support for this is currently not merged. + +DSI +~~~ + +Although Tegra has supported DSI since Tegra30, the controller has changed in +several ways in Tegra114. Since none of the publicly available development +boards prior to Dalmore (Tegra114) have made use of DSI, only Tegra114 and +later are supported by the drm/tegra driver. + +eDP/DP +~~~~~~ + +eDP was first introduced in Tegra124 where it was used to drive the display +panel for notebook form factors. Tegra210 added support for full DisplayPort +support, though this is currently not implemented in the drm/tegra driver. + +Userspace Interface +=================== + +The userspace interface provided by drm/tegra allows applications to create +GEM buffers, access and control syncpoints as well as submit command streams +to host1x. + +GEM Buffers +----------- + +The ``DRM_IOCTL_TEGRA_GEM_CREATE`` IOCTL is used to create a GEM buffer object +with Tegra-specific flags. This is useful for buffers that should be tiled, or +that are to be scanned out upside down (useful for 3D content). + +After a GEM buffer object has been created, its memory can be mapped by an +application using the mmap offset returned by the ``DRM_IOCTL_TEGRA_GEM_MMAP`` +IOCTL. + +Syncpoints +---------- + +The current value of a syncpoint can be obtained by executing the +``DRM_IOCTL_TEGRA_SYNCPT_READ`` IOCTL. Incrementing the syncpoint is achieved +using the ``DRM_IOCTL_TEGRA_SYNCPT_INCR`` IOCTL. + +Userspace can also request blocking on a syncpoint. To do so, it needs to +execute the ``DRM_IOCTL_TEGRA_SYNCPT_WAIT`` IOCTL, specifying the value of +the syncpoint to wait for. The kernel will release the application when the +syncpoint reaches that value or after a specified timeout. + +Command Stream Submission +------------------------- + +Before an application can submit command streams to host1x it needs to open a +channel to an engine using the ``DRM_IOCTL_TEGRA_OPEN_CHANNEL`` IOCTL. Client +IDs are used to identify the target of the channel. When a channel is no +longer needed, it can be closed using the ``DRM_IOCTL_TEGRA_CLOSE_CHANNEL`` +IOCTL. To retrieve the syncpoint associated with a channel, an application +can use the ``DRM_IOCTL_TEGRA_GET_SYNCPT``. + +After opening a channel, submitting command streams is easy. The application +writes commands into the memory backing a GEM buffer object and passes these +to the ``DRM_IOCTL_TEGRA_SUBMIT`` IOCTL along with various other parameters, +such as the syncpoints or relocations used in the job submission. diff --git a/Documentation/gpu/todo.rst b/Documentation/gpu/todo.rst index 1bdb7356a310..1ae42006deea 100644 --- a/Documentation/gpu/todo.rst +++ b/Documentation/gpu/todo.rst @@ -177,19 +177,6 @@ following drivers still use ``struct_mutex``: ``msm``, ``omapdrm`` and Contact: Daniel Vetter, respective driver maintainers -Switch to drm_connector_list_iter for any connector_list walking ----------------------------------------------------------------- - -Connectors can be hotplugged, and we now have a special list of helpers to walk -the connector_list in a race-free fashion, without incurring deadlocks on -mutexes and other fun stuff. - -Unfortunately most drivers are not converted yet. At least all those supporting -DP MST hotplug should be converted, since for those drivers the difference -matters. See drm_for_each_connector_iter() vs. drm_for_each_connector(). - -Contact: Daniel Vetter - Core refactorings ================= @@ -228,7 +215,7 @@ The DRM reference documentation is still lacking kerneldoc in a few areas. The task would be to clean up interfaces like moving functions around between files to better group them and improving the interfaces like dropping return values for functions that never fail. Then write kerneldoc for all exported -functions and an overview section and integrate it all into the drm DocBook. +functions and an overview section and integrate it all into the drm book. See https://dri.freedesktop.org/docs/drm/ for what's there already. diff --git a/Documentation/highuid.txt b/Documentation/highuid.txt index 6bad6f1d1cac..6ee70465c0ea 100644 --- a/Documentation/highuid.txt +++ b/Documentation/highuid.txt @@ -1,4 +1,9 @@ -Notes on the change from 16-bit UIDs to 32-bit UIDs: +=================================================== +Notes on the change from 16-bit UIDs to 32-bit UIDs +=================================================== + +:Author: Chris Wing <wingc@umich.edu> +:Last updated: January 11, 2000 - kernel code MUST take into account __kernel_uid_t and __kernel_uid32_t when communicating between user and kernel space in an ioctl or data @@ -28,30 +33,34 @@ What's left to be done for 32-bit UIDs on all Linux architectures: uses the 32-bit UID system calls properly otherwise. This affects at least: - iBCS on Intel - sparc32 emulation on sparc64 - (need to support whatever new 32-bit UID system calls are added to - sparc32) + - iBCS on Intel + + - sparc32 emulation on sparc64 + (need to support whatever new 32-bit UID system calls are added to + sparc32) - Validate that all filesystems behave properly. At present, 32-bit UIDs _should_ work for: - ext2 - ufs - isofs - nfs - coda - udf + + - ext2 + - ufs + - isofs + - nfs + - coda + - udf Ioctl() fixups have been made for: - ncpfs - smbfs + + - ncpfs + - smbfs Filesystems with simple fixups to prevent 16-bit UID wraparound: - minix - sysv - qnx4 + + - minix + - sysv + - qnx4 Other filesystems have not been checked yet. @@ -69,9 +78,3 @@ What's left to be done for 32-bit UIDs on all Linux architectures: - make sure that the UID mapping feature of AX25 networking works properly (it should be safe because it's always used a 32-bit integer to communicate between user and kernel) - - -Chris Wing -wingc@umich.edu - -last updated: January 11, 2000 diff --git a/Documentation/hw_random.txt b/Documentation/hw_random.txt index fce1634907d0..121de96e395e 100644 --- a/Documentation/hw_random.txt +++ b/Documentation/hw_random.txt @@ -1,90 +1,105 @@ -Introduction: - - The hw_random framework is software that makes use of a - special hardware feature on your CPU or motherboard, - a Random Number Generator (RNG). The software has two parts: - a core providing the /dev/hwrng character device and its - sysfs support, plus a hardware-specific driver that plugs - into that core. - - To make the most effective use of these mechanisms, you - should download the support software as well. Download the - latest version of the "rng-tools" package from the - hw_random driver's official Web site: - - http://sourceforge.net/projects/gkernel/ - - Those tools use /dev/hwrng to fill the kernel entropy pool, - which is used internally and exported by the /dev/urandom and - /dev/random special files. - -Theory of operation: - - CHARACTER DEVICE. Using the standard open() - and read() system calls, you can read random data from - the hardware RNG device. This data is NOT CHECKED by any - fitness tests, and could potentially be bogus (if the - hardware is faulty or has been tampered with). Data is only - output if the hardware "has-data" flag is set, but nevertheless - a security-conscious person would run fitness tests on the - data before assuming it is truly random. - - The rng-tools package uses such tests in "rngd", and lets you - run them by hand with a "rngtest" utility. - - /dev/hwrng is char device major 10, minor 183. - - CLASS DEVICE. There is a /sys/class/misc/hw_random node with - two unique attributes, "rng_available" and "rng_current". The - "rng_available" attribute lists the hardware-specific drivers - available, while "rng_current" lists the one which is currently - connected to /dev/hwrng. If your system has more than one - RNG available, you may change the one used by writing a name from - the list in "rng_available" into "rng_current". +========================================================== +Linux support for random number generator in i8xx chipsets +========================================================== + +Introduction +============ + +The hw_random framework is software that makes use of a +special hardware feature on your CPU or motherboard, +a Random Number Generator (RNG). The software has two parts: +a core providing the /dev/hwrng character device and its +sysfs support, plus a hardware-specific driver that plugs +into that core. + +To make the most effective use of these mechanisms, you +should download the support software as well. Download the +latest version of the "rng-tools" package from the +hw_random driver's official Web site: + + http://sourceforge.net/projects/gkernel/ + +Those tools use /dev/hwrng to fill the kernel entropy pool, +which is used internally and exported by the /dev/urandom and +/dev/random special files. + +Theory of operation +=================== + +CHARACTER DEVICE. Using the standard open() +and read() system calls, you can read random data from +the hardware RNG device. This data is NOT CHECKED by any +fitness tests, and could potentially be bogus (if the +hardware is faulty or has been tampered with). Data is only +output if the hardware "has-data" flag is set, but nevertheless +a security-conscious person would run fitness tests on the +data before assuming it is truly random. + +The rng-tools package uses such tests in "rngd", and lets you +run them by hand with a "rngtest" utility. + +/dev/hwrng is char device major 10, minor 183. + +CLASS DEVICE. There is a /sys/class/misc/hw_random node with +two unique attributes, "rng_available" and "rng_current". The +"rng_available" attribute lists the hardware-specific drivers +available, while "rng_current" lists the one which is currently +connected to /dev/hwrng. If your system has more than one +RNG available, you may change the one used by writing a name from +the list in "rng_available" into "rng_current". ========================================================================== - Hardware driver for Intel/AMD/VIA Random Number Generators (RNG) - Copyright 2000,2001 Jeff Garzik <jgarzik@pobox.com> - Copyright 2000,2001 Philipp Rumpf <prumpf@mandrakesoft.com> +Hardware driver for Intel/AMD/VIA Random Number Generators (RNG) + - Copyright 2000,2001 Jeff Garzik <jgarzik@pobox.com> + - Copyright 2000,2001 Philipp Rumpf <prumpf@mandrakesoft.com> -About the Intel RNG hardware, from the firmware hub datasheet: - The Firmware Hub integrates a Random Number Generator (RNG) - using thermal noise generated from inherently random quantum - mechanical properties of silicon. When not generating new random - bits the RNG circuitry will enter a low power state. Intel will - provide a binary software driver to give third party software - access to our RNG for use as a security feature. At this time, - the RNG is only to be used with a system in an OS-present state. +About the Intel RNG hardware, from the firmware hub datasheet +============================================================= -Intel RNG Driver notes: +The Firmware Hub integrates a Random Number Generator (RNG) +using thermal noise generated from inherently random quantum +mechanical properties of silicon. When not generating new random +bits the RNG circuitry will enter a low power state. Intel will +provide a binary software driver to give third party software +access to our RNG for use as a security feature. At this time, +the RNG is only to be used with a system in an OS-present state. - * FIXME: support poll(2) +Intel RNG Driver notes +====================== - NOTE: request_mem_region was removed, for three reasons: - 1) Only one RNG is supported by this driver, 2) The location - used by the RNG is a fixed location in MMIO-addressable memory, +FIXME: support poll(2) + +.. note:: + + request_mem_region was removed, for three reasons: + + 1) Only one RNG is supported by this driver; + 2) The location used by the RNG is a fixed location in + MMIO-addressable memory; 3) users with properly working BIOS e820 handling will always - have the region in which the RNG is located reserved, so - request_mem_region calls always fail for proper setups. - However, for people who use mem=XX, BIOS e820 information is - -not- in /proc/iomem, and request_mem_region(RNG_ADDR) can - succeed. + have the region in which the RNG is located reserved, so + request_mem_region calls always fail for proper setups. + However, for people who use mem=XX, BIOS e820 information is + **not** in /proc/iomem, and request_mem_region(RNG_ADDR) can + succeed. -Driver details: +Driver details +============== - Based on: +Based on: Intel 82802AB/82802AC Firmware Hub (FWH) Datasheet - May 1999 Order Number: 290658-002 R + May 1999 Order Number: 290658-002 R - Intel 82802 Firmware Hub: Random Number Generator +Intel 82802 Firmware Hub: + Random Number Generator Programmer's Reference Manual - December 1999 Order Number: 298029-001 R + December 1999 Order Number: 298029-001 R - Intel 82802 Firmware HUB Random Number Generator Driver +Intel 82802 Firmware HUB Random Number Generator Driver Copyright (c) 2000 Matt Sottek <msottek@quiknet.com> - Special thanks to Matt Sottek. I did the "guts", he - did the "brains" and all the testing. +Special thanks to Matt Sottek. I did the "guts", he +did the "brains" and all the testing. diff --git a/Documentation/hwmon/ads1015 b/Documentation/hwmon/ads1015 index 063b80d857b1..02d2a459385f 100644 --- a/Documentation/hwmon/ads1015 +++ b/Documentation/hwmon/ads1015 @@ -40,7 +40,7 @@ By default all inputs are exported. Platform Data ------------- -In linux/i2c/ads1015.h platform data is defined, channel_data contains +In linux/platform_data/ads1015.h platform data is defined, channel_data contains configuration data for the used input combinations: - pga is the programmable gain amplifier (values are full scale) 0: +/- 6.144 V diff --git a/Documentation/hwmon/adt7475 b/Documentation/hwmon/adt7475 index 0502f2b464e1..09d73a10644c 100644 --- a/Documentation/hwmon/adt7475 +++ b/Documentation/hwmon/adt7475 @@ -109,6 +109,15 @@ fan speed) is applied. PWM values range from 0 (off) to 255 (full speed). Fan speed may be set to maximum when the temperature sensor associated with the PWM control exceeds temp#_max. +At Tmin - hysteresis the PWM output can either be off (0% duty cycle) or at the +minimum (i.e. auto_point1_pwm). This behaviour can be configured using the +pwm[1-*]_stall_disable sysfs attribute. A value of 0 means the fans will shut +off. A value of 1 means the fans will run at auto_point1_pwm. + +The responsiveness of the ADT747x to temperature changes can be configured. +This allows smoothing of the fan speed transition. To set the transition time +set the value in ms in the temp[1-*]_smoothing sysfs attribute. + Notes ----- diff --git a/Documentation/hwmon/ir35221 b/Documentation/hwmon/ir35221 new file mode 100644 index 000000000000..f7e112752c04 --- /dev/null +++ b/Documentation/hwmon/ir35221 @@ -0,0 +1,87 @@ +Kernel driver ir35221 +===================== + +Supported chips: + * Infinion IR35221 + Prefix: 'ir35221' + Addresses scanned: - + Datasheet: Datasheet is not publicly available. + +Author: Samuel Mendoza-Jonas <sam@mendozajonas.com> + + +Description +----------- + +IR35221 is a Digital DC-DC Multiphase Converter + + +Usage Notes +----------- + +This driver does not probe for PMBus devices. You will have to instantiate +devices explicitly. + +Example: the following commands will load the driver for an IR35221 +at address 0x70 on I2C bus #4: + +# modprobe ir35221 +# echo ir35221 0x70 > /sys/bus/i2c/devices/i2c-4/new_device + + +Sysfs attributes +---------------- + +curr1_label "iin" +curr1_input Measured input current +curr1_max Maximum current +curr1_max_alarm Current high alarm + +curr[2-3]_label "iout[1-2]" +curr[2-3]_input Measured output current +curr[2-3]_crit Critical maximum current +curr[2-3]_crit_alarm Current critical high alarm +curr[2-3]_highest Highest output current +curr[2-3]_lowest Lowest output current +curr[2-3]_max Maximum current +curr[2-3]_max_alarm Current high alarm + +in1_label "vin" +in1_input Measured input voltage +in1_crit Critical maximum input voltage +in1_crit_alarm Input voltage critical high alarm +in1_highest Highest input voltage +in1_lowest Lowest input voltage +in1_min Minimum input voltage +in1_min_alarm Input voltage low alarm + +in[2-3]_label "vout[1-2]" +in[2-3]_input Measured output voltage +in[2-3]_lcrit Critical minimum output voltage +in[2-3]_lcrit_alarm Output voltage critical low alarm +in[2-3]_crit Critical maximum output voltage +in[2-3]_crit_alarm Output voltage critical high alarm +in[2-3]_highest Highest output voltage +in[2-3]_lowest Lowest output voltage +in[2-3]_max Maximum output voltage +in[2-3]_max_alarm Output voltage high alarm +in[2-3]_min Minimum output voltage +in[2-3]_min_alarm Output voltage low alarm + +power1_label "pin" +power1_input Measured input power +power1_alarm Input power high alarm +power1_max Input power limit + +power[2-3]_label "pout[1-2]" +power[2-3]_input Measured output power +power[2-3]_max Output power limit +power[2-3]_max_alarm Output power high alarm + +temp[1-2]_input Measured temperature +temp[1-2]_crit Critical high temperature +temp[1-2]_crit_alarm Chip temperature critical high alarm +temp[1-2]_highest Highest temperature +temp[1-2]_lowest Lowest temperature +temp[1-2]_max Maximum temperature +temp[1-2]_max_alarm Chip temperature high alarm diff --git a/Documentation/hwmon/ltc4245 b/Documentation/hwmon/ltc4245 index b478b0864965..4ca7a9da09f9 100644 --- a/Documentation/hwmon/ltc4245 +++ b/Documentation/hwmon/ltc4245 @@ -96,7 +96,7 @@ slowly, -EAGAIN will be returned when you read the sysfs attribute containing the sensor reading. The LTC4245 chip can be configured to sample all GPIO pins with two methods: -1) platform data -- see include/linux/i2c/ltc4245.h +1) platform data -- see include/linux/platform_data/ltc4245.h 2) OF device tree -- add the "ltc4245,use-extra-gpios" property to each chip The default mode of operation is to sample a single GPIO pin. diff --git a/Documentation/hwmon/pmbus-core b/Documentation/hwmon/pmbus-core index 31e4720fed18..8ed10e9ddfb5 100644 --- a/Documentation/hwmon/pmbus-core +++ b/Documentation/hwmon/pmbus-core @@ -253,7 +253,7 @@ Specifically, it provides the following information. PMBus driver platform data ========================== -PMBus platform data is defined in include/linux/i2c/pmbus.h. Platform data +PMBus platform data is defined in include/linux/pmbus.h. Platform data currently only provides a flag field with a single bit used. #define PMBUS_SKIP_STATUS_CHECK (1 << 0) diff --git a/Documentation/hwspinlock.txt b/Documentation/hwspinlock.txt index 61c1ee98e59f..ed640a278185 100644 --- a/Documentation/hwspinlock.txt +++ b/Documentation/hwspinlock.txt @@ -1,6 +1,9 @@ +=========================== Hardware Spinlock Framework +=========================== -1. Introduction +Introduction +============ Hardware spinlock modules provide hardware assistance for synchronization and mutual exclusion between heterogeneous processors and those not operating @@ -32,286 +35,370 @@ structure). A common hwspinlock interface makes it possible to have generic, platform- independent, drivers. -2. User API +User API +======== + +:: struct hwspinlock *hwspin_lock_request(void); - - dynamically assign an hwspinlock and return its address, or NULL - in case an unused hwspinlock isn't available. Users of this - API will usually want to communicate the lock's id to the remote core - before it can be used to achieve synchronization. - Should be called from a process context (might sleep). + +Dynamically assign an hwspinlock and return its address, or NULL +in case an unused hwspinlock isn't available. Users of this +API will usually want to communicate the lock's id to the remote core +before it can be used to achieve synchronization. + +Should be called from a process context (might sleep). + +:: struct hwspinlock *hwspin_lock_request_specific(unsigned int id); - - assign a specific hwspinlock id and return its address, or NULL - if that hwspinlock is already in use. Usually board code will - be calling this function in order to reserve specific hwspinlock - ids for predefined purposes. - Should be called from a process context (might sleep). + +Assign a specific hwspinlock id and return its address, or NULL +if that hwspinlock is already in use. Usually board code will +be calling this function in order to reserve specific hwspinlock +ids for predefined purposes. + +Should be called from a process context (might sleep). + +:: int of_hwspin_lock_get_id(struct device_node *np, int index); - - retrieve the global lock id for an OF phandle-based specific lock. - This function provides a means for DT users of a hwspinlock module - to get the global lock id of a specific hwspinlock, so that it can - be requested using the normal hwspin_lock_request_specific() API. - The function returns a lock id number on success, -EPROBE_DEFER if - the hwspinlock device is not yet registered with the core, or other - error values. - Should be called from a process context (might sleep). + +Retrieve the global lock id for an OF phandle-based specific lock. +This function provides a means for DT users of a hwspinlock module +to get the global lock id of a specific hwspinlock, so that it can +be requested using the normal hwspin_lock_request_specific() API. + +The function returns a lock id number on success, -EPROBE_DEFER if +the hwspinlock device is not yet registered with the core, or other +error values. + +Should be called from a process context (might sleep). + +:: int hwspin_lock_free(struct hwspinlock *hwlock); - - free a previously-assigned hwspinlock; returns 0 on success, or an - appropriate error code on failure (e.g. -EINVAL if the hwspinlock - is already free). - Should be called from a process context (might sleep). + +Free a previously-assigned hwspinlock; returns 0 on success, or an +appropriate error code on failure (e.g. -EINVAL if the hwspinlock +is already free). + +Should be called from a process context (might sleep). + +:: int hwspin_lock_timeout(struct hwspinlock *hwlock, unsigned int timeout); - - lock a previously-assigned hwspinlock with a timeout limit (specified in - msecs). If the hwspinlock is already taken, the function will busy loop - waiting for it to be released, but give up when the timeout elapses. - Upon a successful return from this function, preemption is disabled so - the caller must not sleep, and is advised to release the hwspinlock as - soon as possible, in order to minimize remote cores polling on the - hardware interconnect. - Returns 0 when successful and an appropriate error code otherwise (most - notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs). - The function will never sleep. + +Lock a previously-assigned hwspinlock with a timeout limit (specified in +msecs). If the hwspinlock is already taken, the function will busy loop +waiting for it to be released, but give up when the timeout elapses. +Upon a successful return from this function, preemption is disabled so +the caller must not sleep, and is advised to release the hwspinlock as +soon as possible, in order to minimize remote cores polling on the +hardware interconnect. + +Returns 0 when successful and an appropriate error code otherwise (most +notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs). +The function will never sleep. + +:: int hwspin_lock_timeout_irq(struct hwspinlock *hwlock, unsigned int timeout); - - lock a previously-assigned hwspinlock with a timeout limit (specified in - msecs). If the hwspinlock is already taken, the function will busy loop - waiting for it to be released, but give up when the timeout elapses. - Upon a successful return from this function, preemption and the local - interrupts are disabled, so the caller must not sleep, and is advised to - release the hwspinlock as soon as possible. - Returns 0 when successful and an appropriate error code otherwise (most - notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs). - The function will never sleep. + +Lock a previously-assigned hwspinlock with a timeout limit (specified in +msecs). If the hwspinlock is already taken, the function will busy loop +waiting for it to be released, but give up when the timeout elapses. +Upon a successful return from this function, preemption and the local +interrupts are disabled, so the caller must not sleep, and is advised to +release the hwspinlock as soon as possible. + +Returns 0 when successful and an appropriate error code otherwise (most +notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs). +The function will never sleep. + +:: int hwspin_lock_timeout_irqsave(struct hwspinlock *hwlock, unsigned int to, - unsigned long *flags); - - lock a previously-assigned hwspinlock with a timeout limit (specified in - msecs). If the hwspinlock is already taken, the function will busy loop - waiting for it to be released, but give up when the timeout elapses. - Upon a successful return from this function, preemption is disabled, - local interrupts are disabled and their previous state is saved at the - given flags placeholder. The caller must not sleep, and is advised to - release the hwspinlock as soon as possible. - Returns 0 when successful and an appropriate error code otherwise (most - notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs). - The function will never sleep. + unsigned long *flags); + +Lock a previously-assigned hwspinlock with a timeout limit (specified in +msecs). If the hwspinlock is already taken, the function will busy loop +waiting for it to be released, but give up when the timeout elapses. +Upon a successful return from this function, preemption is disabled, +local interrupts are disabled and their previous state is saved at the +given flags placeholder. The caller must not sleep, and is advised to +release the hwspinlock as soon as possible. + +Returns 0 when successful and an appropriate error code otherwise (most +notably -ETIMEDOUT if the hwspinlock is still busy after timeout msecs). + +The function will never sleep. + +:: int hwspin_trylock(struct hwspinlock *hwlock); - - attempt to lock a previously-assigned hwspinlock, but immediately fail if - it is already taken. - Upon a successful return from this function, preemption is disabled so - caller must not sleep, and is advised to release the hwspinlock as soon as - possible, in order to minimize remote cores polling on the hardware - interconnect. - Returns 0 on success and an appropriate error code otherwise (most - notably -EBUSY if the hwspinlock was already taken). - The function will never sleep. + + +Attempt to lock a previously-assigned hwspinlock, but immediately fail if +it is already taken. + +Upon a successful return from this function, preemption is disabled so +caller must not sleep, and is advised to release the hwspinlock as soon as +possible, in order to minimize remote cores polling on the hardware +interconnect. + +Returns 0 on success and an appropriate error code otherwise (most +notably -EBUSY if the hwspinlock was already taken). +The function will never sleep. + +:: int hwspin_trylock_irq(struct hwspinlock *hwlock); - - attempt to lock a previously-assigned hwspinlock, but immediately fail if - it is already taken. - Upon a successful return from this function, preemption and the local - interrupts are disabled so caller must not sleep, and is advised to - release the hwspinlock as soon as possible. - Returns 0 on success and an appropriate error code otherwise (most - notably -EBUSY if the hwspinlock was already taken). - The function will never sleep. + + +Attempt to lock a previously-assigned hwspinlock, but immediately fail if +it is already taken. + +Upon a successful return from this function, preemption and the local +interrupts are disabled so caller must not sleep, and is advised to +release the hwspinlock as soon as possible. + +Returns 0 on success and an appropriate error code otherwise (most +notably -EBUSY if the hwspinlock was already taken). + +The function will never sleep. + +:: int hwspin_trylock_irqsave(struct hwspinlock *hwlock, unsigned long *flags); - - attempt to lock a previously-assigned hwspinlock, but immediately fail if - it is already taken. - Upon a successful return from this function, preemption is disabled, - the local interrupts are disabled and their previous state is saved - at the given flags placeholder. The caller must not sleep, and is advised - to release the hwspinlock as soon as possible. - Returns 0 on success and an appropriate error code otherwise (most - notably -EBUSY if the hwspinlock was already taken). - The function will never sleep. + +Attempt to lock a previously-assigned hwspinlock, but immediately fail if +it is already taken. + +Upon a successful return from this function, preemption is disabled, +the local interrupts are disabled and their previous state is saved +at the given flags placeholder. The caller must not sleep, and is advised +to release the hwspinlock as soon as possible. + +Returns 0 on success and an appropriate error code otherwise (most +notably -EBUSY if the hwspinlock was already taken). +The function will never sleep. + +:: void hwspin_unlock(struct hwspinlock *hwlock); - - unlock a previously-locked hwspinlock. Always succeed, and can be called - from any context (the function never sleeps). Note: code should _never_ - unlock an hwspinlock which is already unlocked (there is no protection - against this). + +Unlock a previously-locked hwspinlock. Always succeed, and can be called +from any context (the function never sleeps). + +.. note:: + + code should **never** unlock an hwspinlock which is already unlocked + (there is no protection against this). + +:: void hwspin_unlock_irq(struct hwspinlock *hwlock); - - unlock a previously-locked hwspinlock and enable local interrupts. - The caller should _never_ unlock an hwspinlock which is already unlocked. - Doing so is considered a bug (there is no protection against this). - Upon a successful return from this function, preemption and local - interrupts are enabled. This function will never sleep. + +Unlock a previously-locked hwspinlock and enable local interrupts. +The caller should **never** unlock an hwspinlock which is already unlocked. + +Doing so is considered a bug (there is no protection against this). +Upon a successful return from this function, preemption and local +interrupts are enabled. This function will never sleep. + +:: void hwspin_unlock_irqrestore(struct hwspinlock *hwlock, unsigned long *flags); - - unlock a previously-locked hwspinlock. - The caller should _never_ unlock an hwspinlock which is already unlocked. - Doing so is considered a bug (there is no protection against this). - Upon a successful return from this function, preemption is reenabled, - and the state of the local interrupts is restored to the state saved at - the given flags. This function will never sleep. + +Unlock a previously-locked hwspinlock. + +The caller should **never** unlock an hwspinlock which is already unlocked. +Doing so is considered a bug (there is no protection against this). +Upon a successful return from this function, preemption is reenabled, +and the state of the local interrupts is restored to the state saved at +the given flags. This function will never sleep. + +:: int hwspin_lock_get_id(struct hwspinlock *hwlock); - - retrieve id number of a given hwspinlock. This is needed when an - hwspinlock is dynamically assigned: before it can be used to achieve - mutual exclusion with a remote cpu, the id number should be communicated - to the remote task with which we want to synchronize. - Returns the hwspinlock id number, or -EINVAL if hwlock is null. - -3. Typical usage - -#include <linux/hwspinlock.h> -#include <linux/err.h> - -int hwspinlock_example1(void) -{ - struct hwspinlock *hwlock; - int ret; - - /* dynamically assign a hwspinlock */ - hwlock = hwspin_lock_request(); - if (!hwlock) - ... - - id = hwspin_lock_get_id(hwlock); - /* probably need to communicate id to a remote processor now */ - - /* take the lock, spin for 1 sec if it's already taken */ - ret = hwspin_lock_timeout(hwlock, 1000); - if (ret) - ... - - /* - * we took the lock, do our thing now, but do NOT sleep - */ - - /* release the lock */ - hwspin_unlock(hwlock); - - /* free the lock */ - ret = hwspin_lock_free(hwlock); - if (ret) - ... - - return ret; -} - -int hwspinlock_example2(void) -{ - struct hwspinlock *hwlock; - int ret; - - /* - * assign a specific hwspinlock id - this should be called early - * by board init code. - */ - hwlock = hwspin_lock_request_specific(PREDEFINED_LOCK_ID); - if (!hwlock) - ... - - /* try to take it, but don't spin on it */ - ret = hwspin_trylock(hwlock); - if (!ret) { - pr_info("lock is already taken\n"); - return -EBUSY; - } - /* - * we took the lock, do our thing now, but do NOT sleep - */ +Retrieve id number of a given hwspinlock. This is needed when an +hwspinlock is dynamically assigned: before it can be used to achieve +mutual exclusion with a remote cpu, the id number should be communicated +to the remote task with which we want to synchronize. + +Returns the hwspinlock id number, or -EINVAL if hwlock is null. + +Typical usage +============= - /* release the lock */ - hwspin_unlock(hwlock); +:: - /* free the lock */ - ret = hwspin_lock_free(hwlock); - if (ret) - ... + #include <linux/hwspinlock.h> + #include <linux/err.h> - return ret; -} + int hwspinlock_example1(void) + { + struct hwspinlock *hwlock; + int ret; + /* dynamically assign a hwspinlock */ + hwlock = hwspin_lock_request(); + if (!hwlock) + ... -4. API for implementors + id = hwspin_lock_get_id(hwlock); + /* probably need to communicate id to a remote processor now */ + + /* take the lock, spin for 1 sec if it's already taken */ + ret = hwspin_lock_timeout(hwlock, 1000); + if (ret) + ... + + /* + * we took the lock, do our thing now, but do NOT sleep + */ + + /* release the lock */ + hwspin_unlock(hwlock); + + /* free the lock */ + ret = hwspin_lock_free(hwlock); + if (ret) + ... + + return ret; + } + + int hwspinlock_example2(void) + { + struct hwspinlock *hwlock; + int ret; + + /* + * assign a specific hwspinlock id - this should be called early + * by board init code. + */ + hwlock = hwspin_lock_request_specific(PREDEFINED_LOCK_ID); + if (!hwlock) + ... + + /* try to take it, but don't spin on it */ + ret = hwspin_trylock(hwlock); + if (!ret) { + pr_info("lock is already taken\n"); + return -EBUSY; + } + + /* + * we took the lock, do our thing now, but do NOT sleep + */ + + /* release the lock */ + hwspin_unlock(hwlock); + + /* free the lock */ + ret = hwspin_lock_free(hwlock); + if (ret) + ... + + return ret; + } + + +API for implementors +==================== + +:: int hwspin_lock_register(struct hwspinlock_device *bank, struct device *dev, const struct hwspinlock_ops *ops, int base_id, int num_locks); - - to be called from the underlying platform-specific implementation, in - order to register a new hwspinlock device (which is usually a bank of - numerous locks). Should be called from a process context (this function - might sleep). - Returns 0 on success, or appropriate error code on failure. + +To be called from the underlying platform-specific implementation, in +order to register a new hwspinlock device (which is usually a bank of +numerous locks). Should be called from a process context (this function +might sleep). + +Returns 0 on success, or appropriate error code on failure. + +:: int hwspin_lock_unregister(struct hwspinlock_device *bank); - - to be called from the underlying vendor-specific implementation, in order - to unregister an hwspinlock device (which is usually a bank of numerous - locks). - Should be called from a process context (this function might sleep). - Returns the address of hwspinlock on success, or NULL on error (e.g. - if the hwspinlock is still in use). -5. Important structs +To be called from the underlying vendor-specific implementation, in order +to unregister an hwspinlock device (which is usually a bank of numerous +locks). + +Should be called from a process context (this function might sleep). + +Returns the address of hwspinlock on success, or NULL on error (e.g. +if the hwspinlock is still in use). + +Important structs +================= struct hwspinlock_device is a device which usually contains a bank of hardware locks. It is registered by the underlying hwspinlock implementation using the hwspin_lock_register() API. -/** - * struct hwspinlock_device - a device which usually spans numerous hwspinlocks - * @dev: underlying device, will be used to invoke runtime PM api - * @ops: platform-specific hwspinlock handlers - * @base_id: id index of the first lock in this device - * @num_locks: number of locks in this device - * @lock: dynamically allocated array of 'struct hwspinlock' - */ -struct hwspinlock_device { - struct device *dev; - const struct hwspinlock_ops *ops; - int base_id; - int num_locks; - struct hwspinlock lock[0]; -}; +:: + + /** + * struct hwspinlock_device - a device which usually spans numerous hwspinlocks + * @dev: underlying device, will be used to invoke runtime PM api + * @ops: platform-specific hwspinlock handlers + * @base_id: id index of the first lock in this device + * @num_locks: number of locks in this device + * @lock: dynamically allocated array of 'struct hwspinlock' + */ + struct hwspinlock_device { + struct device *dev; + const struct hwspinlock_ops *ops; + int base_id; + int num_locks; + struct hwspinlock lock[0]; + }; struct hwspinlock_device contains an array of hwspinlock structs, each -of which represents a single hardware lock: - -/** - * struct hwspinlock - this struct represents a single hwspinlock instance - * @bank: the hwspinlock_device structure which owns this lock - * @lock: initialized and used by hwspinlock core - * @priv: private data, owned by the underlying platform-specific hwspinlock drv - */ -struct hwspinlock { - struct hwspinlock_device *bank; - spinlock_t lock; - void *priv; -}; +of which represents a single hardware lock:: + + /** + * struct hwspinlock - this struct represents a single hwspinlock instance + * @bank: the hwspinlock_device structure which owns this lock + * @lock: initialized and used by hwspinlock core + * @priv: private data, owned by the underlying platform-specific hwspinlock drv + */ + struct hwspinlock { + struct hwspinlock_device *bank; + spinlock_t lock; + void *priv; + }; When registering a bank of locks, the hwspinlock driver only needs to set the priv members of the locks. The rest of the members are set and initialized by the hwspinlock core itself. -6. Implementation callbacks +Implementation callbacks +======================== -There are three possible callbacks defined in 'struct hwspinlock_ops': +There are three possible callbacks defined in 'struct hwspinlock_ops':: -struct hwspinlock_ops { - int (*trylock)(struct hwspinlock *lock); - void (*unlock)(struct hwspinlock *lock); - void (*relax)(struct hwspinlock *lock); -}; + struct hwspinlock_ops { + int (*trylock)(struct hwspinlock *lock); + void (*unlock)(struct hwspinlock *lock); + void (*relax)(struct hwspinlock *lock); + }; The first two callbacks are mandatory: The ->trylock() callback should make a single attempt to take the lock, and -return 0 on failure and 1 on success. This callback may _not_ sleep. +return 0 on failure and 1 on success. This callback may **not** sleep. The ->unlock() callback releases the lock. It always succeed, and it, too, -may _not_ sleep. +may **not** sleep. The ->relax() callback is optional. It is called by hwspinlock core while spinning on a lock, and can be used by the underlying implementation to force -a delay between two successive invocations of ->trylock(). It may _not_ sleep. +a delay between two successive invocations of ->trylock(). It may **not** sleep. diff --git a/Documentation/i2c/busses/i2c-i801 b/Documentation/i2c/busses/i2c-i801 index 820d9040de16..0500193434cb 100644 --- a/Documentation/i2c/busses/i2c-i801 +++ b/Documentation/i2c/busses/i2c-i801 @@ -34,6 +34,8 @@ Supported adapters: * Intel Broxton (SOC) * Intel Lewisburg (PCH) * Intel Gemini Lake (SOC) + * Intel Cannon Lake-H (PCH) + * Intel Cannon Lake-LP (PCH) Datasheets: Publicly available at the Intel website On Intel Patsburg and later chipsets, both the normal host SMBus controller diff --git a/Documentation/i2c/dev-interface b/Documentation/i2c/dev-interface index bcf919d8625c..5ff19447ac44 100644 --- a/Documentation/i2c/dev-interface +++ b/Documentation/i2c/dev-interface @@ -191,7 +191,7 @@ checking on future transactions.) 4* Other ioctl() calls are converted to in-kernel function calls by i2c-dev. Examples include I2C_FUNCS, which queries the I2C adapter functionality using i2c.h:i2c_get_functionality(), and I2C_SMBUS, which -performs an SMBus transaction using i2c-core.c:i2c_smbus_xfer(). +performs an SMBus transaction using i2c-core-smbus.c:i2c_smbus_xfer(). The i2c-dev driver is responsible for checking all the parameters that come from user-space for validity. After this point, there is no @@ -200,13 +200,13 @@ and calls that would have been performed by kernel I2C chip drivers directly. This means that I2C bus drivers don't need to implement anything special to support access from user-space. -5* These i2c-core.c/i2c.h functions are wrappers to the actual -implementation of your I2C bus driver. Each adapter must declare -callback functions implementing these standard calls. -i2c.h:i2c_get_functionality() calls i2c_adapter.algo->functionality(), -while i2c-core.c:i2c_smbus_xfer() calls either +5* These i2c.h functions are wrappers to the actual implementation of +your I2C bus driver. Each adapter must declare callback functions +implementing these standard calls. i2c.h:i2c_get_functionality() calls +i2c_adapter.algo->functionality(), while +i2c-core-smbus.c:i2c_smbus_xfer() calls either adapter.algo->smbus_xfer() if it is implemented, or if not, -i2c-core.c:i2c_smbus_xfer_emulated() which in turn calls +i2c-core-smbus.c:i2c_smbus_xfer_emulated() which in turn calls i2c_adapter.algo->master_xfer(). After your I2C bus driver has processed these requests, execution runs diff --git a/Documentation/index.rst b/Documentation/index.rst index bc67dbf76eb0..cb7f1ba5b3b1 100644 --- a/Documentation/index.rst +++ b/Documentation/index.rst @@ -3,8 +3,8 @@ You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. -Welcome to The Linux Kernel's documentation -=========================================== +The Linux Kernel documentation +============================== This is the top level of the kernel's documentation tree. Kernel documentation, like the kernel itself, is very much a work in progress; @@ -51,6 +51,7 @@ merged much easier. process/index dev-tools/index doc-guide/index + kernel-hacking/index Kernel API documentation ------------------------ @@ -67,11 +68,24 @@ needed). driver-api/index core-api/index media/index + networking/index input/index gpu/index security/index sound/index crypto/index + filesystems/index + +Architecture-specific documentation +----------------------------------- + +These books provide programming details about architecture-specific +implementation. + +.. toctree:: + :maxdepth: 2 + + sh/index Korean translations ------------------- diff --git a/Documentation/input/index.rst b/Documentation/input/index.rst index 7a3e71c2bd00..9888f5cbf6d5 100644 --- a/Documentation/input/index.rst +++ b/Documentation/input/index.rst @@ -6,7 +6,6 @@ Contents: .. toctree:: :maxdepth: 2 - :numbered: input_uapi input_kapi diff --git a/Documentation/intel_txt.txt b/Documentation/intel_txt.txt index 91d89c540709..d83c1a2122c9 100644 --- a/Documentation/intel_txt.txt +++ b/Documentation/intel_txt.txt @@ -1,4 +1,5 @@ -Intel(R) TXT Overview: +===================== +Intel(R) TXT Overview ===================== Intel's technology for safer computing, Intel(R) Trusted Execution @@ -8,9 +9,10 @@ provide the building blocks for creating trusted platforms. Intel TXT was formerly known by the code name LaGrande Technology (LT). Intel TXT in Brief: -o Provides dynamic root of trust for measurement (DRTM) -o Data protection in case of improper shutdown -o Measurement and verification of launched environment + +- Provides dynamic root of trust for measurement (DRTM) +- Data protection in case of improper shutdown +- Measurement and verification of launched environment Intel TXT is part of the vPro(TM) brand and is also available some non-vPro systems. It is currently available on desktop systems @@ -24,16 +26,21 @@ which has been updated for the new released platforms. Intel TXT has been presented at various events over the past few years, some of which are: - LinuxTAG 2008: + + - LinuxTAG 2008: http://www.linuxtag.org/2008/en/conf/events/vp-donnerstag.html - TRUST2008: + + - TRUST2008: http://www.trust-conference.eu/downloads/Keynote-Speakers/ 3_David-Grawrock_The-Front-Door-of-Trusted-Computing.pdf - IDF, Shanghai: + + - IDF, Shanghai: http://www.prcidf.com.cn/index_en.html - IDFs 2006, 2007 (I'm not sure if/where they are online) -Trusted Boot Project Overview: + - IDFs 2006, 2007 + (I'm not sure if/where they are online) + +Trusted Boot Project Overview ============================= Trusted Boot (tboot) is an open source, pre-kernel/VMM module that @@ -87,11 +94,12 @@ Intel-provided firmware). How Does it Work? ================= -o Tboot is an executable that is launched by the bootloader as +- Tboot is an executable that is launched by the bootloader as the "kernel" (the binary the bootloader executes). -o It performs all of the work necessary to determine if the +- It performs all of the work necessary to determine if the platform supports Intel TXT and, if so, executes the GETSEC[SENTER] processor instruction that initiates the dynamic root of trust. + - If tboot determines that the system does not support Intel TXT or is not configured correctly (e.g. the SINIT AC Module was incorrect), it will directly launch the kernel with no changes @@ -99,12 +107,14 @@ o It performs all of the work necessary to determine if the - Tboot will output various information about its progress to the terminal, serial port, and/or an in-memory log; the output locations can be configured with a command line switch. -o The GETSEC[SENTER] instruction will return control to tboot and + +- The GETSEC[SENTER] instruction will return control to tboot and tboot then verifies certain aspects of the environment (e.g. TPM NV lock, e820 table does not have invalid entries, etc.). -o It will wake the APs from the special sleep state the GETSEC[SENTER] +- It will wake the APs from the special sleep state the GETSEC[SENTER] instruction had put them in and place them into a wait-for-SIPI state. + - Because the processors will not respond to an INIT or SIPI when in the TXT environment, it is necessary to create a small VT-x guest for the APs. When they run in this guest, they will @@ -112,8 +122,10 @@ o It will wake the APs from the special sleep state the GETSEC[SENTER] VMEXITs, and then disable VT and jump to the SIPI vector. This approach seemed like a better choice than having to insert special code into the kernel's MP wakeup sequence. -o Tboot then applies an (optional) user-defined launch policy to + +- Tboot then applies an (optional) user-defined launch policy to verify the kernel and initrd. + - This policy is rooted in TPM NV and is described in the tboot project. The tboot project also contains code for tools to create and provision the policy. @@ -121,30 +133,34 @@ o Tboot then applies an (optional) user-defined launch policy to then any kernel will be launched. - Policy action is flexible and can include halting on failures or simply logging them and continuing. -o Tboot adjusts the e820 table provided by the bootloader to reserve + +- Tboot adjusts the e820 table provided by the bootloader to reserve its own location in memory as well as to reserve certain other TXT-related regions. -o As part of its launch, tboot DMA protects all of RAM (using the +- As part of its launch, tboot DMA protects all of RAM (using the VT-d PMRs). Thus, the kernel must be booted with 'intel_iommu=on' in order to remove this blanket protection and use VT-d's page-level protection. -o Tboot will populate a shared page with some data about itself and +- Tboot will populate a shared page with some data about itself and pass this to the Linux kernel as it transfers control. + - The location of the shared page is passed via the boot_params struct as a physical address. -o The kernel will look for the tboot shared page address and, if it + +- The kernel will look for the tboot shared page address and, if it exists, map it. -o As one of the checks/protections provided by TXT, it makes a copy +- As one of the checks/protections provided by TXT, it makes a copy of the VT-d DMARs in a DMA-protected region of memory and verifies them for correctness. The VT-d code will detect if the kernel was launched with tboot and use this copy instead of the one in the ACPI table. -o At this point, tboot and TXT are out of the picture until a +- At this point, tboot and TXT are out of the picture until a shutdown (S<n>) -o In order to put a system into any of the sleep states after a TXT +- In order to put a system into any of the sleep states after a TXT launch, TXT must first be exited. This is to prevent attacks that attempt to crash the system to gain control on reboot and steal data left in memory. + - The kernel will perform all of its sleep preparation and populate the shared page with the ACPI data needed to put the platform in the desired sleep state. @@ -172,7 +188,7 @@ o In order to put a system into any of the sleep states after a TXT That's pretty much it for TXT support. -Configuring the System: +Configuring the System ====================== This code works with 32bit, 32bit PAE, and 64bit (x86_64) kernels. @@ -181,7 +197,8 @@ In BIOS, the user must enable: TPM, TXT, VT-x, VT-d. Not all BIOSes allow these to be individually enabled/disabled and the screens in which to find them are BIOS-specific. -grub.conf needs to be modified as follows: +grub.conf needs to be modified as follows:: + title Linux 2.6.29-tip w/ tboot root (hd0,0) kernel /tboot.gz logging=serial,vga,memory diff --git a/Documentation/io-mapping.txt b/Documentation/io-mapping.txt index 5ca78426f54c..a966239f04e4 100644 --- a/Documentation/io-mapping.txt +++ b/Documentation/io-mapping.txt @@ -1,66 +1,81 @@ +======================== +The io_mapping functions +======================== + +API +=== + The io_mapping functions in linux/io-mapping.h provide an abstraction for efficiently mapping small regions of an I/O device to the CPU. The initial usage is to support the large graphics aperture on 32-bit processors where ioremap_wc cannot be used to statically map the entire aperture to the CPU as it would consume too much of the kernel address space. -A mapping object is created during driver initialization using +A mapping object is created during driver initialization using:: struct io_mapping *io_mapping_create_wc(unsigned long base, unsigned long size) - 'base' is the bus address of the region to be made - mappable, while 'size' indicates how large a mapping region to - enable. Both are in bytes. +'base' is the bus address of the region to be made +mappable, while 'size' indicates how large a mapping region to +enable. Both are in bytes. - This _wc variant provides a mapping which may only be used - with the io_mapping_map_atomic_wc or io_mapping_map_wc. +This _wc variant provides a mapping which may only be used +with the io_mapping_map_atomic_wc or io_mapping_map_wc. With this mapping object, individual pages can be mapped either atomically or not, depending on the necessary scheduling environment. Of course, atomic -maps are more efficient: +maps are more efficient:: void *io_mapping_map_atomic_wc(struct io_mapping *mapping, unsigned long offset) - 'offset' is the offset within the defined mapping region. - Accessing addresses beyond the region specified in the - creation function yields undefined results. Using an offset - which is not page aligned yields an undefined result. The - return value points to a single page in CPU address space. +'offset' is the offset within the defined mapping region. +Accessing addresses beyond the region specified in the +creation function yields undefined results. Using an offset +which is not page aligned yields an undefined result. The +return value points to a single page in CPU address space. + +This _wc variant returns a write-combining map to the +page and may only be used with mappings created by +io_mapping_create_wc - This _wc variant returns a write-combining map to the - page and may only be used with mappings created by - io_mapping_create_wc +Note that the task may not sleep while holding this page +mapped. - Note that the task may not sleep while holding this page - mapped. +:: void io_mapping_unmap_atomic(void *vaddr) - 'vaddr' must be the value returned by the last - io_mapping_map_atomic_wc call. This unmaps the specified - page and allows the task to sleep once again. +'vaddr' must be the value returned by the last +io_mapping_map_atomic_wc call. This unmaps the specified +page and allows the task to sleep once again. If you need to sleep while holding the lock, you can use the non-atomic variant, although they may be significantly slower. +:: + void *io_mapping_map_wc(struct io_mapping *mapping, unsigned long offset) - This works like io_mapping_map_atomic_wc except it allows - the task to sleep while holding the page mapped. +This works like io_mapping_map_atomic_wc except it allows +the task to sleep while holding the page mapped. + + +:: void io_mapping_unmap(void *vaddr) - This works like io_mapping_unmap_atomic, except it is used - for pages mapped with io_mapping_map_wc. +This works like io_mapping_unmap_atomic, except it is used +for pages mapped with io_mapping_map_wc. -At driver close time, the io_mapping object must be freed: +At driver close time, the io_mapping object must be freed:: void io_mapping_free(struct io_mapping *mapping) -Current Implementation: +Current Implementation +====================== The initial implementation of these functions uses existing mapping mechanisms and so provides only an abstraction layer and no new diff --git a/Documentation/io_ordering.txt b/Documentation/io_ordering.txt index 9faae6f26d32..2ab303ce9a0d 100644 --- a/Documentation/io_ordering.txt +++ b/Documentation/io_ordering.txt @@ -1,3 +1,7 @@ +============================================== +Ordering I/O writes to memory-mapped addresses +============================================== + On some platforms, so-called memory-mapped I/O is weakly ordered. On such platforms, driver writers are responsible for ensuring that I/O writes to memory-mapped addresses on their device arrive in the order intended. This is @@ -8,39 +12,39 @@ critical section of code protected by spinlocks. This would ensure that subsequent writes to I/O space arrived only after all prior writes (much like a memory barrier op, mb(), only with respect to I/O). -A more concrete example from a hypothetical device driver: +A more concrete example from a hypothetical device driver:: - ... -CPU A: spin_lock_irqsave(&dev_lock, flags) -CPU A: val = readl(my_status); -CPU A: ... -CPU A: writel(newval, ring_ptr); -CPU A: spin_unlock_irqrestore(&dev_lock, flags) - ... -CPU B: spin_lock_irqsave(&dev_lock, flags) -CPU B: val = readl(my_status); -CPU B: ... -CPU B: writel(newval2, ring_ptr); -CPU B: spin_unlock_irqrestore(&dev_lock, flags) - ... + ... + CPU A: spin_lock_irqsave(&dev_lock, flags) + CPU A: val = readl(my_status); + CPU A: ... + CPU A: writel(newval, ring_ptr); + CPU A: spin_unlock_irqrestore(&dev_lock, flags) + ... + CPU B: spin_lock_irqsave(&dev_lock, flags) + CPU B: val = readl(my_status); + CPU B: ... + CPU B: writel(newval2, ring_ptr); + CPU B: spin_unlock_irqrestore(&dev_lock, flags) + ... In the case above, the device may receive newval2 before it receives newval, -which could cause problems. Fixing it is easy enough though: +which could cause problems. Fixing it is easy enough though:: - ... -CPU A: spin_lock_irqsave(&dev_lock, flags) -CPU A: val = readl(my_status); -CPU A: ... -CPU A: writel(newval, ring_ptr); -CPU A: (void)readl(safe_register); /* maybe a config register? */ -CPU A: spin_unlock_irqrestore(&dev_lock, flags) - ... -CPU B: spin_lock_irqsave(&dev_lock, flags) -CPU B: val = readl(my_status); -CPU B: ... -CPU B: writel(newval2, ring_ptr); -CPU B: (void)readl(safe_register); /* maybe a config register? */ -CPU B: spin_unlock_irqrestore(&dev_lock, flags) + ... + CPU A: spin_lock_irqsave(&dev_lock, flags) + CPU A: val = readl(my_status); + CPU A: ... + CPU A: writel(newval, ring_ptr); + CPU A: (void)readl(safe_register); /* maybe a config register? */ + CPU A: spin_unlock_irqrestore(&dev_lock, flags) + ... + CPU B: spin_lock_irqsave(&dev_lock, flags) + CPU B: val = readl(my_status); + CPU B: ... + CPU B: writel(newval2, ring_ptr); + CPU B: (void)readl(safe_register); /* maybe a config register? */ + CPU B: spin_unlock_irqrestore(&dev_lock, flags) Here, the reads from safe_register will cause the I/O chipset to flush any pending writes before actually posting the read to the chipset, preventing diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt index 1e9fcb4d0ec8..3e3fdae5f3ed 100644 --- a/Documentation/ioctl/ioctl-number.txt +++ b/Documentation/ioctl/ioctl-number.txt @@ -326,7 +326,7 @@ Code Seq#(hex) Include File Comments 0xB5 00-0F uapi/linux/rpmsg.h <mailto:linux-remoteproc@vger.kernel.org> 0xC0 00-0F linux/usb/iowarrior.h 0xCA 00-0F uapi/misc/cxl.h -0xCA 80-8F uapi/scsi/cxlflash_ioctl.h +0xCA 80-BF uapi/scsi/cxlflash_ioctl.h 0xCB 00-1F CBM serial IEC bus in development: <mailto:michael.klein@puffin.lb.shuttle.de> 0xCD 01 linux/reiserfs_fs.h diff --git a/Documentation/iostats.txt b/Documentation/iostats.txt index 65f694f2d1c9..04d394a2e06c 100644 --- a/Documentation/iostats.txt +++ b/Documentation/iostats.txt @@ -1,49 +1,50 @@ +===================== I/O statistics fields ---------------- +===================== Since 2.4.20 (and some versions before, with patches), and 2.5.45, more extensive disk statistics have been introduced to help measure disk -activity. Tools such as sar and iostat typically interpret these and do +activity. Tools such as ``sar`` and ``iostat`` typically interpret these and do the work for you, but in case you are interested in creating your own tools, the fields are explained here. In 2.4 now, the information is found as additional fields in -/proc/partitions. In 2.6, the same information is found in two -places: one is in the file /proc/diskstats, and the other is within +``/proc/partitions``. In 2.6 and upper, the same information is found in two +places: one is in the file ``/proc/diskstats``, and the other is within the sysfs file system, which must be mounted in order to obtain the information. Throughout this document we'll assume that sysfs -is mounted on /sys, although of course it may be mounted anywhere. -Both /proc/diskstats and sysfs use the same source for the information +is mounted on ``/sys``, although of course it may be mounted anywhere. +Both ``/proc/diskstats`` and sysfs use the same source for the information and so should not differ. -Here are examples of these different formats: +Here are examples of these different formats:: -2.4: - 3 0 39082680 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 - 3 1 9221278 hda1 35486 0 35496 38030 0 0 0 0 0 38030 38030 + 2.4: + 3 0 39082680 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 + 3 1 9221278 hda1 35486 0 35496 38030 0 0 0 0 0 38030 38030 + 2.6+ sysfs: + 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 + 35486 38030 38030 38030 -2.6 sysfs: - 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 - 35486 38030 38030 38030 + 2.6+ diskstats: + 3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 + 3 1 hda1 35486 38030 38030 38030 -2.6 diskstats: - 3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 - 3 1 hda1 35486 38030 38030 38030 +On 2.4 you might execute ``grep 'hda ' /proc/partitions``. On 2.6+, you have +a choice of ``cat /sys/block/hda/stat`` or ``grep 'hda ' /proc/diskstats``. -On 2.4 you might execute "grep 'hda ' /proc/partitions". On 2.6, you have -a choice of "cat /sys/block/hda/stat" or "grep 'hda ' /proc/diskstats". The advantage of one over the other is that the sysfs choice works well -if you are watching a known, small set of disks. /proc/diskstats may +if you are watching a known, small set of disks. ``/proc/diskstats`` may be a better choice if you are watching a large number of disks because you'll avoid the overhead of 50, 100, or 500 or more opens/closes with each snapshot of your disk statistics. In 2.4, the statistics fields are those after the device name. In the above example, the first field of statistics would be 446216. -By contrast, in 2.6 if you look at /sys/block/hda/stat, you'll +By contrast, in 2.6+ if you look at ``/sys/block/hda/stat``, you'll find just the eleven fields, beginning with 446216. If you look at -/proc/diskstats, the eleven fields will be preceded by the major and +``/proc/diskstats``, the eleven fields will be preceded by the major and minor device numbers, and device name. Each of these formats provides eleven fields of statistics, each meaning exactly the same things. All fields except field 9 are cumulative since boot. Field 9 should @@ -59,30 +60,40 @@ system-wide stats you'll have to find all the devices and sum them all up. Field 1 -- # of reads completed This is the total number of reads completed successfully. + Field 2 -- # of reads merged, field 6 -- # of writes merged Reads and writes which are adjacent to each other may be merged for efficiency. Thus two 4K reads may become one 8K read before it is ultimately handed to the disk, and so it will be counted (and queued) as only one I/O. This field lets you know how often this was done. + Field 3 -- # of sectors read This is the total number of sectors read successfully. + Field 4 -- # of milliseconds spent reading This is the total number of milliseconds spent by all reads (as measured from __make_request() to end_that_request_last()). + Field 5 -- # of writes completed This is the total number of writes completed successfully. + Field 6 -- # of writes merged See the description of field 2. + Field 7 -- # of sectors written This is the total number of sectors written successfully. + Field 8 -- # of milliseconds spent writing This is the total number of milliseconds spent by all writes (as measured from __make_request() to end_that_request_last()). + Field 9 -- # of I/Os currently in progress The only field that should go to zero. Incremented as requests are given to appropriate struct request_queue and decremented as they finish. + Field 10 -- # of milliseconds spent doing I/Os This field increases so long as field 9 is nonzero. + Field 11 -- weighted # of milliseconds spent doing I/Os This field is incremented at each I/O start, I/O completion, I/O merge, or read of these stats by the number of I/Os in progress @@ -97,7 +108,7 @@ introduced when changes collide, so (for instance) adding up all the read I/Os issued per partition should equal those made to the disks ... but due to the lack of locking it may only be very close. -In 2.6, there are counters for each CPU, which make the lack of locking +In 2.6+, there are counters for each CPU, which make the lack of locking almost a non-issue. When the statistics are read, the per-CPU counters are summed (possibly overflowing the unsigned long variable they are summed to) and the result given to the user. There is no convenient @@ -106,22 +117,25 @@ user interface for accessing the per-CPU counters themselves. Disks vs Partitions ------------------- -There were significant changes between 2.4 and 2.6 in the I/O subsystem. +There were significant changes between 2.4 and 2.6+ in the I/O subsystem. As a result, some statistic information disappeared. The translation from a disk address relative to a partition to the disk address relative to the host disk happens much earlier. All merges and timings now happen at the disk level rather than at both the disk and partition level as -in 2.4. Consequently, you'll see a different statistics output on 2.6 for +in 2.4. Consequently, you'll see a different statistics output on 2.6+ for partitions from that for disks. There are only *four* fields available -for partitions on 2.6 machines. This is reflected in the examples above. +for partitions on 2.6+ machines. This is reflected in the examples above. Field 1 -- # of reads issued This is the total number of reads issued to this partition. + Field 2 -- # of sectors read This is the total number of sectors requested to be read from this partition. + Field 3 -- # of writes issued This is the total number of writes issued to this partition. + Field 4 -- # of sectors written This is the total number of sectors requested to be written to this partition. @@ -149,16 +163,16 @@ to some (probably insignificant) inaccuracy. Additional notes ---------------- -In 2.6, sysfs is not mounted by default. If your distribution of +In 2.6+, sysfs is not mounted by default. If your distribution of Linux hasn't added it already, here's the line you'll want to add to -your /etc/fstab: +your ``/etc/fstab``:: -none /sys sysfs defaults 0 0 + none /sys sysfs defaults 0 0 -In 2.6, all disk statistics were removed from /proc/stat. In 2.4, they -appear in both /proc/partitions and /proc/stat, although the ones in -/proc/stat take a very different format from those in /proc/partitions +In 2.6+, all disk statistics were removed from ``/proc/stat``. In 2.4, they +appear in both ``/proc/partitions`` and ``/proc/stat``, although the ones in +``/proc/stat`` take a very different format from those in ``/proc/partitions`` (see proc(5), if your system has it.) -- ricklind@us.ibm.com diff --git a/Documentation/irqflags-tracing.txt b/Documentation/irqflags-tracing.txt index f6da05670e16..bdd208259fb3 100644 --- a/Documentation/irqflags-tracing.txt +++ b/Documentation/irqflags-tracing.txt @@ -1,8 +1,10 @@ +======================= IRQ-flags state tracing +======================= -started by Ingo Molnar <mingo@redhat.com> +:Author: started by Ingo Molnar <mingo@redhat.com> -the "irq-flags tracing" feature "traces" hardirq and softirq state, in +The "irq-flags tracing" feature "traces" hardirq and softirq state, in that it gives interested subsystems an opportunity to be notified of every hardirqs-off/hardirqs-on, softirqs-off/softirqs-on event that happens in the kernel. @@ -14,7 +16,7 @@ CONFIG_PROVE_RWSEM_LOCKING will be offered on an architecture - these are locking APIs that are not used in IRQ context. (the one exception for rwsems is worked around) -architecture support for this is certainly not in the "trivial" +Architecture support for this is certainly not in the "trivial" category, because lots of lowlevel assembly code deal with irq-flags state changes. But an architecture can be irq-flags-tracing enabled in a rather straightforward and risk-free manner. @@ -41,7 +43,7 @@ irq-flags-tracing support: excluded from the irq-tracing [and lock validation] mechanism via lockdep_off()/lockdep_on(). -in general there is no risk from having an incomplete irq-flags-tracing +In general there is no risk from having an incomplete irq-flags-tracing implementation in an architecture: lockdep will detect that and will turn itself off. I.e. the lock validator will still be reliable. There should be no crashes due to irq-tracing bugs. (except if the assembly diff --git a/Documentation/isa.txt b/Documentation/isa.txt index f232c26a40be..def4a7b690b5 100644 --- a/Documentation/isa.txt +++ b/Documentation/isa.txt @@ -1,5 +1,6 @@ +=========== ISA Drivers ------------ +=========== The following text is adapted from the commit message of the initial commit of the ISA bus driver authored by Rene Herman. @@ -23,17 +24,17 @@ that all device creation has been made internal as well. The usage model this provides is nice, and has been acked from the ALSA side by Takashi Iwai and Jaroslav Kysela. The ALSA driver module_init's -now (for oldisa-only drivers) become: +now (for oldisa-only drivers) become:: -static int __init alsa_card_foo_init(void) -{ - return isa_register_driver(&snd_foo_isa_driver, SNDRV_CARDS); -} + static int __init alsa_card_foo_init(void) + { + return isa_register_driver(&snd_foo_isa_driver, SNDRV_CARDS); + } -static void __exit alsa_card_foo_exit(void) -{ - isa_unregister_driver(&snd_foo_isa_driver); -} + static void __exit alsa_card_foo_exit(void) + { + isa_unregister_driver(&snd_foo_isa_driver); + } Quite like the other bus models therefore. This removes a lot of duplicated init code from the ALSA ISA drivers. @@ -47,11 +48,11 @@ parameter, indicating how many devices to create and call our methods with. The platform_driver callbacks are called with a platform_device param; -the isa_driver callbacks are being called with a "struct device *dev, -unsigned int id" pair directly -- with the device creation completely +the isa_driver callbacks are being called with a ``struct device *dev, +unsigned int id`` pair directly -- with the device creation completely internal to the bus it's much cleaner to not leak isa_dev's by passing them in at all. The id is the only thing we ever want other then the -struct device * anyways, and it makes for nicer code in the callbacks as +struct device anyways, and it makes for nicer code in the callbacks as well. With this additional .match() callback ISA drivers have all options. If @@ -75,20 +76,20 @@ This exports only two functions; isa_{,un}register_driver(). isa_register_driver() register's the struct device_driver, and then loops over the passed in ndev creating devices and registering them. -This causes the bus match method to be called for them, which is: +This causes the bus match method to be called for them, which is:: -int isa_bus_match(struct device *dev, struct device_driver *driver) -{ - struct isa_driver *isa_driver = to_isa_driver(driver); + int isa_bus_match(struct device *dev, struct device_driver *driver) + { + struct isa_driver *isa_driver = to_isa_driver(driver); - if (dev->platform_data == isa_driver) { - if (!isa_driver->match || - isa_driver->match(dev, to_isa_dev(dev)->id)) - return 1; - dev->platform_data = NULL; - } - return 0; -} + if (dev->platform_data == isa_driver) { + if (!isa_driver->match || + isa_driver->match(dev, to_isa_dev(dev)->id)) + return 1; + dev->platform_data = NULL; + } + return 0; + } The first thing this does is check if this device is in fact one of this driver's devices by seeing if the device's platform_data pointer is set @@ -102,7 +103,7 @@ well. Then, if the the driver did not provide a .match, it matches. If it did, the driver match() method is called to determine a match. -If it did _not_ match, dev->platform_data is reset to indicate this to +If it did **not** match, dev->platform_data is reset to indicate this to isa_register_driver which can then unregister the device again. If during all this, there's any error, or no devices matched at all diff --git a/Documentation/isapnp.txt b/Documentation/isapnp.txt index 400d1b5b523d..8d0840ac847b 100644 --- a/Documentation/isapnp.txt +++ b/Documentation/isapnp.txt @@ -1,3 +1,4 @@ +========================================================== ISA Plug & Play support by Jaroslav Kysela <perex@suse.cz> ========================================================== diff --git a/Documentation/kbuild/kbuild.txt b/Documentation/kbuild/kbuild.txt index 0ff6a466a05b..ac2363ea05c5 100644 --- a/Documentation/kbuild/kbuild.txt +++ b/Documentation/kbuild/kbuild.txt @@ -236,5 +236,9 @@ Files specified with KBUILD_VMLINUX_INIT are linked first. KBUILD_VMLINUX_MAIN -------------------------------------------------- All object files for the main part of vmlinux. -KBUILD_VMLINUX_INIT and KBUILD_VMLINUX_MAIN together specify -all the object files used to link vmlinux. + +KBUILD_VMLINUX_LIBS +-------------------------------------------------- +All .a "lib" files for vmlinux. +KBUILD_VMLINUX_INIT, KBUILD_VMLINUX_MAIN, and KBUILD_VMLINUX_LIBS together +specify all the object files used to link vmlinux. diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt index e18daca65ccd..7003141a6d4f 100644 --- a/Documentation/kbuild/makefiles.txt +++ b/Documentation/kbuild/makefiles.txt @@ -45,10 +45,9 @@ This document describes the Linux kernel Makefiles. === 7 Kbuild syntax for exported headers --- 7.1 no-export-headers - --- 7.2 genhdr-y - --- 7.3 generic-y - --- 7.4 generated-y - --- 7.5 mandatory-y + --- 7.2 generic-y + --- 7.3 generated-y + --- 7.4 mandatory-y === 8 Kbuild Variables === 9 Makefile language @@ -487,22 +486,6 @@ more details, with real examples. respectively. Note: cc-option-yn uses KBUILD_CFLAGS for $(CC) options - cc-option-align - gcc versions >= 3.0 changed the type of options used to specify - alignment of functions, loops etc. $(cc-option-align), when used - as prefix to the align options, will select the right prefix: - gcc < 3.00 - cc-option-align = -malign - gcc >= 3.00 - cc-option-align = -falign - - Example: - KBUILD_CFLAGS += $(cc-option-align)-functions=4 - - In the above example, the option -falign-functions=4 is used for - gcc >= 3.00. For gcc < 3.00, -malign-functions=4 is used. - Note: cc-option-align uses KBUILD_CFLAGS for $(CC) options - cc-disable-warning cc-disable-warning checks if gcc supports a given warning and returns the commandline switch to disable it. This special function is needed, @@ -1277,18 +1260,7 @@ See subsequent chapter for the syntax of the Kbuild file. avoid exporting specific headers (e.g. kvm.h) on architectures that do not support it. It should be avoided as much as possible. - --- 7.2 genhdr-y - - genhdr-y specifies asm files to be generated. - - Example: - #arch/x86/include/uapi/asm/Kbuild - genhdr-y += unistd_32.h - genhdr-y += unistd_64.h - genhdr-y += unistd_x32.h - - - --- 7.3 generic-y + --- 7.2 generic-y If an architecture uses a verbatim copy of a header from include/asm-generic then this is listed in the file @@ -1315,11 +1287,10 @@ See subsequent chapter for the syntax of the Kbuild file. Example: termios.h #include <asm-generic/termios.h> - --- 7.4 generated-y + --- 7.3 generated-y If an architecture generates other header files alongside generic-y - wrappers, and not included in genhdr-y, then generated-y specifies - them. + wrappers, generated-y specifies them. This prevents them being treated as stale asm-generic wrappers and removed. @@ -1331,7 +1302,7 @@ See subsequent chapter for the syntax of the Kbuild file. --- 7.5 mandatory-y mandatory-y is essentially used by include/uapi/asm-generic/Kbuild.asm - to define the minimun set of headers that must be exported in + to define the minimum set of headers that must be exported in include/asm. The convention is to list one subdir per line and diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt index 615434d81108..51814450a7f8 100644 --- a/Documentation/kdump/kdump.txt +++ b/Documentation/kdump/kdump.txt @@ -112,8 +112,8 @@ There are two possible methods of using Kdump. 2) Or use the system kernel binary itself as dump-capture kernel and there is no need to build a separate dump-capture kernel. This is possible only with the architectures which support a relocatable kernel. As - of today, i386, x86_64, ppc64, ia64 and arm architectures support relocatable - kernel. + of today, i386, x86_64, ppc64, ia64, arm and arm64 architectures support + relocatable kernel. Building a relocatable kernel is advantageous from the point of view that one does not have to build a second kernel for capturing the dump. But @@ -339,7 +339,7 @@ For arm: For arm64: - Use vmlinux or Image -If you are using a uncompressed vmlinux image then use following command +If you are using an uncompressed vmlinux image then use following command to load dump-capture kernel. kexec -p <dump-capture-kernel-vmlinux-image> \ @@ -361,6 +361,12 @@ to load dump-capture kernel. --dtb=<dtb-for-dump-capture-kernel> \ --append="root=<root-dev> <arch-specific-options>" +If you are using an uncompressed Image, then use following command +to load dump-capture kernel. + + kexec -p <dump-capture-kernel-Image> \ + --initrd=<initrd-for-dump-capture-kernel> \ + --append="root=<root-dev> <arch-specific-options>" Please note, that --args-linux does not need to be specified for ia64. It is planned to make this a no-op on that architecture, but for now diff --git a/Documentation/kernel-doc-nano-HOWTO.txt b/Documentation/kernel-doc-nano-HOWTO.txt index 104740ea0041..c23e2c5ab80d 100644 --- a/Documentation/kernel-doc-nano-HOWTO.txt +++ b/Documentation/kernel-doc-nano-HOWTO.txt @@ -17,8 +17,8 @@ The format for this documentation is called the kernel-doc format. It is documented in this Documentation/kernel-doc-nano-HOWTO.txt file. This style embeds the documentation within the source files, using -a few simple conventions. The scripts/kernel-doc perl script, some -SGML templates in Documentation/DocBook, and other tools understand +a few simple conventions. The scripts/kernel-doc perl script, the +Documentation/sphinx/kerneldoc.py Sphinx extension and other tools understand these conventions, and are used to extract this embedded documentation into various documents. @@ -122,15 +122,9 @@ are: - scripts/kernel-doc This is a perl script that hunts for the block comments and can mark - them up directly into DocBook, man, text, and HTML. (No, not + them up directly into DocBook, ReST, man, text, and HTML. (No, not texinfo.) -- Documentation/DocBook/*.tmpl - - These are SGML template files, which are normal SGML files with - special place-holders for where the extracted documentation should - go. - - scripts/docproc.c This is a program for converting SGML template files into SGML @@ -145,25 +139,18 @@ are: - Makefile - The targets 'xmldocs', 'psdocs', 'pdfdocs', and 'htmldocs' are used - to build XML DocBook files, PostScript files, PDF files, and html files - in Documentation/DocBook. The older target 'sgmldocs' is equivalent - to 'xmldocs'. - -- Documentation/DocBook/Makefile - - This is where C files are associated with SGML templates. - + The targets 'xmldocs', 'latexdocs', 'pdfdocs', 'epubdocs'and 'htmldocs' + are used to build XML DocBook files, LaTeX files, PDF files, + ePub files and html files in Documentation/. How to extract the documentation -------------------------------- If you just want to read the ready-made books on the various -subsystems (see Documentation/DocBook/*.tmpl), just type 'make -psdocs', or 'make pdfdocs', or 'make htmldocs', depending on your -preference. If you would rather read a different format, you can type -'make xmldocs' and then use DocBook tools to convert -Documentation/DocBook/*.xml to a format of your choice (for example, +subsystems, just type 'make epubdocs', or 'make pdfdocs', or 'make htmldocs', +depending on your preference. If you would rather read a different format, +you can type 'make xmldocs' and then use DocBook tools to convert +Documentation/output/*.xml to a format of your choice (for example, 'db2html ...' if 'make htmldocs' was not defined). If you want to see man pages instead, you can do this: @@ -329,37 +316,7 @@ This is done by using a DOC: section keyword with a section title. E.g.: * hardware, software, or its subject(s). */ -DOC: sections are used in SGML templates files as indicated below. - - -How to make new SGML template files ------------------------------------ - -SGML template files (*.tmpl) are like normal SGML files, except that -they can contain escape sequences where extracted documentation should -be inserted. - -!E<filename> is replaced by the documentation, in <filename>, for -functions that are exported using EXPORT_SYMBOL: the function list is -collected from files listed in Documentation/DocBook/Makefile. - -!I<filename> is replaced by the documentation for functions that are -_not_ exported using EXPORT_SYMBOL. - -!D<filename> is used to name additional files to search for functions -exported using EXPORT_SYMBOL. - -!F<filename> <function [functions...]> is replaced by the -documentation, in <filename>, for the functions listed. - -!P<filename> <section title> is replaced by the contents of the DOC: -section titled <section title> from <filename>. -Spaces are allowed in <section title>; do not quote the <section title>. - -!C<filename> is replaced by nothing, but makes the tools check that -all DOC: sections and documented functions, symbols, etc. are used. -This makes sense to use when you use !F/!P only and want to verify -that all documentation is included. +DOC: sections are used in ReST files. Tim. */ <twaugh@redhat.com> diff --git a/Documentation/kernel-hacking/conf.py b/Documentation/kernel-hacking/conf.py new file mode 100644 index 000000000000..3d8acf0f33ad --- /dev/null +++ b/Documentation/kernel-hacking/conf.py @@ -0,0 +1,10 @@ +# -*- coding: utf-8; mode: python -*- + +project = "Kernel Hacking Guides" + +tags.add("subproject") + +latex_documents = [ + ('index', 'kernel-hacking.tex', project, + 'The kernel development community', 'manual'), +] diff --git a/Documentation/kernel-hacking/hacking.rst b/Documentation/kernel-hacking/hacking.rst new file mode 100644 index 000000000000..daf3883b2694 --- /dev/null +++ b/Documentation/kernel-hacking/hacking.rst @@ -0,0 +1,811 @@ +============================================ +Unreliable Guide To Hacking The Linux Kernel +============================================ + +:Author: Rusty Russell + +Introduction +============ + +Welcome, gentle reader, to Rusty's Remarkably Unreliable Guide to Linux +Kernel Hacking. This document describes the common routines and general +requirements for kernel code: its goal is to serve as a primer for Linux +kernel development for experienced C programmers. I avoid implementation +details: that's what the code is for, and I ignore whole tracts of +useful routines. + +Before you read this, please understand that I never wanted to write +this document, being grossly under-qualified, but I always wanted to +read it, and this was the only way. I hope it will grow into a +compendium of best practice, common starting points and random +information. + +The Players +=========== + +At any time each of the CPUs in a system can be: + +- not associated with any process, serving a hardware interrupt; + +- not associated with any process, serving a softirq or tasklet; + +- running in kernel space, associated with a process (user context); + +- running a process in user space. + +There is an ordering between these. The bottom two can preempt each +other, but above that is a strict hierarchy: each can only be preempted +by the ones above it. For example, while a softirq is running on a CPU, +no other softirq will preempt it, but a hardware interrupt can. However, +any other CPUs in the system execute independently. + +We'll see a number of ways that the user context can block interrupts, +to become truly non-preemptable. + +User Context +------------ + +User context is when you are coming in from a system call or other trap: +like userspace, you can be preempted by more important tasks and by +interrupts. You can sleep, by calling :c:func:`schedule()`. + +.. note:: + + You are always in user context on module load and unload, and on + operations on the block device layer. + +In user context, the ``current`` pointer (indicating the task we are +currently executing) is valid, and :c:func:`in_interrupt()` +(``include/linux/preempt.h``) is false. + +.. warning:: + + Beware that if you have preemption or softirqs disabled (see below), + :c:func:`in_interrupt()` will return a false positive. + +Hardware Interrupts (Hard IRQs) +------------------------------- + +Timer ticks, network cards and keyboard are examples of real hardware +which produce interrupts at any time. The kernel runs interrupt +handlers, which services the hardware. The kernel guarantees that this +handler is never re-entered: if the same interrupt arrives, it is queued +(or dropped). Because it disables interrupts, this handler has to be +fast: frequently it simply acknowledges the interrupt, marks a 'software +interrupt' for execution and exits. + +You can tell you are in a hardware interrupt, because +:c:func:`in_irq()` returns true. + +.. warning:: + + Beware that this will return a false positive if interrupts are + disabled (see below). + +Software Interrupt Context: Softirqs and Tasklets +------------------------------------------------- + +Whenever a system call is about to return to userspace, or a hardware +interrupt handler exits, any 'software interrupts' which are marked +pending (usually by hardware interrupts) are run (``kernel/softirq.c``). + +Much of the real interrupt handling work is done here. Early in the +transition to SMP, there were only 'bottom halves' (BHs), which didn't +take advantage of multiple CPUs. Shortly after we switched from wind-up +computers made of match-sticks and snot, we abandoned this limitation +and switched to 'softirqs'. + +``include/linux/interrupt.h`` lists the different softirqs. A very +important softirq is the timer softirq (``include/linux/timer.h``): you +can register to have it call functions for you in a given length of +time. + +Softirqs are often a pain to deal with, since the same softirq will run +simultaneously on more than one CPU. For this reason, tasklets +(``include/linux/interrupt.h``) are more often used: they are +dynamically-registrable (meaning you can have as many as you want), and +they also guarantee that any tasklet will only run on one CPU at any +time, although different tasklets can run simultaneously. + +.. warning:: + + The name 'tasklet' is misleading: they have nothing to do with + 'tasks', and probably more to do with some bad vodka Alexey + Kuznetsov had at the time. + +You can tell you are in a softirq (or tasklet) using the +:c:func:`in_softirq()` macro (``include/linux/preempt.h``). + +.. warning:: + + Beware that this will return a false positive if a + :ref:`botton half lock <local_bh_disable>` is held. + +Some Basic Rules +================ + +No memory protection + If you corrupt memory, whether in user context or interrupt context, + the whole machine will crash. Are you sure you can't do what you + want in userspace? + +No floating point or MMX + The FPU context is not saved; even in user context the FPU state + probably won't correspond with the current process: you would mess + with some user process' FPU state. If you really want to do this, + you would have to explicitly save/restore the full FPU state (and + avoid context switches). It is generally a bad idea; use fixed point + arithmetic first. + +A rigid stack limit + Depending on configuration options the kernel stack is about 3K to + 6K for most 32-bit architectures: it's about 14K on most 64-bit + archs, and often shared with interrupts so you can't use it all. + Avoid deep recursion and huge local arrays on the stack (allocate + them dynamically instead). + +The Linux kernel is portable + Let's keep it that way. Your code should be 64-bit clean, and + endian-independent. You should also minimize CPU specific stuff, + e.g. inline assembly should be cleanly encapsulated and minimized to + ease porting. Generally it should be restricted to the + architecture-dependent part of the kernel tree. + +ioctls: Not writing a new system call +===================================== + +A system call generally looks like this:: + + asmlinkage long sys_mycall(int arg) + { + return 0; + } + + +First, in most cases you don't want to create a new system call. You +create a character device and implement an appropriate ioctl for it. +This is much more flexible than system calls, doesn't have to be entered +in every architecture's ``include/asm/unistd.h`` and +``arch/kernel/entry.S`` file, and is much more likely to be accepted by +Linus. + +If all your routine does is read or write some parameter, consider +implementing a :c:func:`sysfs()` interface instead. + +Inside the ioctl you're in user context to a process. When a error +occurs you return a negated errno (see +``include/uapi/asm-generic/errno-base.h``, +``include/uapi/asm-generic/errno.h`` and ``include/linux/errno.h``), +otherwise you return 0. + +After you slept you should check if a signal occurred: the Unix/Linux +way of handling signals is to temporarily exit the system call with the +``-ERESTARTSYS`` error. The system call entry code will switch back to +user context, process the signal handler and then your system call will +be restarted (unless the user disabled that). So you should be prepared +to process the restart, e.g. if you're in the middle of manipulating +some data structure. + +:: + + if (signal_pending(current)) + return -ERESTARTSYS; + + +If you're doing longer computations: first think userspace. If you +**really** want to do it in kernel you should regularly check if you need +to give up the CPU (remember there is cooperative multitasking per CPU). +Idiom:: + + cond_resched(); /* Will sleep */ + + +A short note on interface design: the UNIX system call motto is "Provide +mechanism not policy". + +Recipes for Deadlock +==================== + +You cannot call any routines which may sleep, unless: + +- You are in user context. + +- You do not own any spinlocks. + +- You have interrupts enabled (actually, Andi Kleen says that the + scheduling code will enable them for you, but that's probably not + what you wanted). + +Note that some functions may sleep implicitly: common ones are the user +space access functions (\*_user) and memory allocation functions +without ``GFP_ATOMIC``. + +You should always compile your kernel ``CONFIG_DEBUG_ATOMIC_SLEEP`` on, +and it will warn you if you break these rules. If you **do** break the +rules, you will eventually lock up your box. + +Really. + +Common Routines +=============== + +:c:func:`printk()` +------------------ + +Defined in ``include/linux/printk.h`` + +:c:func:`printk()` feeds kernel messages to the console, dmesg, and +the syslog daemon. It is useful for debugging and reporting errors, and +can be used inside interrupt context, but use with caution: a machine +which has its console flooded with printk messages is unusable. It uses +a format string mostly compatible with ANSI C printf, and C string +concatenation to give it a first "priority" argument:: + + printk(KERN_INFO "i = %u\n", i); + + +See ``include/linux/kern_levels.h``; for other ``KERN_`` values; these are +interpreted by syslog as the level. Special case: for printing an IP +address use:: + + __be32 ipaddress; + printk(KERN_INFO "my ip: %pI4\n", &ipaddress); + + +:c:func:`printk()` internally uses a 1K buffer and does not catch +overruns. Make sure that will be enough. + +.. note:: + + You will know when you are a real kernel hacker when you start + typoing printf as printk in your user programs :) + +.. note:: + + Another sidenote: the original Unix Version 6 sources had a comment + on top of its printf function: "Printf should not be used for + chit-chat". You should follow that advice. + +:c:func:`copy_to_user()` / :c:func:`copy_from_user()` / :c:func:`get_user()` / :c:func:`put_user()` +--------------------------------------------------------------------------------------------------- + +Defined in ``include/linux/uaccess.h`` / ``asm/uaccess.h`` + +**[SLEEPS]** + +:c:func:`put_user()` and :c:func:`get_user()` are used to get +and put single values (such as an int, char, or long) from and to +userspace. A pointer into userspace should never be simply dereferenced: +data should be copied using these routines. Both return ``-EFAULT`` or +0. + +:c:func:`copy_to_user()` and :c:func:`copy_from_user()` are +more general: they copy an arbitrary amount of data to and from +userspace. + +.. warning:: + + Unlike :c:func:`put_user()` and :c:func:`get_user()`, they + return the amount of uncopied data (ie. 0 still means success). + +[Yes, this moronic interface makes me cringe. The flamewar comes up +every year or so. --RR.] + +The functions may sleep implicitly. This should never be called outside +user context (it makes no sense), with interrupts disabled, or a +spinlock held. + +:c:func:`kmalloc()`/:c:func:`kfree()` +------------------------------------- + +Defined in ``include/linux/slab.h`` + +**[MAY SLEEP: SEE BELOW]** + +These routines are used to dynamically request pointer-aligned chunks of +memory, like malloc and free do in userspace, but +:c:func:`kmalloc()` takes an extra flag word. Important values: + +``GFP_KERNEL`` + May sleep and swap to free memory. Only allowed in user context, but + is the most reliable way to allocate memory. + +``GFP_ATOMIC`` + Don't sleep. Less reliable than ``GFP_KERNEL``, but may be called + from interrupt context. You should **really** have a good + out-of-memory error-handling strategy. + +``GFP_DMA`` + Allocate ISA DMA lower than 16MB. If you don't know what that is you + don't need it. Very unreliable. + +If you see a sleeping function called from invalid context warning +message, then maybe you called a sleeping allocation function from +interrupt context without ``GFP_ATOMIC``. You should really fix that. +Run, don't walk. + +If you are allocating at least ``PAGE_SIZE`` (``asm/page.h`` or +``asm/page_types.h``) bytes, consider using :c:func:`__get_free_pages()` +(``include/linux/gfp.h``). It takes an order argument (0 for page sized, +1 for double page, 2 for four pages etc.) and the same memory priority +flag word as above. + +If you are allocating more than a page worth of bytes you can use +:c:func:`vmalloc()`. It'll allocate virtual memory in the kernel +map. This block is not contiguous in physical memory, but the MMU makes +it look like it is for you (so it'll only look contiguous to the CPUs, +not to external device drivers). If you really need large physically +contiguous memory for some weird device, you have a problem: it is +poorly supported in Linux because after some time memory fragmentation +in a running kernel makes it hard. The best way is to allocate the block +early in the boot process via the :c:func:`alloc_bootmem()` +routine. + +Before inventing your own cache of often-used objects consider using a +slab cache in ``include/linux/slab.h`` + +:c:func:`current()` +------------------- + +Defined in ``include/asm/current.h`` + +This global variable (really a macro) contains a pointer to the current +task structure, so is only valid in user context. For example, when a +process makes a system call, this will point to the task structure of +the calling process. It is **not NULL** in interrupt context. + +:c:func:`mdelay()`/:c:func:`udelay()` +------------------------------------- + +Defined in ``include/asm/delay.h`` / ``include/linux/delay.h`` + +The :c:func:`udelay()` and :c:func:`ndelay()` functions can be +used for small pauses. Do not use large values with them as you risk +overflow - the helper function :c:func:`mdelay()` is useful here, or +consider :c:func:`msleep()`. + +:c:func:`cpu_to_be32()`/:c:func:`be32_to_cpu()`/:c:func:`cpu_to_le32()`/:c:func:`le32_to_cpu()` +----------------------------------------------------------------------------------------------- + +Defined in ``include/asm/byteorder.h`` + +The :c:func:`cpu_to_be32()` family (where the "32" can be replaced +by 64 or 16, and the "be" can be replaced by "le") are the general way +to do endian conversions in the kernel: they return the converted value. +All variations supply the reverse as well: +:c:func:`be32_to_cpu()`, etc. + +There are two major variations of these functions: the pointer +variation, such as :c:func:`cpu_to_be32p()`, which take a pointer +to the given type, and return the converted value. The other variation +is the "in-situ" family, such as :c:func:`cpu_to_be32s()`, which +convert value referred to by the pointer, and return void. + +:c:func:`local_irq_save()`/:c:func:`local_irq_restore()` +-------------------------------------------------------- + +Defined in ``include/linux/irqflags.h`` + +These routines disable hard interrupts on the local CPU, and restore +them. They are reentrant; saving the previous state in their one +``unsigned long flags`` argument. If you know that interrupts are +enabled, you can simply use :c:func:`local_irq_disable()` and +:c:func:`local_irq_enable()`. + +.. _local_bh_disable: + +:c:func:`local_bh_disable()`/:c:func:`local_bh_enable()` +-------------------------------------------------------- + +Defined in ``include/linux/bottom_half.h`` + + +These routines disable soft interrupts on the local CPU, and restore +them. They are reentrant; if soft interrupts were disabled before, they +will still be disabled after this pair of functions has been called. +They prevent softirqs and tasklets from running on the current CPU. + +:c:func:`smp_processor_id()` +---------------------------- + +Defined in ``include/linux/smp.h`` + +:c:func:`get_cpu()` disables preemption (so you won't suddenly get +moved to another CPU) and returns the current processor number, between +0 and ``NR_CPUS``. Note that the CPU numbers are not necessarily +continuous. You return it again with :c:func:`put_cpu()` when you +are done. + +If you know you cannot be preempted by another task (ie. you are in +interrupt context, or have preemption disabled) you can use +smp_processor_id(). + +``__init``/``__exit``/``__initdata`` +------------------------------------ + +Defined in ``include/linux/init.h`` + +After boot, the kernel frees up a special section; functions marked with +``__init`` and data structures marked with ``__initdata`` are dropped +after boot is complete: similarly modules discard this memory after +initialization. ``__exit`` is used to declare a function which is only +required on exit: the function will be dropped if this file is not +compiled as a module. See the header file for use. Note that it makes no +sense for a function marked with ``__init`` to be exported to modules +with :c:func:`EXPORT_SYMBOL()` or :c:func:`EXPORT_SYMBOL_GPL()`- this +will break. + +:c:func:`__initcall()`/:c:func:`module_init()` +---------------------------------------------- + +Defined in ``include/linux/init.h`` / ``include/linux/module.h`` + +Many parts of the kernel are well served as a module +(dynamically-loadable parts of the kernel). Using the +:c:func:`module_init()` and :c:func:`module_exit()` macros it +is easy to write code without #ifdefs which can operate both as a module +or built into the kernel. + +The :c:func:`module_init()` macro defines which function is to be +called at module insertion time (if the file is compiled as a module), +or at boot time: if the file is not compiled as a module the +:c:func:`module_init()` macro becomes equivalent to +:c:func:`__initcall()`, which through linker magic ensures that +the function is called on boot. + +The function can return a negative error number to cause module loading +to fail (unfortunately, this has no effect if the module is compiled +into the kernel). This function is called in user context with +interrupts enabled, so it can sleep. + +:c:func:`module_exit()` +----------------------- + + +Defined in ``include/linux/module.h`` + +This macro defines the function to be called at module removal time (or +never, in the case of the file compiled into the kernel). It will only +be called if the module usage count has reached zero. This function can +also sleep, but cannot fail: everything must be cleaned up by the time +it returns. + +Note that this macro is optional: if it is not present, your module will +not be removable (except for 'rmmod -f'). + +:c:func:`try_module_get()`/:c:func:`module_put()` +------------------------------------------------- + +Defined in ``include/linux/module.h`` + +These manipulate the module usage count, to protect against removal (a +module also can't be removed if another module uses one of its exported +symbols: see below). Before calling into module code, you should call +:c:func:`try_module_get()` on that module: if it fails, then the +module is being removed and you should act as if it wasn't there. +Otherwise, you can safely enter the module, and call +:c:func:`module_put()` when you're finished. + +Most registerable structures have an owner field, such as in the +:c:type:`struct file_operations <file_operations>` structure. +Set this field to the macro ``THIS_MODULE``. + +Wait Queues ``include/linux/wait.h`` +==================================== + +**[SLEEPS]** + +A wait queue is used to wait for someone to wake you up when a certain +condition is true. They must be used carefully to ensure there is no +race condition. You declare a :c:type:`wait_queue_head_t`, and then processes +which want to wait for that condition declare a :c:type:`wait_queue_entry_t` +referring to themselves, and place that in the queue. + +Declaring +--------- + +You declare a ``wait_queue_head_t`` using the +:c:func:`DECLARE_WAIT_QUEUE_HEAD()` macro, or using the +:c:func:`init_waitqueue_head()` routine in your initialization +code. + +Queuing +------- + +Placing yourself in the waitqueue is fairly complex, because you must +put yourself in the queue before checking the condition. There is a +macro to do this: :c:func:`wait_event_interruptible()` +(``include/linux/wait.h``) The first argument is the wait queue head, and +the second is an expression which is evaluated; the macro returns 0 when +this expression is true, or ``-ERESTARTSYS`` if a signal is received. The +:c:func:`wait_event()` version ignores signals. + +Waking Up Queued Tasks +---------------------- + +Call :c:func:`wake_up()` (``include/linux/wait.h``);, which will wake +up every process in the queue. The exception is if one has +``TASK_EXCLUSIVE`` set, in which case the remainder of the queue will +not be woken. There are other variants of this basic function available +in the same header. + +Atomic Operations +================= + +Certain operations are guaranteed atomic on all platforms. The first +class of operations work on :c:type:`atomic_t` (``include/asm/atomic.h``); +this contains a signed integer (at least 32 bits long), and you must use +these functions to manipulate or read :c:type:`atomic_t` variables. +:c:func:`atomic_read()` and :c:func:`atomic_set()` get and set +the counter, :c:func:`atomic_add()`, :c:func:`atomic_sub()`, +:c:func:`atomic_inc()`, :c:func:`atomic_dec()`, and +:c:func:`atomic_dec_and_test()` (returns true if it was +decremented to zero). + +Yes. It returns true (i.e. != 0) if the atomic variable is zero. + +Note that these functions are slower than normal arithmetic, and so +should not be used unnecessarily. + +The second class of atomic operations is atomic bit operations on an +``unsigned long``, defined in ``include/linux/bitops.h``. These +operations generally take a pointer to the bit pattern, and a bit +number: 0 is the least significant bit. :c:func:`set_bit()`, +:c:func:`clear_bit()` and :c:func:`change_bit()` set, clear, +and flip the given bit. :c:func:`test_and_set_bit()`, +:c:func:`test_and_clear_bit()` and +:c:func:`test_and_change_bit()` do the same thing, except return +true if the bit was previously set; these are particularly useful for +atomically setting flags. + +It is possible to call these operations with bit indices greater than +``BITS_PER_LONG``. The resulting behavior is strange on big-endian +platforms though so it is a good idea not to do this. + +Symbols +======= + +Within the kernel proper, the normal linking rules apply (ie. unless a +symbol is declared to be file scope with the ``static`` keyword, it can +be used anywhere in the kernel). However, for modules, a special +exported symbol table is kept which limits the entry points to the +kernel proper. Modules can also export symbols. + +:c:func:`EXPORT_SYMBOL()` +------------------------- + +Defined in ``include/linux/export.h`` + +This is the classic method of exporting a symbol: dynamically loaded +modules will be able to use the symbol as normal. + +:c:func:`EXPORT_SYMBOL_GPL()` +----------------------------- + +Defined in ``include/linux/export.h`` + +Similar to :c:func:`EXPORT_SYMBOL()` except that the symbols +exported by :c:func:`EXPORT_SYMBOL_GPL()` can only be seen by +modules with a :c:func:`MODULE_LICENSE()` that specifies a GPL +compatible license. It implies that the function is considered an +internal implementation issue, and not really an interface. Some +maintainers and developers may however require EXPORT_SYMBOL_GPL() +when adding any new APIs or functionality. + +Routines and Conventions +======================== + +Double-linked lists ``include/linux/list.h`` +-------------------------------------------- + +There used to be three sets of linked-list routines in the kernel +headers, but this one is the winner. If you don't have some particular +pressing need for a single list, it's a good choice. + +In particular, :c:func:`list_for_each_entry()` is useful. + +Return Conventions +------------------ + +For code called in user context, it's very common to defy C convention, +and return 0 for success, and a negative error number (eg. ``-EFAULT``) for +failure. This can be unintuitive at first, but it's fairly widespread in +the kernel. + +Using :c:func:`ERR_PTR()` (``include/linux/err.h``) to encode a +negative error number into a pointer, and :c:func:`IS_ERR()` and +:c:func:`PTR_ERR()` to get it back out again: avoids a separate +pointer parameter for the error number. Icky, but in a good way. + +Breaking Compilation +-------------------- + +Linus and the other developers sometimes change function or structure +names in development kernels; this is not done just to keep everyone on +their toes: it reflects a fundamental change (eg. can no longer be +called with interrupts on, or does extra checks, or doesn't do checks +which were caught before). Usually this is accompanied by a fairly +complete note to the linux-kernel mailing list; search the archive. +Simply doing a global replace on the file usually makes things **worse**. + +Initializing structure members +------------------------------ + +The preferred method of initializing structures is to use designated +initialisers, as defined by ISO C99, eg:: + + static struct block_device_operations opt_fops = { + .open = opt_open, + .release = opt_release, + .ioctl = opt_ioctl, + .check_media_change = opt_media_change, + }; + + +This makes it easy to grep for, and makes it clear which structure +fields are set. You should do this because it looks cool. + +GNU Extensions +-------------- + +GNU Extensions are explicitly allowed in the Linux kernel. Note that +some of the more complex ones are not very well supported, due to lack +of general use, but the following are considered standard (see the GCC +info page section "C Extensions" for more details - Yes, really the info +page, the man page is only a short summary of the stuff in info). + +- Inline functions + +- Statement expressions (ie. the ({ and }) constructs). + +- Declaring attributes of a function / variable / type + (__attribute__) + +- typeof + +- Zero length arrays + +- Macro varargs + +- Arithmetic on void pointers + +- Non-Constant initializers + +- Assembler Instructions (not outside arch/ and include/asm/) + +- Function names as strings (__func__). + +- __builtin_constant_p() + +Be wary when using long long in the kernel, the code gcc generates for +it is horrible and worse: division and multiplication does not work on +i386 because the GCC runtime functions for it are missing from the +kernel environment. + +C++ +--- + +Using C++ in the kernel is usually a bad idea, because the kernel does +not provide the necessary runtime environment and the include files are +not tested for it. It is still possible, but not recommended. If you +really want to do this, forget about exceptions at least. + +NUMif +----- + +It is generally considered cleaner to use macros in header files (or at +the top of .c files) to abstract away functions rather than using \`#if' +pre-processor statements throughout the source code. + +Putting Your Stuff in the Kernel +================================ + +In order to get your stuff into shape for official inclusion, or even to +make a neat patch, there's administrative work to be done: + +- Figure out whose pond you've been pissing in. Look at the top of the + source files, inside the ``MAINTAINERS`` file, and last of all in the + ``CREDITS`` file. You should coordinate with this person to make sure + you're not duplicating effort, or trying something that's already + been rejected. + + Make sure you put your name and EMail address at the top of any files + you create or mangle significantly. This is the first place people + will look when they find a bug, or when **they** want to make a change. + +- Usually you want a configuration option for your kernel hack. Edit + ``Kconfig`` in the appropriate directory. The Config language is + simple to use by cut and paste, and there's complete documentation in + ``Documentation/kbuild/kconfig-language.txt``. + + In your description of the option, make sure you address both the + expert user and the user who knows nothing about your feature. + Mention incompatibilities and issues here. **Definitely** end your + description with โif in doubt, say Nโ (or, occasionally, \`Y'); this + is for people who have no idea what you are talking about. + +- Edit the ``Makefile``: the CONFIG variables are exported here so you + can usually just add a "obj-$(CONFIG_xxx) += xxx.o" line. The syntax + is documented in ``Documentation/kbuild/makefiles.txt``. + +- Put yourself in ``CREDITS`` if you've done something noteworthy, + usually beyond a single file (your name should be at the top of the + source files anyway). ``MAINTAINERS`` means you want to be consulted + when changes are made to a subsystem, and hear about bugs; it implies + a more-than-passing commitment to some part of the code. + +- Finally, don't forget to read + ``Documentation/process/submitting-patches.rst`` and possibly + ``Documentation/process/submitting-drivers.rst``. + +Kernel Cantrips +=============== + +Some favorites from browsing the source. Feel free to add to this list. + +``arch/x86/include/asm/delay.h``:: + + #define ndelay(n) (__builtin_constant_p(n) ? \ + ((n) > 20000 ? __bad_ndelay() : __const_udelay((n) * 5ul)) : \ + __ndelay(n)) + + +``include/linux/fs.h``:: + + /* + * Kernel pointers have redundant information, so we can use a + * scheme where we can return either an error code or a dentry + * pointer with the same return value. + * + * This should be a per-architecture thing, to allow different + * error and pointer decisions. + */ + #define ERR_PTR(err) ((void *)((long)(err))) + #define PTR_ERR(ptr) ((long)(ptr)) + #define IS_ERR(ptr) ((unsigned long)(ptr) > (unsigned long)(-1000)) + +``arch/x86/include/asm/uaccess_32.h:``:: + + #define copy_to_user(to,from,n) \ + (__builtin_constant_p(n) ? \ + __constant_copy_to_user((to),(from),(n)) : \ + __generic_copy_to_user((to),(from),(n))) + + +``arch/sparc/kernel/head.S:``:: + + /* + * Sun people can't spell worth damn. "compatability" indeed. + * At least we *know* we can't spell, and use a spell-checker. + */ + + /* Uh, actually Linus it is I who cannot spell. Too much murky + * Sparc assembly will do this to ya. + */ + C_LABEL(cputypvar): + .asciz "compatibility" + + /* Tested on SS-5, SS-10. Probably someone at Sun applied a spell-checker. */ + .align 4 + C_LABEL(cputypvar_sun4m): + .asciz "compatible" + + +``arch/sparc/lib/checksum.S:``:: + + /* Sun, you just can't beat me, you just can't. Stop trying, + * give up. I'm serious, I am going to kick the living shit + * out of you, game over, lights out. + */ + + +Thanks +====== + +Thanks to Andi Kleen for the idea, answering my questions, fixing my +mistakes, filling content, etc. Philipp Rumpf for more spelling and +clarity fixes, and some excellent non-obvious points. Werner Almesberger +for giving me a great summary of :c:func:`disable_irq()`, and Jes +Sorensen and Andrea Arcangeli added caveats. Michael Elizabeth Chastain +for checking and adding to the Configure section. Telsa Gwynne for +teaching me DocBook. diff --git a/Documentation/kernel-hacking/index.rst b/Documentation/kernel-hacking/index.rst new file mode 100644 index 000000000000..fcb0eda3cca3 --- /dev/null +++ b/Documentation/kernel-hacking/index.rst @@ -0,0 +1,9 @@ +===================== +Kernel Hacking Guides +===================== + +.. toctree:: + :maxdepth: 2 + + hacking + locking diff --git a/Documentation/kernel-hacking/locking.rst b/Documentation/kernel-hacking/locking.rst new file mode 100644 index 000000000000..f937c0fd11aa --- /dev/null +++ b/Documentation/kernel-hacking/locking.rst @@ -0,0 +1,1446 @@ +=========================== +Unreliable Guide To Locking +=========================== + +:Author: Rusty Russell + +Introduction +============ + +Welcome, to Rusty's Remarkably Unreliable Guide to Kernel Locking +issues. This document describes the locking systems in the Linux Kernel +in 2.6. + +With the wide availability of HyperThreading, and preemption in the +Linux Kernel, everyone hacking on the kernel needs to know the +fundamentals of concurrency and locking for SMP. + +The Problem With Concurrency +============================ + +(Skip this if you know what a Race Condition is). + +In a normal program, you can increment a counter like so: + +:: + + very_important_count++; + + +This is what they would expect to happen: + + +.. table:: Expected Results + + +------------------------------------+------------------------------------+ + | Instance 1 | Instance 2 | + +====================================+====================================+ + | read very_important_count (5) | | + +------------------------------------+------------------------------------+ + | add 1 (6) | | + +------------------------------------+------------------------------------+ + | write very_important_count (6) | | + +------------------------------------+------------------------------------+ + | | read very_important_count (6) | + +------------------------------------+------------------------------------+ + | | add 1 (7) | + +------------------------------------+------------------------------------+ + | | write very_important_count (7) | + +------------------------------------+------------------------------------+ + +This is what might happen: + +.. table:: Possible Results + + +------------------------------------+------------------------------------+ + | Instance 1 | Instance 2 | + +====================================+====================================+ + | read very_important_count (5) | | + +------------------------------------+------------------------------------+ + | | read very_important_count (5) | + +------------------------------------+------------------------------------+ + | add 1 (6) | | + +------------------------------------+------------------------------------+ + | | add 1 (6) | + +------------------------------------+------------------------------------+ + | write very_important_count (6) | | + +------------------------------------+------------------------------------+ + | | write very_important_count (6) | + +------------------------------------+------------------------------------+ + + +Race Conditions and Critical Regions +------------------------------------ + +This overlap, where the result depends on the relative timing of +multiple tasks, is called a race condition. The piece of code containing +the concurrency issue is called a critical region. And especially since +Linux starting running on SMP machines, they became one of the major +issues in kernel design and implementation. + +Preemption can have the same effect, even if there is only one CPU: by +preempting one task during the critical region, we have exactly the same +race condition. In this case the thread which preempts might run the +critical region itself. + +The solution is to recognize when these simultaneous accesses occur, and +use locks to make sure that only one instance can enter the critical +region at any time. There are many friendly primitives in the Linux +kernel to help you do this. And then there are the unfriendly +primitives, but I'll pretend they don't exist. + +Locking in the Linux Kernel +=========================== + +If I could give you one piece of advice: never sleep with anyone crazier +than yourself. But if I had to give you advice on locking: **keep it +simple**. + +Be reluctant to introduce new locks. + +Strangely enough, this last one is the exact reverse of my advice when +you **have** slept with someone crazier than yourself. And you should +think about getting a big dog. + +Two Main Types of Kernel Locks: Spinlocks and Mutexes +----------------------------------------------------- + +There are two main types of kernel locks. The fundamental type is the +spinlock (``include/asm/spinlock.h``), which is a very simple +single-holder lock: if you can't get the spinlock, you keep trying +(spinning) until you can. Spinlocks are very small and fast, and can be +used anywhere. + +The second type is a mutex (``include/linux/mutex.h``): it is like a +spinlock, but you may block holding a mutex. If you can't lock a mutex, +your task will suspend itself, and be woken up when the mutex is +released. This means the CPU can do something else while you are +waiting. There are many cases when you simply can't sleep (see +`What Functions Are Safe To Call From Interrupts? <#sleeping-things>`__), +and so have to use a spinlock instead. + +Neither type of lock is recursive: see +`Deadlock: Simple and Advanced <#deadlock>`__. + +Locks and Uniprocessor Kernels +------------------------------ + +For kernels compiled without ``CONFIG_SMP``, and without +``CONFIG_PREEMPT`` spinlocks do not exist at all. This is an excellent +design decision: when no-one else can run at the same time, there is no +reason to have a lock. + +If the kernel is compiled without ``CONFIG_SMP``, but ``CONFIG_PREEMPT`` +is set, then spinlocks simply disable preemption, which is sufficient to +prevent any races. For most purposes, we can think of preemption as +equivalent to SMP, and not worry about it separately. + +You should always test your locking code with ``CONFIG_SMP`` and +``CONFIG_PREEMPT`` enabled, even if you don't have an SMP test box, +because it will still catch some kinds of locking bugs. + +Mutexes still exist, because they are required for synchronization +between user contexts, as we will see below. + +Locking Only In User Context +---------------------------- + +If you have a data structure which is only ever accessed from user +context, then you can use a simple mutex (``include/linux/mutex.h``) to +protect it. This is the most trivial case: you initialize the mutex. +Then you can call :c:func:`mutex_lock_interruptible()` to grab the +mutex, and :c:func:`mutex_unlock()` to release it. There is also a +:c:func:`mutex_lock()`, which should be avoided, because it will +not return if a signal is received. + +Example: ``net/netfilter/nf_sockopt.c`` allows registration of new +:c:func:`setsockopt()` and :c:func:`getsockopt()` calls, with +:c:func:`nf_register_sockopt()`. Registration and de-registration +are only done on module load and unload (and boot time, where there is +no concurrency), and the list of registrations is only consulted for an +unknown :c:func:`setsockopt()` or :c:func:`getsockopt()` system +call. The ``nf_sockopt_mutex`` is perfect to protect this, especially +since the setsockopt and getsockopt calls may well sleep. + +Locking Between User Context and Softirqs +----------------------------------------- + +If a softirq shares data with user context, you have two problems. +Firstly, the current user context can be interrupted by a softirq, and +secondly, the critical region could be entered from another CPU. This is +where :c:func:`spin_lock_bh()` (``include/linux/spinlock.h``) is +used. It disables softirqs on that CPU, then grabs the lock. +:c:func:`spin_unlock_bh()` does the reverse. (The '_bh' suffix is +a historical reference to "Bottom Halves", the old name for software +interrupts. It should really be called spin_lock_softirq()' in a +perfect world). + +Note that you can also use :c:func:`spin_lock_irq()` or +:c:func:`spin_lock_irqsave()` here, which stop hardware interrupts +as well: see `Hard IRQ Context <#hardirq-context>`__. + +This works perfectly for UP as well: the spin lock vanishes, and this +macro simply becomes :c:func:`local_bh_disable()` +(``include/linux/interrupt.h``), which protects you from the softirq +being run. + +Locking Between User Context and Tasklets +----------------------------------------- + +This is exactly the same as above, because tasklets are actually run +from a softirq. + +Locking Between User Context and Timers +--------------------------------------- + +This, too, is exactly the same as above, because timers are actually run +from a softirq. From a locking point of view, tasklets and timers are +identical. + +Locking Between Tasklets/Timers +------------------------------- + +Sometimes a tasklet or timer might want to share data with another +tasklet or timer. + +The Same Tasklet/Timer +~~~~~~~~~~~~~~~~~~~~~~ + +Since a tasklet is never run on two CPUs at once, you don't need to +worry about your tasklet being reentrant (running twice at once), even +on SMP. + +Different Tasklets/Timers +~~~~~~~~~~~~~~~~~~~~~~~~~ + +If another tasklet/timer wants to share data with your tasklet or timer +, you will both need to use :c:func:`spin_lock()` and +:c:func:`spin_unlock()` calls. :c:func:`spin_lock_bh()` is +unnecessary here, as you are already in a tasklet, and none will be run +on the same CPU. + +Locking Between Softirqs +------------------------ + +Often a softirq might want to share data with itself or a tasklet/timer. + +The Same Softirq +~~~~~~~~~~~~~~~~ + +The same softirq can run on the other CPUs: you can use a per-CPU array +(see `Per-CPU Data <#per-cpu>`__) for better performance. If you're +going so far as to use a softirq, you probably care about scalable +performance enough to justify the extra complexity. + +You'll need to use :c:func:`spin_lock()` and +:c:func:`spin_unlock()` for shared data. + +Different Softirqs +~~~~~~~~~~~~~~~~~~ + +You'll need to use :c:func:`spin_lock()` and +:c:func:`spin_unlock()` for shared data, whether it be a timer, +tasklet, different softirq or the same or another softirq: any of them +could be running on a different CPU. + +Hard IRQ Context +================ + +Hardware interrupts usually communicate with a tasklet or softirq. +Frequently this involves putting work in a queue, which the softirq will +take out. + +Locking Between Hard IRQ and Softirqs/Tasklets +---------------------------------------------- + +If a hardware irq handler shares data with a softirq, you have two +concerns. Firstly, the softirq processing can be interrupted by a +hardware interrupt, and secondly, the critical region could be entered +by a hardware interrupt on another CPU. This is where +:c:func:`spin_lock_irq()` is used. It is defined to disable +interrupts on that cpu, then grab the lock. +:c:func:`spin_unlock_irq()` does the reverse. + +The irq handler does not to use :c:func:`spin_lock_irq()`, because +the softirq cannot run while the irq handler is running: it can use +:c:func:`spin_lock()`, which is slightly faster. The only exception +would be if a different hardware irq handler uses the same lock: +:c:func:`spin_lock_irq()` will stop that from interrupting us. + +This works perfectly for UP as well: the spin lock vanishes, and this +macro simply becomes :c:func:`local_irq_disable()` +(``include/asm/smp.h``), which protects you from the softirq/tasklet/BH +being run. + +:c:func:`spin_lock_irqsave()` (``include/linux/spinlock.h``) is a +variant which saves whether interrupts were on or off in a flags word, +which is passed to :c:func:`spin_unlock_irqrestore()`. This means +that the same code can be used inside an hard irq handler (where +interrupts are already off) and in softirqs (where the irq disabling is +required). + +Note that softirqs (and hence tasklets and timers) are run on return +from hardware interrupts, so :c:func:`spin_lock_irq()` also stops +these. In that sense, :c:func:`spin_lock_irqsave()` is the most +general and powerful locking function. + +Locking Between Two Hard IRQ Handlers +------------------------------------- + +It is rare to have to share data between two IRQ handlers, but if you +do, :c:func:`spin_lock_irqsave()` should be used: it is +architecture-specific whether all interrupts are disabled inside irq +handlers themselves. + +Cheat Sheet For Locking +======================= + +Pete Zaitcev gives the following summary: + +- If you are in a process context (any syscall) and want to lock other + process out, use a mutex. You can take a mutex and sleep + (``copy_from_user*(`` or ``kmalloc(x,GFP_KERNEL)``). + +- Otherwise (== data can be touched in an interrupt), use + :c:func:`spin_lock_irqsave()` and + :c:func:`spin_unlock_irqrestore()`. + +- Avoid holding spinlock for more than 5 lines of code and across any + function call (except accessors like :c:func:`readb()`). + +Table of Minimum Requirements +----------------------------- + +The following table lists the **minimum** locking requirements between +various contexts. In some cases, the same context can only be running on +one CPU at a time, so no locking is required for that context (eg. a +particular thread can only run on one CPU at a time, but if it needs +shares data with another thread, locking is required). + +Remember the advice above: you can always use +:c:func:`spin_lock_irqsave()`, which is a superset of all other +spinlock primitives. + +============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ============== +. IRQ Handler A IRQ Handler B Softirq A Softirq B Tasklet A Tasklet B Timer A Timer B User Context A User Context B +============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ============== +IRQ Handler A None +IRQ Handler B SLIS None +Softirq A SLI SLI SL +Softirq B SLI SLI SL SL +Tasklet A SLI SLI SL SL None +Tasklet B SLI SLI SL SL SL None +Timer A SLI SLI SL SL SL SL None +Timer B SLI SLI SL SL SL SL SL None +User Context A SLI SLI SLBH SLBH SLBH SLBH SLBH SLBH None +User Context B SLI SLI SLBH SLBH SLBH SLBH SLBH SLBH MLI None +============== ============= ============= ========= ========= ========= ========= ======= ======= ============== ============== + +Table: Table of Locking Requirements + ++--------+----------------------------+ +| SLIS | spin_lock_irqsave | ++--------+----------------------------+ +| SLI | spin_lock_irq | ++--------+----------------------------+ +| SL | spin_lock | ++--------+----------------------------+ +| SLBH | spin_lock_bh | ++--------+----------------------------+ +| MLI | mutex_lock_interruptible | ++--------+----------------------------+ + +Table: Legend for Locking Requirements Table + +The trylock Functions +===================== + +There are functions that try to acquire a lock only once and immediately +return a value telling about success or failure to acquire the lock. +They can be used if you need no access to the data protected with the +lock when some other thread is holding the lock. You should acquire the +lock later if you then need access to the data protected with the lock. + +:c:func:`spin_trylock()` does not spin but returns non-zero if it +acquires the spinlock on the first try or 0 if not. This function can be +used in all contexts like :c:func:`spin_lock()`: you must have +disabled the contexts that might interrupt you and acquire the spin +lock. + +:c:func:`mutex_trylock()` does not suspend your task but returns +non-zero if it could lock the mutex on the first try or 0 if not. This +function cannot be safely used in hardware or software interrupt +contexts despite not sleeping. + +Common Examples +=============== + +Let's step through a simple example: a cache of number to name mappings. +The cache keeps a count of how often each of the objects is used, and +when it gets full, throws out the least used one. + +All In User Context +------------------- + +For our first example, we assume that all operations are in user context +(ie. from system calls), so we can sleep. This means we can use a mutex +to protect the cache and all the objects within it. Here's the code:: + + #include <linux/list.h> + #include <linux/slab.h> + #include <linux/string.h> + #include <linux/mutex.h> + #include <asm/errno.h> + + struct object + { + struct list_head list; + int id; + char name[32]; + int popularity; + }; + + /* Protects the cache, cache_num, and the objects within it */ + static DEFINE_MUTEX(cache_lock); + static LIST_HEAD(cache); + static unsigned int cache_num = 0; + #define MAX_CACHE_SIZE 10 + + /* Must be holding cache_lock */ + static struct object *__cache_find(int id) + { + struct object *i; + + list_for_each_entry(i, &cache, list) + if (i->id == id) { + i->popularity++; + return i; + } + return NULL; + } + + /* Must be holding cache_lock */ + static void __cache_delete(struct object *obj) + { + BUG_ON(!obj); + list_del(&obj->list); + kfree(obj); + cache_num--; + } + + /* Must be holding cache_lock */ + static void __cache_add(struct object *obj) + { + list_add(&obj->list, &cache); + if (++cache_num > MAX_CACHE_SIZE) { + struct object *i, *outcast = NULL; + list_for_each_entry(i, &cache, list) { + if (!outcast || i->popularity < outcast->popularity) + outcast = i; + } + __cache_delete(outcast); + } + } + + int cache_add(int id, const char *name) + { + struct object *obj; + + if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL) + return -ENOMEM; + + strlcpy(obj->name, name, sizeof(obj->name)); + obj->id = id; + obj->popularity = 0; + + mutex_lock(&cache_lock); + __cache_add(obj); + mutex_unlock(&cache_lock); + return 0; + } + + void cache_delete(int id) + { + mutex_lock(&cache_lock); + __cache_delete(__cache_find(id)); + mutex_unlock(&cache_lock); + } + + int cache_find(int id, char *name) + { + struct object *obj; + int ret = -ENOENT; + + mutex_lock(&cache_lock); + obj = __cache_find(id); + if (obj) { + ret = 0; + strcpy(name, obj->name); + } + mutex_unlock(&cache_lock); + return ret; + } + +Note that we always make sure we have the cache_lock when we add, +delete, or look up the cache: both the cache infrastructure itself and +the contents of the objects are protected by the lock. In this case it's +easy, since we copy the data for the user, and never let them access the +objects directly. + +There is a slight (and common) optimization here: in +:c:func:`cache_add()` we set up the fields of the object before +grabbing the lock. This is safe, as no-one else can access it until we +put it in cache. + +Accessing From Interrupt Context +-------------------------------- + +Now consider the case where :c:func:`cache_find()` can be called +from interrupt context: either a hardware interrupt or a softirq. An +example would be a timer which deletes object from the cache. + +The change is shown below, in standard patch format: the ``-`` are lines +which are taken away, and the ``+`` are lines which are added. + +:: + + --- cache.c.usercontext 2003-12-09 13:58:54.000000000 +1100 + +++ cache.c.interrupt 2003-12-09 14:07:49.000000000 +1100 + @@ -12,7 +12,7 @@ + int popularity; + }; + + -static DEFINE_MUTEX(cache_lock); + +static DEFINE_SPINLOCK(cache_lock); + static LIST_HEAD(cache); + static unsigned int cache_num = 0; + #define MAX_CACHE_SIZE 10 + @@ -55,6 +55,7 @@ + int cache_add(int id, const char *name) + { + struct object *obj; + + unsigned long flags; + + if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL) + return -ENOMEM; + @@ -63,30 +64,33 @@ + obj->id = id; + obj->popularity = 0; + + - mutex_lock(&cache_lock); + + spin_lock_irqsave(&cache_lock, flags); + __cache_add(obj); + - mutex_unlock(&cache_lock); + + spin_unlock_irqrestore(&cache_lock, flags); + return 0; + } + + void cache_delete(int id) + { + - mutex_lock(&cache_lock); + + unsigned long flags; + + + + spin_lock_irqsave(&cache_lock, flags); + __cache_delete(__cache_find(id)); + - mutex_unlock(&cache_lock); + + spin_unlock_irqrestore(&cache_lock, flags); + } + + int cache_find(int id, char *name) + { + struct object *obj; + int ret = -ENOENT; + + unsigned long flags; + + - mutex_lock(&cache_lock); + + spin_lock_irqsave(&cache_lock, flags); + obj = __cache_find(id); + if (obj) { + ret = 0; + strcpy(name, obj->name); + } + - mutex_unlock(&cache_lock); + + spin_unlock_irqrestore(&cache_lock, flags); + return ret; + } + +Note that the :c:func:`spin_lock_irqsave()` will turn off +interrupts if they are on, otherwise does nothing (if we are already in +an interrupt handler), hence these functions are safe to call from any +context. + +Unfortunately, :c:func:`cache_add()` calls :c:func:`kmalloc()` +with the ``GFP_KERNEL`` flag, which is only legal in user context. I +have assumed that :c:func:`cache_add()` is still only called in +user context, otherwise this should become a parameter to +:c:func:`cache_add()`. + +Exposing Objects Outside This File +---------------------------------- + +If our objects contained more information, it might not be sufficient to +copy the information in and out: other parts of the code might want to +keep pointers to these objects, for example, rather than looking up the +id every time. This produces two problems. + +The first problem is that we use the ``cache_lock`` to protect objects: +we'd need to make this non-static so the rest of the code can use it. +This makes locking trickier, as it is no longer all in one place. + +The second problem is the lifetime problem: if another structure keeps a +pointer to an object, it presumably expects that pointer to remain +valid. Unfortunately, this is only guaranteed while you hold the lock, +otherwise someone might call :c:func:`cache_delete()` and even +worse, add another object, re-using the same address. + +As there is only one lock, you can't hold it forever: no-one else would +get any work done. + +The solution to this problem is to use a reference count: everyone who +has a pointer to the object increases it when they first get the object, +and drops the reference count when they're finished with it. Whoever +drops it to zero knows it is unused, and can actually delete it. + +Here is the code:: + + --- cache.c.interrupt 2003-12-09 14:25:43.000000000 +1100 + +++ cache.c.refcnt 2003-12-09 14:33:05.000000000 +1100 + @@ -7,6 +7,7 @@ + struct object + { + struct list_head list; + + unsigned int refcnt; + int id; + char name[32]; + int popularity; + @@ -17,6 +18,35 @@ + static unsigned int cache_num = 0; + #define MAX_CACHE_SIZE 10 + + +static void __object_put(struct object *obj) + +{ + + if (--obj->refcnt == 0) + + kfree(obj); + +} + + + +static void __object_get(struct object *obj) + +{ + + obj->refcnt++; + +} + + + +void object_put(struct object *obj) + +{ + + unsigned long flags; + + + + spin_lock_irqsave(&cache_lock, flags); + + __object_put(obj); + + spin_unlock_irqrestore(&cache_lock, flags); + +} + + + +void object_get(struct object *obj) + +{ + + unsigned long flags; + + + + spin_lock_irqsave(&cache_lock, flags); + + __object_get(obj); + + spin_unlock_irqrestore(&cache_lock, flags); + +} + + + /* Must be holding cache_lock */ + static struct object *__cache_find(int id) + { + @@ -35,6 +65,7 @@ + { + BUG_ON(!obj); + list_del(&obj->list); + + __object_put(obj); + cache_num--; + } + + @@ -63,6 +94,7 @@ + strlcpy(obj->name, name, sizeof(obj->name)); + obj->id = id; + obj->popularity = 0; + + obj->refcnt = 1; /* The cache holds a reference */ + + spin_lock_irqsave(&cache_lock, flags); + __cache_add(obj); + @@ -79,18 +111,15 @@ + spin_unlock_irqrestore(&cache_lock, flags); + } + + -int cache_find(int id, char *name) + +struct object *cache_find(int id) + { + struct object *obj; + - int ret = -ENOENT; + unsigned long flags; + + spin_lock_irqsave(&cache_lock, flags); + obj = __cache_find(id); + - if (obj) { + - ret = 0; + - strcpy(name, obj->name); + - } + + if (obj) + + __object_get(obj); + spin_unlock_irqrestore(&cache_lock, flags); + - return ret; + + return obj; + } + +We encapsulate the reference counting in the standard 'get' and 'put' +functions. Now we can return the object itself from +:c:func:`cache_find()` which has the advantage that the user can +now sleep holding the object (eg. to :c:func:`copy_to_user()` to +name to userspace). + +The other point to note is that I said a reference should be held for +every pointer to the object: thus the reference count is 1 when first +inserted into the cache. In some versions the framework does not hold a +reference count, but they are more complicated. + +Using Atomic Operations For The Reference Count +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In practice, :c:type:`atomic_t` would usually be used for refcnt. There are a +number of atomic operations defined in ``include/asm/atomic.h``: these +are guaranteed to be seen atomically from all CPUs in the system, so no +lock is required. In this case, it is simpler than using spinlocks, +although for anything non-trivial using spinlocks is clearer. The +:c:func:`atomic_inc()` and :c:func:`atomic_dec_and_test()` +are used instead of the standard increment and decrement operators, and +the lock is no longer used to protect the reference count itself. + +:: + + --- cache.c.refcnt 2003-12-09 15:00:35.000000000 +1100 + +++ cache.c.refcnt-atomic 2003-12-11 15:49:42.000000000 +1100 + @@ -7,7 +7,7 @@ + struct object + { + struct list_head list; + - unsigned int refcnt; + + atomic_t refcnt; + int id; + char name[32]; + int popularity; + @@ -18,33 +18,15 @@ + static unsigned int cache_num = 0; + #define MAX_CACHE_SIZE 10 + + -static void __object_put(struct object *obj) + -{ + - if (--obj->refcnt == 0) + - kfree(obj); + -} + - + -static void __object_get(struct object *obj) + -{ + - obj->refcnt++; + -} + - + void object_put(struct object *obj) + { + - unsigned long flags; + - + - spin_lock_irqsave(&cache_lock, flags); + - __object_put(obj); + - spin_unlock_irqrestore(&cache_lock, flags); + + if (atomic_dec_and_test(&obj->refcnt)) + + kfree(obj); + } + + void object_get(struct object *obj) + { + - unsigned long flags; + - + - spin_lock_irqsave(&cache_lock, flags); + - __object_get(obj); + - spin_unlock_irqrestore(&cache_lock, flags); + + atomic_inc(&obj->refcnt); + } + + /* Must be holding cache_lock */ + @@ -65,7 +47,7 @@ + { + BUG_ON(!obj); + list_del(&obj->list); + - __object_put(obj); + + object_put(obj); + cache_num--; + } + + @@ -94,7 +76,7 @@ + strlcpy(obj->name, name, sizeof(obj->name)); + obj->id = id; + obj->popularity = 0; + - obj->refcnt = 1; /* The cache holds a reference */ + + atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ + + spin_lock_irqsave(&cache_lock, flags); + __cache_add(obj); + @@ -119,7 +101,7 @@ + spin_lock_irqsave(&cache_lock, flags); + obj = __cache_find(id); + if (obj) + - __object_get(obj); + + object_get(obj); + spin_unlock_irqrestore(&cache_lock, flags); + return obj; + } + +Protecting The Objects Themselves +--------------------------------- + +In these examples, we assumed that the objects (except the reference +counts) never changed once they are created. If we wanted to allow the +name to change, there are three possibilities: + +- You can make ``cache_lock`` non-static, and tell people to grab that + lock before changing the name in any object. + +- You can provide a :c:func:`cache_obj_rename()` which grabs this + lock and changes the name for the caller, and tell everyone to use + that function. + +- You can make the ``cache_lock`` protect only the cache itself, and + use another lock to protect the name. + +Theoretically, you can make the locks as fine-grained as one lock for +every field, for every object. In practice, the most common variants +are: + +- One lock which protects the infrastructure (the ``cache`` list in + this example) and all the objects. This is what we have done so far. + +- One lock which protects the infrastructure (including the list + pointers inside the objects), and one lock inside the object which + protects the rest of that object. + +- Multiple locks to protect the infrastructure (eg. one lock per hash + chain), possibly with a separate per-object lock. + +Here is the "lock-per-object" implementation: + +:: + + --- cache.c.refcnt-atomic 2003-12-11 15:50:54.000000000 +1100 + +++ cache.c.perobjectlock 2003-12-11 17:15:03.000000000 +1100 + @@ -6,11 +6,17 @@ + + struct object + { + + /* These two protected by cache_lock. */ + struct list_head list; + + int popularity; + + + atomic_t refcnt; + + + + /* Doesn't change once created. */ + int id; + + + + spinlock_t lock; /* Protects the name */ + char name[32]; + - int popularity; + }; + + static DEFINE_SPINLOCK(cache_lock); + @@ -77,6 +84,7 @@ + obj->id = id; + obj->popularity = 0; + atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ + + spin_lock_init(&obj->lock); + + spin_lock_irqsave(&cache_lock, flags); + __cache_add(obj); + +Note that I decide that the popularity count should be protected by the +``cache_lock`` rather than the per-object lock: this is because it (like +the :c:type:`struct list_head <list_head>` inside the object) +is logically part of the infrastructure. This way, I don't need to grab +the lock of every object in :c:func:`__cache_add()` when seeking +the least popular. + +I also decided that the id member is unchangeable, so I don't need to +grab each object lock in :c:func:`__cache_find()` to examine the +id: the object lock is only used by a caller who wants to read or write +the name field. + +Note also that I added a comment describing what data was protected by +which locks. This is extremely important, as it describes the runtime +behavior of the code, and can be hard to gain from just reading. And as +Alan Cox says, โLock data, not codeโ. + +Common Problems +=============== + +Deadlock: Simple and Advanced +----------------------------- + +There is a coding bug where a piece of code tries to grab a spinlock +twice: it will spin forever, waiting for the lock to be released +(spinlocks, rwlocks and mutexes are not recursive in Linux). This is +trivial to diagnose: not a +stay-up-five-nights-talk-to-fluffy-code-bunnies kind of problem. + +For a slightly more complex case, imagine you have a region shared by a +softirq and user context. If you use a :c:func:`spin_lock()` call +to protect it, it is possible that the user context will be interrupted +by the softirq while it holds the lock, and the softirq will then spin +forever trying to get the same lock. + +Both of these are called deadlock, and as shown above, it can occur even +with a single CPU (although not on UP compiles, since spinlocks vanish +on kernel compiles with ``CONFIG_SMP``\ =n. You'll still get data +corruption in the second example). + +This complete lockup is easy to diagnose: on SMP boxes the watchdog +timer or compiling with ``DEBUG_SPINLOCK`` set +(``include/linux/spinlock.h``) will show this up immediately when it +happens. + +A more complex problem is the so-called 'deadly embrace', involving two +or more locks. Say you have a hash table: each entry in the table is a +spinlock, and a chain of hashed objects. Inside a softirq handler, you +sometimes want to alter an object from one place in the hash to another: +you grab the spinlock of the old hash chain and the spinlock of the new +hash chain, and delete the object from the old one, and insert it in the +new one. + +There are two problems here. First, if your code ever tries to move the +object to the same chain, it will deadlock with itself as it tries to +lock it twice. Secondly, if the same softirq on another CPU is trying to +move another object in the reverse direction, the following could +happen: + ++-----------------------+-----------------------+ +| CPU 1 | CPU 2 | ++=======================+=======================+ +| Grab lock A -> OK | Grab lock B -> OK | ++-----------------------+-----------------------+ +| Grab lock B -> spin | Grab lock A -> spin | ++-----------------------+-----------------------+ + +Table: Consequences + +The two CPUs will spin forever, waiting for the other to give up their +lock. It will look, smell, and feel like a crash. + +Preventing Deadlock +------------------- + +Textbooks will tell you that if you always lock in the same order, you +will never get this kind of deadlock. Practice will tell you that this +approach doesn't scale: when I create a new lock, I don't understand +enough of the kernel to figure out where in the 5000 lock hierarchy it +will fit. + +The best locks are encapsulated: they never get exposed in headers, and +are never held around calls to non-trivial functions outside the same +file. You can read through this code and see that it will never +deadlock, because it never tries to grab another lock while it has that +one. People using your code don't even need to know you are using a +lock. + +A classic problem here is when you provide callbacks or hooks: if you +call these with the lock held, you risk simple deadlock, or a deadly +embrace (who knows what the callback will do?). Remember, the other +programmers are out to get you, so don't do this. + +Overzealous Prevention Of Deadlocks +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Deadlocks are problematic, but not as bad as data corruption. Code which +grabs a read lock, searches a list, fails to find what it wants, drops +the read lock, grabs a write lock and inserts the object has a race +condition. + +If you don't see why, please stay the fuck away from my code. + +Racing Timers: A Kernel Pastime +------------------------------- + +Timers can produce their own special problems with races. Consider a +collection of objects (list, hash, etc) where each object has a timer +which is due to destroy it. + +If you want to destroy the entire collection (say on module removal), +you might do the following:: + + /* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE + HUNGARIAN NOTATION */ + spin_lock_bh(&list_lock); + + while (list) { + struct foo *next = list->next; + del_timer(&list->timer); + kfree(list); + list = next; + } + + spin_unlock_bh(&list_lock); + + +Sooner or later, this will crash on SMP, because a timer can have just +gone off before the :c:func:`spin_lock_bh()`, and it will only get +the lock after we :c:func:`spin_unlock_bh()`, and then try to free +the element (which has already been freed!). + +This can be avoided by checking the result of +:c:func:`del_timer()`: if it returns 1, the timer has been deleted. +If 0, it means (in this case) that it is currently running, so we can +do:: + + retry: + spin_lock_bh(&list_lock); + + while (list) { + struct foo *next = list->next; + if (!del_timer(&list->timer)) { + /* Give timer a chance to delete this */ + spin_unlock_bh(&list_lock); + goto retry; + } + kfree(list); + list = next; + } + + spin_unlock_bh(&list_lock); + + +Another common problem is deleting timers which restart themselves (by +calling :c:func:`add_timer()` at the end of their timer function). +Because this is a fairly common case which is prone to races, you should +use :c:func:`del_timer_sync()` (``include/linux/timer.h``) to +handle this case. It returns the number of times the timer had to be +deleted before we finally stopped it from adding itself back in. + +Locking Speed +============= + +There are three main things to worry about when considering speed of +some code which does locking. First is concurrency: how many things are +going to be waiting while someone else is holding a lock. Second is the +time taken to actually acquire and release an uncontended lock. Third is +using fewer, or smarter locks. I'm assuming that the lock is used fairly +often: otherwise, you wouldn't be concerned about efficiency. + +Concurrency depends on how long the lock is usually held: you should +hold the lock for as long as needed, but no longer. In the cache +example, we always create the object without the lock held, and then +grab the lock only when we are ready to insert it in the list. + +Acquisition times depend on how much damage the lock operations do to +the pipeline (pipeline stalls) and how likely it is that this CPU was +the last one to grab the lock (ie. is the lock cache-hot for this CPU): +on a machine with more CPUs, this likelihood drops fast. Consider a +700MHz Intel Pentium III: an instruction takes about 0.7ns, an atomic +increment takes about 58ns, a lock which is cache-hot on this CPU takes +160ns, and a cacheline transfer from another CPU takes an additional 170 +to 360ns. (These figures from Paul McKenney's `Linux Journal RCU +article <http://www.linuxjournal.com/article.php?sid=6993>`__). + +These two aims conflict: holding a lock for a short time might be done +by splitting locks into parts (such as in our final per-object-lock +example), but this increases the number of lock acquisitions, and the +results are often slower than having a single lock. This is another +reason to advocate locking simplicity. + +The third concern is addressed below: there are some methods to reduce +the amount of locking which needs to be done. + +Read/Write Lock Variants +------------------------ + +Both spinlocks and mutexes have read/write variants: ``rwlock_t`` and +:c:type:`struct rw_semaphore <rw_semaphore>`. These divide +users into two classes: the readers and the writers. If you are only +reading the data, you can get a read lock, but to write to the data you +need the write lock. Many people can hold a read lock, but a writer must +be sole holder. + +If your code divides neatly along reader/writer lines (as our cache code +does), and the lock is held by readers for significant lengths of time, +using these locks can help. They are slightly slower than the normal +locks though, so in practice ``rwlock_t`` is not usually worthwhile. + +Avoiding Locks: Read Copy Update +-------------------------------- + +There is a special method of read/write locking called Read Copy Update. +Using RCU, the readers can avoid taking a lock altogether: as we expect +our cache to be read more often than updated (otherwise the cache is a +waste of time), it is a candidate for this optimization. + +How do we get rid of read locks? Getting rid of read locks means that +writers may be changing the list underneath the readers. That is +actually quite simple: we can read a linked list while an element is +being added if the writer adds the element very carefully. For example, +adding ``new`` to a single linked list called ``list``:: + + new->next = list->next; + wmb(); + list->next = new; + + +The :c:func:`wmb()` is a write memory barrier. It ensures that the +first operation (setting the new element's ``next`` pointer) is complete +and will be seen by all CPUs, before the second operation is (putting +the new element into the list). This is important, since modern +compilers and modern CPUs can both reorder instructions unless told +otherwise: we want a reader to either not see the new element at all, or +see the new element with the ``next`` pointer correctly pointing at the +rest of the list. + +Fortunately, there is a function to do this for standard +:c:type:`struct list_head <list_head>` lists: +:c:func:`list_add_rcu()` (``include/linux/list.h``). + +Removing an element from the list is even simpler: we replace the +pointer to the old element with a pointer to its successor, and readers +will either see it, or skip over it. + +:: + + list->next = old->next; + + +There is :c:func:`list_del_rcu()` (``include/linux/list.h``) which +does this (the normal version poisons the old object, which we don't +want). + +The reader must also be careful: some CPUs can look through the ``next`` +pointer to start reading the contents of the next element early, but +don't realize that the pre-fetched contents is wrong when the ``next`` +pointer changes underneath them. Once again, there is a +:c:func:`list_for_each_entry_rcu()` (``include/linux/list.h``) +to help you. Of course, writers can just use +:c:func:`list_for_each_entry()`, since there cannot be two +simultaneous writers. + +Our final dilemma is this: when can we actually destroy the removed +element? Remember, a reader might be stepping through this element in +the list right now: if we free this element and the ``next`` pointer +changes, the reader will jump off into garbage and crash. We need to +wait until we know that all the readers who were traversing the list +when we deleted the element are finished. We use +:c:func:`call_rcu()` to register a callback which will actually +destroy the object once all pre-existing readers are finished. +Alternatively, :c:func:`synchronize_rcu()` may be used to block +until all pre-existing are finished. + +But how does Read Copy Update know when the readers are finished? The +method is this: firstly, the readers always traverse the list inside +:c:func:`rcu_read_lock()`/:c:func:`rcu_read_unlock()` pairs: +these simply disable preemption so the reader won't go to sleep while +reading the list. + +RCU then waits until every other CPU has slept at least once: since +readers cannot sleep, we know that any readers which were traversing the +list during the deletion are finished, and the callback is triggered. +The real Read Copy Update code is a little more optimized than this, but +this is the fundamental idea. + +:: + + --- cache.c.perobjectlock 2003-12-11 17:15:03.000000000 +1100 + +++ cache.c.rcupdate 2003-12-11 17:55:14.000000000 +1100 + @@ -1,15 +1,18 @@ + #include <linux/list.h> + #include <linux/slab.h> + #include <linux/string.h> + +#include <linux/rcupdate.h> + #include <linux/mutex.h> + #include <asm/errno.h> + + struct object + { + - /* These two protected by cache_lock. */ + + /* This is protected by RCU */ + struct list_head list; + int popularity; + + + struct rcu_head rcu; + + + atomic_t refcnt; + + /* Doesn't change once created. */ + @@ -40,7 +43,7 @@ + { + struct object *i; + + - list_for_each_entry(i, &cache, list) { + + list_for_each_entry_rcu(i, &cache, list) { + if (i->id == id) { + i->popularity++; + return i; + @@ -49,19 +52,25 @@ + return NULL; + } + + +/* Final discard done once we know no readers are looking. */ + +static void cache_delete_rcu(void *arg) + +{ + + object_put(arg); + +} + + + /* Must be holding cache_lock */ + static void __cache_delete(struct object *obj) + { + BUG_ON(!obj); + - list_del(&obj->list); + - object_put(obj); + + list_del_rcu(&obj->list); + cache_num--; + + call_rcu(&obj->rcu, cache_delete_rcu); + } + + /* Must be holding cache_lock */ + static void __cache_add(struct object *obj) + { + - list_add(&obj->list, &cache); + + list_add_rcu(&obj->list, &cache); + if (++cache_num > MAX_CACHE_SIZE) { + struct object *i, *outcast = NULL; + list_for_each_entry(i, &cache, list) { + @@ -104,12 +114,11 @@ + struct object *cache_find(int id) + { + struct object *obj; + - unsigned long flags; + + - spin_lock_irqsave(&cache_lock, flags); + + rcu_read_lock(); + obj = __cache_find(id); + if (obj) + object_get(obj); + - spin_unlock_irqrestore(&cache_lock, flags); + + rcu_read_unlock(); + return obj; + } + +Note that the reader will alter the popularity member in +:c:func:`__cache_find()`, and now it doesn't hold a lock. One +solution would be to make it an ``atomic_t``, but for this usage, we +don't really care about races: an approximate result is good enough, so +I didn't change it. + +The result is that :c:func:`cache_find()` requires no +synchronization with any other functions, so is almost as fast on SMP as +it would be on UP. + +There is a further optimization possible here: remember our original +cache code, where there were no reference counts and the caller simply +held the lock whenever using the object? This is still possible: if you +hold the lock, no one can delete the object, so you don't need to get +and put the reference count. + +Now, because the 'read lock' in RCU is simply disabling preemption, a +caller which always has preemption disabled between calling +:c:func:`cache_find()` and :c:func:`object_put()` does not +need to actually get and put the reference count: we could expose +:c:func:`__cache_find()` by making it non-static, and such +callers could simply call that. + +The benefit here is that the reference count is not written to: the +object is not altered in any way, which is much faster on SMP machines +due to caching. + +Per-CPU Data +------------ + +Another technique for avoiding locking which is used fairly widely is to +duplicate information for each CPU. For example, if you wanted to keep a +count of a common condition, you could use a spin lock and a single +counter. Nice and simple. + +If that was too slow (it's usually not, but if you've got a really big +machine to test on and can show that it is), you could instead use a +counter for each CPU, then none of them need an exclusive lock. See +:c:func:`DEFINE_PER_CPU()`, :c:func:`get_cpu_var()` and +:c:func:`put_cpu_var()` (``include/linux/percpu.h``). + +Of particular use for simple per-cpu counters is the ``local_t`` type, +and the :c:func:`cpu_local_inc()` and related functions, which are +more efficient than simple code on some architectures +(``include/asm/local.h``). + +Note that there is no simple, reliable way of getting an exact value of +such a counter, without introducing more locks. This is not a problem +for some uses. + +Data Which Mostly Used By An IRQ Handler +---------------------------------------- + +If data is always accessed from within the same IRQ handler, you don't +need a lock at all: the kernel already guarantees that the irq handler +will not run simultaneously on multiple CPUs. + +Manfred Spraul points out that you can still do this, even if the data +is very occasionally accessed in user context or softirqs/tasklets. The +irq handler doesn't use a lock, and all other accesses are done as so:: + + spin_lock(&lock); + disable_irq(irq); + ... + enable_irq(irq); + spin_unlock(&lock); + +The :c:func:`disable_irq()` prevents the irq handler from running +(and waits for it to finish if it's currently running on other CPUs). +The spinlock prevents any other accesses happening at the same time. +Naturally, this is slower than just a :c:func:`spin_lock_irq()` +call, so it only makes sense if this type of access happens extremely +rarely. + +What Functions Are Safe To Call From Interrupts? +================================================ + +Many functions in the kernel sleep (ie. call schedule()) directly or +indirectly: you can never call them while holding a spinlock, or with +preemption disabled. This also means you need to be in user context: +calling them from an interrupt is illegal. + +Some Functions Which Sleep +-------------------------- + +The most common ones are listed below, but you usually have to read the +code to find out if other calls are safe. If everyone else who calls it +can sleep, you probably need to be able to sleep, too. In particular, +registration and deregistration functions usually expect to be called +from user context, and can sleep. + +- Accesses to userspace: + + - :c:func:`copy_from_user()` + + - :c:func:`copy_to_user()` + + - :c:func:`get_user()` + + - :c:func:`put_user()` + +- :c:func:`kmalloc(GFP_KERNEL) <kmalloc>` + +- :c:func:`mutex_lock_interruptible()` and + :c:func:`mutex_lock()` + + There is a :c:func:`mutex_trylock()` which does not sleep. + Still, it must not be used inside interrupt context since its + implementation is not safe for that. :c:func:`mutex_unlock()` + will also never sleep. It cannot be used in interrupt context either + since a mutex must be released by the same task that acquired it. + +Some Functions Which Don't Sleep +-------------------------------- + +Some functions are safe to call from any context, or holding almost any +lock. + +- :c:func:`printk()` + +- :c:func:`kfree()` + +- :c:func:`add_timer()` and :c:func:`del_timer()` + +Mutex API reference +=================== + +.. kernel-doc:: include/linux/mutex.h + :internal: + +.. kernel-doc:: kernel/locking/mutex.c + :export: + +Futex API reference +=================== + +.. kernel-doc:: kernel/futex.c + :internal: + +Further reading +=============== + +- ``Documentation/locking/spinlocks.txt``: Linus Torvalds' spinlocking + tutorial in the kernel sources. + +- Unix Systems for Modern Architectures: Symmetric Multiprocessing and + Caching for Kernel Programmers: + + Curt Schimmel's very good introduction to kernel level locking (not + written for Linux, but nearly everything applies). The book is + expensive, but really worth every penny to understand SMP locking. + [ISBN: 0201633388] + +Thanks +====== + +Thanks to Telsa Gwynne for DocBooking, neatening and adding style. + +Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul Mackerras, +Ruedi Aschwanden, Alan Cox, Manfred Spraul, Tim Waugh, Pete Zaitcev, +James Morris, Robert Love, Paul McKenney, John Ashby for proofreading, +correcting, flaming, commenting. + +Thanks to the cabal for having no influence on this document. + +Glossary +======== + +preemption + Prior to 2.5, or when ``CONFIG_PREEMPT`` is unset, processes in user + context inside the kernel would not preempt each other (ie. you had that + CPU until you gave it up, except for interrupts). With the addition of + ``CONFIG_PREEMPT`` in 2.5.4, this changed: when in user context, higher + priority tasks can "cut in": spinlocks were changed to disable + preemption, even on UP. + +bh + Bottom Half: for historical reasons, functions with '_bh' in them often + now refer to any software interrupt, e.g. :c:func:`spin_lock_bh()` + blocks any software interrupt on the current CPU. Bottom halves are + deprecated, and will eventually be replaced by tasklets. Only one bottom + half will be running at any time. + +Hardware Interrupt / Hardware IRQ + Hardware interrupt request. :c:func:`in_irq()` returns true in a + hardware interrupt handler. + +Interrupt Context + Not user context: processing a hardware irq or software irq. Indicated + by the :c:func:`in_interrupt()` macro returning true. + +SMP + Symmetric Multi-Processor: kernels compiled for multiple-CPU machines. + (``CONFIG_SMP=y``). + +Software Interrupt / softirq + Software interrupt handler. :c:func:`in_irq()` returns false; + :c:func:`in_softirq()` returns true. Tasklets and softirqs both + fall into the category of 'software interrupts'. + + Strictly speaking a softirq is one of up to 32 enumerated software + interrupts which can run on multiple CPUs at once. Sometimes used to + refer to tasklets as well (ie. all software interrupts). + +tasklet + A dynamically-registrable software interrupt, which is guaranteed to + only run on one CPU at a time. + +timer + A dynamically-registrable software interrupt, which is run at (or close + to) a given time. When running, it is just like a tasklet (in fact, they + are called from the ``TIMER_SOFTIRQ``). + +UP + Uni-Processor: Non-SMP. (``CONFIG_SMP=n``). + +User Context + The kernel executing on behalf of a particular process (ie. a system + call or trap) or kernel thread. You can tell which process with the + ``current`` macro.) Not to be confused with userspace. Can be + interrupted by software or hardware interrupts. + +Userspace + A process executing its own code outside the kernel. diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt index df31e30b6a02..0f00f9c164ac 100644 --- a/Documentation/kernel-per-CPU-kthreads.txt +++ b/Documentation/kernel-per-CPU-kthreads.txt @@ -1,27 +1,29 @@ -REDUCING OS JITTER DUE TO PER-CPU KTHREADS +========================================== +Reducing OS jitter due to per-cpu kthreads +========================================== This document lists per-CPU kthreads in the Linux kernel and presents options to control their OS jitter. Note that non-per-CPU kthreads are not listed here. To reduce OS jitter from non-per-CPU kthreads, bind them to a "housekeeping" CPU dedicated to such work. +References +========== -REFERENCES +- Documentation/IRQ-affinity.txt: Binding interrupts to sets of CPUs. -o Documentation/IRQ-affinity.txt: Binding interrupts to sets of CPUs. +- Documentation/cgroup-v1: Using cgroups to bind tasks to sets of CPUs. -o Documentation/cgroup-v1: Using cgroups to bind tasks to sets of CPUs. - -o man taskset: Using the taskset command to bind tasks to sets +- man taskset: Using the taskset command to bind tasks to sets of CPUs. -o man sched_setaffinity: Using the sched_setaffinity() system +- man sched_setaffinity: Using the sched_setaffinity() system call to bind tasks to sets of CPUs. -o /sys/devices/system/cpu/cpuN/online: Control CPU N's hotplug state, +- /sys/devices/system/cpu/cpuN/online: Control CPU N's hotplug state, writing "0" to offline and "1" to online. -o In order to locate kernel-generated OS jitter on CPU N: +- In order to locate kernel-generated OS jitter on CPU N: cd /sys/kernel/debug/tracing echo 1 > max_graph_depth # Increase the "1" for more detail @@ -29,12 +31,17 @@ o In order to locate kernel-generated OS jitter on CPU N: # run workload cat per_cpu/cpuN/trace +kthreads +======== + +Name: + ehca_comp/%u -KTHREADS +Purpose: + Periodically process Infiniband-related work. -Name: ehca_comp/%u -Purpose: Periodically process Infiniband-related work. To reduce its OS jitter, do any of the following: + 1. Don't use eHCA Infiniband hardware, instead choosing hardware that does not require per-CPU kthreads. This will prevent these kthreads from being created in the first place. (This will @@ -46,26 +53,45 @@ To reduce its OS jitter, do any of the following: provisioned only on selected CPUs. -Name: irq/%d-%s -Purpose: Handle threaded interrupts. +Name: + irq/%d-%s + +Purpose: + Handle threaded interrupts. + To reduce its OS jitter, do the following: + 1. Use irq affinity to force the irq threads to execute on some other CPU. -Name: kcmtpd_ctr_%d -Purpose: Handle Bluetooth work. +Name: + kcmtpd_ctr_%d + +Purpose: + Handle Bluetooth work. + To reduce its OS jitter, do one of the following: + 1. Don't use Bluetooth, in which case these kthreads won't be created in the first place. 2. Use irq affinity to force Bluetooth-related interrupts to occur on some other CPU and furthermore initiate all Bluetooth activity on some other CPU. -Name: ksoftirqd/%u -Purpose: Execute softirq handlers when threaded or when under heavy load. +Name: + ksoftirqd/%u + +Purpose: + Execute softirq handlers when threaded or when under heavy load. + To reduce its OS jitter, each softirq vector must be handled separately as follows: -TIMER_SOFTIRQ: Do all of the following: + +TIMER_SOFTIRQ +------------- + +Do all of the following: + 1. To the extent possible, keep the CPU out of the kernel when it is non-idle, for example, by avoiding system calls and by forcing both kernel threads and interrupts to execute elsewhere. @@ -76,52 +102,81 @@ TIMER_SOFTIRQ: Do all of the following: first one back online. Once you have onlined the CPUs in question, do not offline any other CPUs, because doing so could force the timer back onto one of the CPUs in question. -NET_TX_SOFTIRQ and NET_RX_SOFTIRQ: Do all of the following: + +NET_TX_SOFTIRQ and NET_RX_SOFTIRQ +--------------------------------- + +Do all of the following: + 1. Force networking interrupts onto other CPUs. 2. Initiate any network I/O on other CPUs. 3. Once your application has started, prevent CPU-hotplug operations from being initiated from tasks that might run on the CPU to be de-jittered. (It is OK to force this CPU offline and then bring it back online before you start your application.) -BLOCK_SOFTIRQ: Do all of the following: + +BLOCK_SOFTIRQ +------------- + +Do all of the following: + 1. Force block-device interrupts onto some other CPU. 2. Initiate any block I/O on other CPUs. 3. Once your application has started, prevent CPU-hotplug operations from being initiated from tasks that might run on the CPU to be de-jittered. (It is OK to force this CPU offline and then bring it back online before you start your application.) -IRQ_POLL_SOFTIRQ: Do all of the following: + +IRQ_POLL_SOFTIRQ +---------------- + +Do all of the following: + 1. Force block-device interrupts onto some other CPU. 2. Initiate any block I/O and block-I/O polling on other CPUs. 3. Once your application has started, prevent CPU-hotplug operations from being initiated from tasks that might run on the CPU to be de-jittered. (It is OK to force this CPU offline and then bring it back online before you start your application.) -TASKLET_SOFTIRQ: Do one or more of the following: + +TASKLET_SOFTIRQ +--------------- + +Do one or more of the following: + 1. Avoid use of drivers that use tasklets. (Such drivers will contain calls to things like tasklet_schedule().) 2. Convert all drivers that you must use from tasklets to workqueues. 3. Force interrupts for drivers using tasklets onto other CPUs, and also do I/O involving these drivers on other CPUs. -SCHED_SOFTIRQ: Do all of the following: + +SCHED_SOFTIRQ +------------- + +Do all of the following: + 1. Avoid sending scheduler IPIs to the CPU to be de-jittered, for example, ensure that at most one runnable kthread is present on that CPU. If a thread that expects to run on the de-jittered CPU awakens, the scheduler will send an IPI that can result in a subsequent SCHED_SOFTIRQ. -2. Build with CONFIG_RCU_NOCB_CPU=y, CONFIG_RCU_NOCB_CPU_ALL=y, - CONFIG_NO_HZ_FULL=y, and, in addition, ensure that the CPU - to be de-jittered is marked as an adaptive-ticks CPU using the - "nohz_full=" boot parameter. This reduces the number of - scheduler-clock interrupts that the de-jittered CPU receives, - minimizing its chances of being selected to do the load balancing - work that runs in SCHED_SOFTIRQ context. +2. CONFIG_NO_HZ_FULL=y and ensure that the CPU to be de-jittered + is marked as an adaptive-ticks CPU using the "nohz_full=" + boot parameter. This reduces the number of scheduler-clock + interrupts that the de-jittered CPU receives, minimizing its + chances of being selected to do the load balancing work that + runs in SCHED_SOFTIRQ context. 3. To the extent possible, keep the CPU out of the kernel when it is non-idle, for example, by avoiding system calls and by forcing both kernel threads and interrupts to execute elsewhere. This further reduces the number of scheduler-clock interrupts received by the de-jittered CPU. -HRTIMER_SOFTIRQ: Do all of the following: + +HRTIMER_SOFTIRQ +--------------- + +Do all of the following: + 1. To the extent possible, keep the CPU out of the kernel when it is non-idle. For example, avoid system calls and force both kernel threads and interrupts to execute elsewhere. @@ -132,20 +187,27 @@ HRTIMER_SOFTIRQ: Do all of the following: back online. Once you have onlined the CPUs in question, do not offline any other CPUs, because doing so could force the timer back onto one of the CPUs in question. -RCU_SOFTIRQ: Do at least one of the following: + +RCU_SOFTIRQ +----------- + +Do at least one of the following: + 1. Offload callbacks and keep the CPU in either dyntick-idle or adaptive-ticks state by doing all of the following: - a. Build with CONFIG_RCU_NOCB_CPU=y, CONFIG_RCU_NOCB_CPU_ALL=y, - CONFIG_NO_HZ_FULL=y, and, in addition ensure that the CPU - to be de-jittered is marked as an adaptive-ticks CPU using - the "nohz_full=" boot parameter. Bind the rcuo kthreads - to housekeeping CPUs, which can tolerate OS jitter. + + a. CONFIG_NO_HZ_FULL=y and ensure that the CPU to be + de-jittered is marked as an adaptive-ticks CPU using the + "nohz_full=" boot parameter. Bind the rcuo kthreads to + housekeeping CPUs, which can tolerate OS jitter. b. To the extent possible, keep the CPU out of the kernel when it is non-idle, for example, by avoiding system calls and by forcing both kernel threads and interrupts to execute elsewhere. + 2. Enable RCU to do its processing remotely via dyntick-idle by doing all of the following: + a. Build with CONFIG_NO_HZ=y and CONFIG_RCU_FAST_NO_HZ=y. b. Ensure that the CPU goes idle frequently, allowing other CPUs to detect that it has passed through an RCU quiescent @@ -157,15 +219,20 @@ RCU_SOFTIRQ: Do at least one of the following: calls and by forcing both kernel threads and interrupts to execute elsewhere. -Name: kworker/%u:%d%s (cpu, id, priority) -Purpose: Execute workqueue requests +Name: + kworker/%u:%d%s (cpu, id, priority) + +Purpose: + Execute workqueue requests + To reduce its OS jitter, do any of the following: + 1. Run your workload at a real-time priority, which will allow preempting the kworker daemons. 2. A given workqueue can be made visible in the sysfs filesystem by passing the WQ_SYSFS to that workqueue's alloc_workqueue(). Such a workqueue can be confined to a given subset of the - CPUs using the /sys/devices/virtual/workqueue/*/cpumask sysfs + CPUs using the ``/sys/devices/virtual/workqueue/*/cpumask`` sysfs files. The set of WQ_SYSFS workqueues can be displayed using "ls sys/devices/virtual/workqueue". That said, the workqueues maintainer would like to caution people against indiscriminately @@ -175,6 +242,7 @@ To reduce its OS jitter, do any of the following: to remove it, even if its addition was a mistake. 3. Do any of the following needed to avoid jitter that your application cannot tolerate: + a. Build your kernel with CONFIG_SLUB=y rather than CONFIG_SLAB=y, thus avoiding the slab allocator's periodic use of each CPU's workqueues to run its cache_reap() @@ -188,6 +256,7 @@ To reduce its OS jitter, do any of the following: be able to build your kernel with CONFIG_CPU_FREQ=n to avoid the CPU-frequency governor periodically running on each CPU, including cs_dbs_timer() and od_dbs_timer(). + WARNING: Please check your CPU specifications to make sure that this is safe on your particular system. d. As of v3.18, Christoph Lameter's on-demand vmstat workers @@ -224,9 +293,14 @@ To reduce its OS jitter, do any of the following: CONFIG_PMAC_RACKMETER=n to disable the CPU-meter, avoiding OS jitter from rackmeter_do_timer(). -Name: rcuc/%u -Purpose: Execute RCU callbacks in CONFIG_RCU_BOOST=y kernels. +Name: + rcuc/%u + +Purpose: + Execute RCU callbacks in CONFIG_RCU_BOOST=y kernels. + To reduce its OS jitter, do at least one of the following: + 1. Build the kernel with CONFIG_PREEMPT=n. This prevents these kthreads from being created in the first place, and also obviates the need for RCU priority boosting. This approach is feasible @@ -236,20 +310,24 @@ To reduce its OS jitter, do at least one of the following: is feasible only if your workload never requires RCU priority boosting, for example, if you ensure frequent idle time on all CPUs that might execute within the kernel. -3. Build with CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y, - which offloads all RCU callbacks to kthreads that can be moved - off of CPUs susceptible to OS jitter. This approach prevents the - rcuc/%u kthreads from having any work to do, so that they are - never awakened. +3. Build with CONFIG_RCU_NOCB_CPU=y and boot with the rcu_nocbs= + boot parameter offloading RCU callbacks from all CPUs susceptible + to OS jitter. This approach prevents the rcuc/%u kthreads from + having any work to do, so that they are never awakened. 4. Ensure that the CPU never enters the kernel, and, in particular, avoid initiating any CPU hotplug operations on this CPU. This is another way of preventing any callbacks from being queued on the CPU, again preventing the rcuc/%u kthreads from having any work to do. -Name: rcuob/%d, rcuop/%d, and rcuos/%d -Purpose: Offload RCU callbacks from the corresponding CPU. +Name: + rcuob/%d, rcuop/%d, and rcuos/%d + +Purpose: + Offload RCU callbacks from the corresponding CPU. + To reduce its OS jitter, do at least one of the following: + 1. Use affinity, cgroups, or other mechanism to force these kthreads to execute on some other CPU. 2. Build with CONFIG_RCU_NOCB_CPU=n, which will prevent these @@ -257,9 +335,14 @@ To reduce its OS jitter, do at least one of the following: note that this will not eliminate OS jitter, but will instead shift it to RCU_SOFTIRQ. -Name: watchdog/%u -Purpose: Detect software lockups on each CPU. +Name: + watchdog/%u + +Purpose: + Detect software lockups on each CPU. + To reduce its OS jitter, do at least one of the following: + 1. Build with CONFIG_LOCKUP_DETECTOR=n, which will prevent these kthreads from being created in the first place. 2. Boot with "nosoftlockup=0", which will also prevent these kthreads diff --git a/Documentation/kobject.txt b/Documentation/kobject.txt index 1be59a3a521c..fc9485d79061 100644 --- a/Documentation/kobject.txt +++ b/Documentation/kobject.txt @@ -1,13 +1,13 @@ +===================================================================== Everything you never wanted to know about kobjects, ksets, and ktypes +===================================================================== -Greg Kroah-Hartman <gregkh@linuxfoundation.org> +:Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org> +:Last updated: December 19, 2007 Based on an original article by Jon Corbet for lwn.net written October 1, 2003 and located at http://lwn.net/Articles/51437/ -Last updated December 19, 2007 - - Part of the difficulty in understanding the driver model - and the kobject abstraction upon which it is built - is that there is no obvious starting place. Dealing with kobjects requires understanding a few different types, @@ -47,6 +47,7 @@ approach will be taken, so we'll go back to kobjects. Embedding kobjects +================== It is rare for kernel code to create a standalone kobject, with one major exception explained below. Instead, kobjects are used to control access to @@ -65,7 +66,7 @@ their own, but are invariably found embedded in the larger objects of interest.) So, for example, the UIO code in drivers/uio/uio.c has a structure that -defines the memory region associated with a uio device: +defines the memory region associated with a uio device:: struct uio_map { struct kobject kobj; @@ -77,7 +78,7 @@ just a matter of using the kobj member. Code that works with kobjects will often have the opposite problem, however: given a struct kobject pointer, what is the pointer to the containing structure? You must avoid tricks (such as assuming that the kobject is at the beginning of the structure) -and, instead, use the container_of() macro, found in <linux/kernel.h>: +and, instead, use the container_of() macro, found in <linux/kernel.h>:: container_of(pointer, type, member) @@ -90,13 +91,13 @@ where: The return value from container_of() is a pointer to the corresponding container type. So, for example, a pointer "kp" to a struct kobject embedded *within* a struct uio_map could be converted to a pointer to the -*containing* uio_map structure with: +*containing* uio_map structure with:: struct uio_map *u_map = container_of(kp, struct uio_map, kobj); For convenience, programmers often define a simple macro for "back-casting" kobject pointers to the containing type. Exactly this happens in the -earlier drivers/uio/uio.c, as you can see here: +earlier drivers/uio/uio.c, as you can see here:: struct uio_map { struct kobject kobj; @@ -106,23 +107,25 @@ earlier drivers/uio/uio.c, as you can see here: #define to_map(map) container_of(map, struct uio_map, kobj) where the macro argument "map" is a pointer to the struct kobject in -question. That macro is subsequently invoked with: +question. That macro is subsequently invoked with:: struct uio_map *map = to_map(kobj); Initialization of kobjects +========================== Code which creates a kobject must, of course, initialize that object. Some -of the internal fields are setup with a (mandatory) call to kobject_init(): +of the internal fields are setup with a (mandatory) call to kobject_init():: void kobject_init(struct kobject *kobj, struct kobj_type *ktype); The ktype is required for a kobject to be created properly, as every kobject must have an associated kobj_type. After calling kobject_init(), to -register the kobject with sysfs, the function kobject_add() must be called: +register the kobject with sysfs, the function kobject_add() must be called:: - int kobject_add(struct kobject *kobj, struct kobject *parent, const char *fmt, ...); + int kobject_add(struct kobject *kobj, struct kobject *parent, + const char *fmt, ...); This sets up the parent of the kobject and the name for the kobject properly. If the kobject is to be associated with a specific kset, @@ -133,7 +136,7 @@ kset itself. As the name of the kobject is set when it is added to the kernel, the name of the kobject should never be manipulated directly. If you must change -the name of the kobject, call kobject_rename(): +the name of the kobject, call kobject_rename():: int kobject_rename(struct kobject *kobj, const char *new_name); @@ -146,12 +149,12 @@ is being removed. If your code needs to call this function, it is incorrect and needs to be fixed. To properly access the name of the kobject, use the function -kobject_name(): +kobject_name():: const char *kobject_name(const struct kobject * kobj); There is a helper function to both initialize and add the kobject to the -kernel at the same time, called surprisingly enough kobject_init_and_add(): +kernel at the same time, called surprisingly enough kobject_init_and_add():: int kobject_init_and_add(struct kobject *kobj, struct kobj_type *ktype, struct kobject *parent, const char *fmt, ...); @@ -161,10 +164,11 @@ kobject_add() functions described above. Uevents +======= After a kobject has been registered with the kobject core, you need to announce to the world that it has been created. This can be done with a -call to kobject_uevent(): +call to kobject_uevent():: int kobject_uevent(struct kobject *kobj, enum kobject_action action); @@ -180,11 +184,12 @@ hand. Reference counts +================ One of the key functions of a kobject is to serve as a reference counter for the object in which it is embedded. As long as references to the object exist, the object (and the code which supports it) must continue to exist. -The low-level functions for manipulating a kobject's reference counts are: +The low-level functions for manipulating a kobject's reference counts are:: struct kobject *kobject_get(struct kobject *kobj); void kobject_put(struct kobject *kobj); @@ -209,21 +214,24 @@ file Documentation/kref.txt in the Linux kernel source tree. Creating "simple" kobjects +========================== Sometimes all that a developer wants is a way to create a simple directory in the sysfs hierarchy, and not have to mess with the whole complication of ksets, show and store functions, and other details. This is the one exception where a single kobject should be created. To create such an -entry, use the function: +entry, use the function:: struct kobject *kobject_create_and_add(char *name, struct kobject *parent); This function will create a kobject and place it in sysfs in the location underneath the specified parent kobject. To create simple attributes -associated with this kobject, use: +associated with this kobject, use:: int sysfs_create_file(struct kobject *kobj, struct attribute *attr); -or + +or:: + int sysfs_create_group(struct kobject *kobj, struct attribute_group *grp); Both types of attributes used here, with a kobject that has been created @@ -236,6 +244,7 @@ implementation of a simple kobject and attributes. ktypes and release methods +========================== One important thing still missing from the discussion is what happens to a kobject when its reference count reaches zero. The code which created the @@ -257,7 +266,7 @@ is good practice to always use kobject_put() after kobject_init() to avoid errors creeping in. This notification is done through a kobject's release() method. Usually -such a method has a form like: +such a method has a form like:: void my_object_release(struct kobject *kobj) { @@ -281,7 +290,7 @@ leak in the kobject core, which makes people unhappy. Interestingly, the release() method is not stored in the kobject itself; instead, it is associated with the ktype. So let us introduce struct -kobj_type: +kobj_type:: struct kobj_type { void (*release)(struct kobject *kobj); @@ -306,6 +315,7 @@ automatically created for any kobject that is registered with this ktype. ksets +===== A kset is merely a collection of kobjects that want to be associated with each other. There is no restriction that they be of the same ktype, but be @@ -335,13 +345,16 @@ kobject) in their parent. As a kset contains a kobject within it, it should always be dynamically created and never declared statically or on the stack. To create a new -kset use: +kset use:: + struct kset *kset_create_and_add(const char *name, struct kset_uevent_ops *u, struct kobject *parent); -When you are finished with the kset, call: +When you are finished with the kset, call:: + void kset_unregister(struct kset *kset); + to destroy it. This removes the kset from sysfs and decrements its reference count. When the reference count goes to zero, the kset will be released. Because other references to the kset may still exist, the release may happen @@ -351,14 +364,14 @@ An example of using a kset can be seen in the samples/kobject/kset-example.c file in the kernel tree. If a kset wishes to control the uevent operations of the kobjects -associated with it, it can use the struct kset_uevent_ops to handle it: +associated with it, it can use the struct kset_uevent_ops to handle it:: -struct kset_uevent_ops { + struct kset_uevent_ops { int (*filter)(struct kset *kset, struct kobject *kobj); const char *(*name)(struct kset *kset, struct kobject *kobj); int (*uevent)(struct kset *kset, struct kobject *kobj, struct kobj_uevent_env *env); -}; + }; The filter function allows a kset to prevent a uevent from being emitted to @@ -386,6 +399,7 @@ added below the parent kobject. Kobject removal +=============== After a kobject has been registered with the kobject core successfully, it must be cleaned up when the code is finished with it. To do that, call @@ -409,6 +423,7 @@ called, and the objects in the former circle release each other. Example code to copy from +========================= For a more complete example of using ksets and kobjects properly, see the example programs samples/kobject/{kobject-example.c,kset-example.c}, diff --git a/Documentation/kprobes.txt b/Documentation/kprobes.txt index 1f6d45abfe42..2335715bf471 100644 --- a/Documentation/kprobes.txt +++ b/Documentation/kprobes.txt @@ -1,30 +1,36 @@ -Title : Kernel Probes (Kprobes) -Authors : Jim Keniston <jkenisto@us.ibm.com> - : Prasanna S Panchamukhi <prasanna.panchamukhi@gmail.com> - : Masami Hiramatsu <mhiramat@redhat.com> - -CONTENTS - -1. Concepts: Kprobes, Jprobes, Return Probes -2. Architectures Supported -3. Configuring Kprobes -4. API Reference -5. Kprobes Features and Limitations -6. Probe Overhead -7. TODO -8. Kprobes Example -9. Jprobes Example -10. Kretprobes Example -Appendix A: The kprobes debugfs interface -Appendix B: The kprobes sysctl interface - -1. Concepts: Kprobes, Jprobes, Return Probes +======================= +Kernel Probes (Kprobes) +======================= + +:Author: Jim Keniston <jkenisto@us.ibm.com> +:Author: Prasanna S Panchamukhi <prasanna.panchamukhi@gmail.com> +:Author: Masami Hiramatsu <mhiramat@redhat.com> + +.. CONTENTS + + 1. Concepts: Kprobes, Jprobes, Return Probes + 2. Architectures Supported + 3. Configuring Kprobes + 4. API Reference + 5. Kprobes Features and Limitations + 6. Probe Overhead + 7. TODO + 8. Kprobes Example + 9. Jprobes Example + 10. Kretprobes Example + Appendix A: The kprobes debugfs interface + Appendix B: The kprobes sysctl interface + +Concepts: Kprobes, Jprobes, Return Probes +========================================= Kprobes enables you to dynamically break into any kernel routine and collect debugging and performance information non-disruptively. You -can trap at almost any kernel code address(*), specifying a handler +can trap at almost any kernel code address [1]_, specifying a handler routine to be invoked when the breakpoint is hit. -(*: some parts of the kernel code can not be trapped, see 1.5 Blacklist) + +.. [1] some parts of the kernel code can not be trapped, see + :ref:`kprobes_blacklist`) There are currently three types of probes: kprobes, jprobes, and kretprobes (also called return probes). A kprobe can be inserted @@ -40,8 +46,8 @@ registration function such as register_kprobe() specifies where the probe is to be inserted and what handler is to be called when the probe is hit. -There are also register_/unregister_*probes() functions for batch -registration/unregistration of a group of *probes. These functions +There are also ``register_/unregister_*probes()`` functions for batch +registration/unregistration of a group of ``*probes``. These functions can speed up unregistration process when you have to unregister a lot of probes at once. @@ -51,9 +57,10 @@ things that you'll need to know in order to make the best use of Kprobes -- e.g., the difference between a pre_handler and a post_handler, and how to use the maxactive and nmissed fields of a kretprobe. But if you're in a hurry to start using Kprobes, you -can skip ahead to section 2. +can skip ahead to :ref:`kprobes_archs_supported`. -1.1 How Does a Kprobe Work? +How Does a Kprobe Work? +----------------------- When a kprobe is registered, Kprobes makes a copy of the probed instruction and replaces the first byte(s) of the probed instruction @@ -75,7 +82,8 @@ After the instruction is single-stepped, Kprobes executes the "post_handler," if any, that is associated with the kprobe. Execution then continues with the instruction following the probepoint. -1.2 How Does a Jprobe Work? +How Does a Jprobe Work? +----------------------- A jprobe is implemented using a kprobe that is placed on a function's entry point. It employs a simple mirroring principle to allow @@ -113,9 +121,11 @@ more than eight function arguments, an argument of more than sixteen bytes, or more than 64 bytes of argument data, depending on architecture). -1.3 Return Probes +Return Probes +------------- -1.3.1 How Does a Return Probe Work? +How Does a Return Probe Work? +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ When you call register_kretprobe(), Kprobes establishes a kprobe at the entry to the function. When the probed function is called and this @@ -150,7 +160,8 @@ zero when the return probe is registered, and is incremented every time the probed function is entered but there is no kretprobe_instance object available for establishing the return probe. -1.3.2 Kretprobe entry-handler +Kretprobe entry-handler +^^^^^^^^^^^^^^^^^^^^^^^ Kretprobes also provides an optional user-specified handler which runs on function entry. This handler is specified by setting the entry_handler @@ -174,7 +185,10 @@ In case probed function is entered but there is no kretprobe_instance object available, then in addition to incrementing the nmissed count, the user entry_handler invocation is also skipped. -1.4 How Does Jump Optimization Work? +.. _kprobes_jump_optimization: + +How Does Jump Optimization Work? +-------------------------------- If your kernel is built with CONFIG_OPTPROBES=y (currently this flag is automatically set 'y' on x86/x86-64, non-preemptive kernel) and @@ -182,53 +196,60 @@ the "debug.kprobes_optimization" kernel parameter is set to 1 (see sysctl(8)), Kprobes tries to reduce probe-hit overhead by using a jump instruction instead of a breakpoint instruction at each probepoint. -1.4.1 Init a Kprobe +Init a Kprobe +^^^^^^^^^^^^^ When a probe is registered, before attempting this optimization, Kprobes inserts an ordinary, breakpoint-based kprobe at the specified address. So, even if it's not possible to optimize this particular probepoint, there'll be a probe there. -1.4.2 Safety Check +Safety Check +^^^^^^^^^^^^ Before optimizing a probe, Kprobes performs the following safety checks: - Kprobes verifies that the region that will be replaced by the jump -instruction (the "optimized region") lies entirely within one function. -(A jump instruction is multiple bytes, and so may overlay multiple -instructions.) + instruction (the "optimized region") lies entirely within one function. + (A jump instruction is multiple bytes, and so may overlay multiple + instructions.) - Kprobes analyzes the entire function and verifies that there is no -jump into the optimized region. Specifically: + jump into the optimized region. Specifically: + - the function contains no indirect jump; - the function contains no instruction that causes an exception (since - the fixup code triggered by the exception could jump back into the - optimized region -- Kprobes checks the exception tables to verify this); - and + the fixup code triggered by the exception could jump back into the + optimized region -- Kprobes checks the exception tables to verify this); - there is no near jump to the optimized region (other than to the first - byte). + byte). - For each instruction in the optimized region, Kprobes verifies that -the instruction can be executed out of line. + the instruction can be executed out of line. -1.4.3 Preparing Detour Buffer +Preparing Detour Buffer +^^^^^^^^^^^^^^^^^^^^^^^ Next, Kprobes prepares a "detour" buffer, which contains the following instruction sequence: + - code to push the CPU's registers (emulating a breakpoint trap) - a call to the trampoline code which calls user's probe handlers. - code to restore registers - the instructions from the optimized region - a jump back to the original execution path. -1.4.4 Pre-optimization +Pre-optimization +^^^^^^^^^^^^^^^^ After preparing the detour buffer, Kprobes verifies that none of the following situations exist: + - The probe has either a break_handler (i.e., it's a jprobe) or a -post_handler. + post_handler. - Other instructions in the optimized region are probed. - The probe is disabled. + In any of the above cases, Kprobes won't start optimizing the probe. Since these are temporary situations, Kprobes tries to start optimizing it again if the situation is changed. @@ -240,21 +261,23 @@ Kprobes returns control to the original instruction path by setting the CPU's instruction pointer to the copied code in the detour buffer -- thus at least avoiding the single-step. -1.4.5 Optimization +Optimization +^^^^^^^^^^^^ The Kprobe-optimizer doesn't insert the jump instruction immediately; rather, it calls synchronize_sched() for safety first, because it's possible for a CPU to be interrupted in the middle of executing the -optimized region(*). As you know, synchronize_sched() can ensure +optimized region [3]_. As you know, synchronize_sched() can ensure that all interruptions that were active when synchronize_sched() was called are done, but only if CONFIG_PREEMPT=n. So, this version -of kprobe optimization supports only kernels with CONFIG_PREEMPT=n.(**) +of kprobe optimization supports only kernels with CONFIG_PREEMPT=n [4]_. After that, the Kprobe-optimizer calls stop_machine() to replace the optimized region with a jump instruction to the detour buffer, using text_poke_smp(). -1.4.6 Unoptimization +Unoptimization +^^^^^^^^^^^^^^ When an optimized kprobe is unregistered, disabled, or blocked by another kprobe, it will be unoptimized. If this happens before @@ -263,15 +286,15 @@ optimized list. If the optimization has been done, the jump is replaced with the original code (except for an int3 breakpoint in the first byte) by using text_poke_smp(). -(*)Please imagine that the 2nd instruction is interrupted and then -the optimizer replaces the 2nd instruction with the jump *address* -while the interrupt handler is running. When the interrupt -returns to original address, there is no valid instruction, -and it causes an unexpected result. +.. [3] Please imagine that the 2nd instruction is interrupted and then + the optimizer replaces the 2nd instruction with the jump *address* + while the interrupt handler is running. When the interrupt + returns to original address, there is no valid instruction, + and it causes an unexpected result. -(**)This optimization-safety checking may be replaced with the -stop-machine method that ksplice uses for supporting a CONFIG_PREEMPT=y -kernel. +.. [4] This optimization-safety checking may be replaced with the + stop-machine method that ksplice uses for supporting a CONFIG_PREEMPT=y + kernel. NOTE for geeks: The jump optimization changes the kprobe's pre_handler behavior. @@ -280,11 +303,17 @@ path by changing regs->ip and returning 1. However, when the probe is optimized, that modification is ignored. Thus, if you want to tweak the kernel's execution path, you need to suppress optimization, using one of the following techniques: + - Specify an empty function for the kprobe's post_handler or break_handler. - or + +or + - Execute 'sysctl -w debug.kprobes_optimization=n' -1.5 Blacklist +.. _kprobes_blacklist: + +Blacklist +--------- Kprobes can probe most of the kernel except itself. This means that there are some functions where kprobes cannot probe. Probing @@ -297,7 +326,10 @@ to specify a blacklisted function. Kprobes checks the given probe address against the blacklist and rejects registering it, if the given address is in the blacklist. -2. Architectures Supported +.. _kprobes_archs_supported: + +Architectures Supported +======================= Kprobes, jprobes, and return probes are implemented on the following architectures: @@ -312,7 +344,8 @@ architectures: - mips - s390 -3. Configuring Kprobes +Configuring Kprobes +=================== When configuring the kernel using make menuconfig/xconfig/oldconfig, ensure that CONFIG_KPROBES is set to "y". Under "General setup", look @@ -331,7 +364,8 @@ it useful to "Compile the kernel with debug info" (CONFIG_DEBUG_INFO), so you can use "objdump -d -l vmlinux" to see the source-to-object code mapping. -4. API Reference +API Reference +============= The Kprobes API includes a "register" function and an "unregister" function for each type of probe. The API also includes "register_*probes" @@ -340,10 +374,13 @@ Here are terse, mini-man-page specifications for these functions and the associated probe handlers that you'll write. See the files in the samples/kprobes/ sub-directory for examples. -4.1 register_kprobe +register_kprobe +--------------- + +:: -#include <linux/kprobes.h> -int register_kprobe(struct kprobe *kp); + #include <linux/kprobes.h> + int register_kprobe(struct kprobe *kp); Sets a breakpoint at the address kp->addr. When the breakpoint is hit, Kprobes calls kp->pre_handler. After the probed instruction @@ -354,61 +391,68 @@ kp->fault_handler. Any or all handlers can be NULL. If kp->flags is set KPROBE_FLAG_DISABLED, that kp will be registered but disabled, so, its handlers aren't hit until calling enable_kprobe(kp). -NOTE: -1. With the introduction of the "symbol_name" field to struct kprobe, -the probepoint address resolution will now be taken care of by the kernel. -The following will now work: +.. note:: + + 1. With the introduction of the "symbol_name" field to struct kprobe, + the probepoint address resolution will now be taken care of by the kernel. + The following will now work:: kp.symbol_name = "symbol_name"; -(64-bit powerpc intricacies such as function descriptors are handled -transparently) + (64-bit powerpc intricacies such as function descriptors are handled + transparently) -2. Use the "offset" field of struct kprobe if the offset into the symbol -to install a probepoint is known. This field is used to calculate the -probepoint. + 2. Use the "offset" field of struct kprobe if the offset into the symbol + to install a probepoint is known. This field is used to calculate the + probepoint. -3. Specify either the kprobe "symbol_name" OR the "addr". If both are -specified, kprobe registration will fail with -EINVAL. + 3. Specify either the kprobe "symbol_name" OR the "addr". If both are + specified, kprobe registration will fail with -EINVAL. -4. With CISC architectures (such as i386 and x86_64), the kprobes code -does not validate if the kprobe.addr is at an instruction boundary. -Use "offset" with caution. + 4. With CISC architectures (such as i386 and x86_64), the kprobes code + does not validate if the kprobe.addr is at an instruction boundary. + Use "offset" with caution. register_kprobe() returns 0 on success, or a negative errno otherwise. -User's pre-handler (kp->pre_handler): -#include <linux/kprobes.h> -#include <linux/ptrace.h> -int pre_handler(struct kprobe *p, struct pt_regs *regs); +User's pre-handler (kp->pre_handler):: + + #include <linux/kprobes.h> + #include <linux/ptrace.h> + int pre_handler(struct kprobe *p, struct pt_regs *regs); Called with p pointing to the kprobe associated with the breakpoint, and regs pointing to the struct containing the registers saved when the breakpoint was hit. Return 0 here unless you're a Kprobes geek. -User's post-handler (kp->post_handler): -#include <linux/kprobes.h> -#include <linux/ptrace.h> -void post_handler(struct kprobe *p, struct pt_regs *regs, - unsigned long flags); +User's post-handler (kp->post_handler):: + + #include <linux/kprobes.h> + #include <linux/ptrace.h> + void post_handler(struct kprobe *p, struct pt_regs *regs, + unsigned long flags); p and regs are as described for the pre_handler. flags always seems to be zero. -User's fault-handler (kp->fault_handler): -#include <linux/kprobes.h> -#include <linux/ptrace.h> -int fault_handler(struct kprobe *p, struct pt_regs *regs, int trapnr); +User's fault-handler (kp->fault_handler):: + + #include <linux/kprobes.h> + #include <linux/ptrace.h> + int fault_handler(struct kprobe *p, struct pt_regs *regs, int trapnr); p and regs are as described for the pre_handler. trapnr is the architecture-specific trap number associated with the fault (e.g., on i386, 13 for a general protection fault or 14 for a page fault). Returns 1 if it successfully handled the exception. -4.2 register_jprobe +register_jprobe +--------------- -#include <linux/kprobes.h> -int register_jprobe(struct jprobe *jp) +:: + + #include <linux/kprobes.h> + int register_jprobe(struct jprobe *jp) Sets a breakpoint at the address jp->kp.addr, which must be the address of the first instruction of a function. When the breakpoint is hit, @@ -423,10 +467,13 @@ declaration must match. register_jprobe() returns 0 on success, or a negative errno otherwise. -4.3 register_kretprobe +register_kretprobe +------------------ + +:: -#include <linux/kprobes.h> -int register_kretprobe(struct kretprobe *rp); + #include <linux/kprobes.h> + int register_kretprobe(struct kretprobe *rp); Establishes a return probe for the function whose address is rp->kp.addr. When that function returns, Kprobes calls rp->handler. @@ -436,14 +483,17 @@ register_kretprobe(); see "How Does a Return Probe Work?" for details. register_kretprobe() returns 0 on success, or a negative errno otherwise. -User's return-probe handler (rp->handler): -#include <linux/kprobes.h> -#include <linux/ptrace.h> -int kretprobe_handler(struct kretprobe_instance *ri, struct pt_regs *regs); +User's return-probe handler (rp->handler):: + + #include <linux/kprobes.h> + #include <linux/ptrace.h> + int kretprobe_handler(struct kretprobe_instance *ri, + struct pt_regs *regs); regs is as described for kprobe.pre_handler. ri points to the kretprobe_instance object, of which the following fields may be of interest: + - ret_addr: the return address - rp: points to the corresponding kretprobe object - task: points to the corresponding task struct @@ -456,74 +506,94 @@ the architecture's ABI. The handler's return value is currently ignored. -4.4 unregister_*probe +unregister_*probe +------------------ + +:: -#include <linux/kprobes.h> -void unregister_kprobe(struct kprobe *kp); -void unregister_jprobe(struct jprobe *jp); -void unregister_kretprobe(struct kretprobe *rp); + #include <linux/kprobes.h> + void unregister_kprobe(struct kprobe *kp); + void unregister_jprobe(struct jprobe *jp); + void unregister_kretprobe(struct kretprobe *rp); Removes the specified probe. The unregister function can be called at any time after the probe has been registered. -NOTE: -If the functions find an incorrect probe (ex. an unregistered probe), -they clear the addr field of the probe. +.. note:: + + If the functions find an incorrect probe (ex. an unregistered probe), + they clear the addr field of the probe. + +register_*probes +---------------- -4.5 register_*probes +:: -#include <linux/kprobes.h> -int register_kprobes(struct kprobe **kps, int num); -int register_kretprobes(struct kretprobe **rps, int num); -int register_jprobes(struct jprobe **jps, int num); + #include <linux/kprobes.h> + int register_kprobes(struct kprobe **kps, int num); + int register_kretprobes(struct kretprobe **rps, int num); + int register_jprobes(struct jprobe **jps, int num); Registers each of the num probes in the specified array. If any error occurs during registration, all probes in the array, up to the bad probe, are safely unregistered before the register_*probes function returns. -- kps/rps/jps: an array of pointers to *probe data structures + +- kps/rps/jps: an array of pointers to ``*probe`` data structures - num: the number of the array entries. -NOTE: -You have to allocate(or define) an array of pointers and set all -of the array entries before using these functions. +.. note:: + + You have to allocate(or define) an array of pointers and set all + of the array entries before using these functions. -4.6 unregister_*probes +unregister_*probes +------------------ -#include <linux/kprobes.h> -void unregister_kprobes(struct kprobe **kps, int num); -void unregister_kretprobes(struct kretprobe **rps, int num); -void unregister_jprobes(struct jprobe **jps, int num); +:: + + #include <linux/kprobes.h> + void unregister_kprobes(struct kprobe **kps, int num); + void unregister_kretprobes(struct kretprobe **rps, int num); + void unregister_jprobes(struct jprobe **jps, int num); Removes each of the num probes in the specified array at once. -NOTE: -If the functions find some incorrect probes (ex. unregistered -probes) in the specified array, they clear the addr field of those -incorrect probes. However, other probes in the array are -unregistered correctly. +.. note:: + + If the functions find some incorrect probes (ex. unregistered + probes) in the specified array, they clear the addr field of those + incorrect probes. However, other probes in the array are + unregistered correctly. -4.7 disable_*probe +disable_*probe +-------------- -#include <linux/kprobes.h> -int disable_kprobe(struct kprobe *kp); -int disable_kretprobe(struct kretprobe *rp); -int disable_jprobe(struct jprobe *jp); +:: -Temporarily disables the specified *probe. You can enable it again by using + #include <linux/kprobes.h> + int disable_kprobe(struct kprobe *kp); + int disable_kretprobe(struct kretprobe *rp); + int disable_jprobe(struct jprobe *jp); + +Temporarily disables the specified ``*probe``. You can enable it again by using enable_*probe(). You must specify the probe which has been registered. -4.8 enable_*probe +enable_*probe +------------- + +:: -#include <linux/kprobes.h> -int enable_kprobe(struct kprobe *kp); -int enable_kretprobe(struct kretprobe *rp); -int enable_jprobe(struct jprobe *jp); + #include <linux/kprobes.h> + int enable_kprobe(struct kprobe *kp); + int enable_kretprobe(struct kretprobe *rp); + int enable_jprobe(struct jprobe *jp); -Enables *probe which has been disabled by disable_*probe(). You must specify +Enables ``*probe`` which has been disabled by disable_*probe(). You must specify the probe which has been registered. -5. Kprobes Features and Limitations +Kprobes Features and Limitations +================================ Kprobes allows multiple probes at the same address. Currently, however, there cannot be multiple jprobes on the same function at @@ -538,7 +608,7 @@ are discussed in this section. The register_*probe functions will return -EINVAL if you attempt to install a probe in the code that implements Kprobes (mostly -kernel/kprobes.c and arch/*/kernel/kprobes.c, but also functions such +kernel/kprobes.c and ``arch/*/kernel/kprobes.c``, but also functions such as do_page_fault and notifier_call_chain). If you install a probe in an inline-able function, Kprobes makes @@ -602,19 +672,21 @@ explain it, we introduce some terminology. Imagine a 3-instruction sequence consisting of a two 2-byte instructions and one 3-byte instruction. - IA - | -[-2][-1][0][1][2][3][4][5][6][7] - [ins1][ins2][ ins3 ] - [<- DCR ->] - [<- JTPR ->] +:: -ins1: 1st Instruction -ins2: 2nd Instruction -ins3: 3rd Instruction -IA: Insertion Address -JTPR: Jump Target Prohibition Region -DCR: Detoured Code Region + IA + | + [-2][-1][0][1][2][3][4][5][6][7] + [ins1][ins2][ ins3 ] + [<- DCR ->] + [<- JTPR ->] + + ins1: 1st Instruction + ins2: 2nd Instruction + ins3: 3rd Instruction + IA: Insertion Address + JTPR: Jump Target Prohibition Region + DCR: Detoured Code Region The instructions in DCR are copied to the out-of-line buffer of the kprobe, because the bytes in DCR are replaced by @@ -628,7 +700,8 @@ d) DCR must not straddle the border between functions. Anyway, these limitations are checked by the in-kernel instruction decoder, so you don't need to worry about that. -6. Probe Overhead +Probe Overhead +============== On a typical CPU in use in 2005, a kprobe hit takes 0.5 to 1.0 microseconds to process. Specifically, a benchmark that hits the same @@ -638,70 +711,80 @@ return-probe hit typically takes 50-75% longer than a kprobe hit. When you have a return probe set on a function, adding a kprobe at the entry to that function adds essentially no overhead. -Here are sample overhead figures (in usec) for different architectures. -k = kprobe; j = jprobe; r = return probe; kr = kprobe + return probe -on same function; jr = jprobe + return probe on same function +Here are sample overhead figures (in usec) for different architectures:: + + k = kprobe; j = jprobe; r = return probe; kr = kprobe + return probe + on same function; jr = jprobe + return probe on same function:: -i386: Intel Pentium M, 1495 MHz, 2957.31 bogomips -k = 0.57 usec; j = 1.00; r = 0.92; kr = 0.99; jr = 1.40 + i386: Intel Pentium M, 1495 MHz, 2957.31 bogomips + k = 0.57 usec; j = 1.00; r = 0.92; kr = 0.99; jr = 1.40 -x86_64: AMD Opteron 246, 1994 MHz, 3971.48 bogomips -k = 0.49 usec; j = 0.76; r = 0.80; kr = 0.82; jr = 1.07 + x86_64: AMD Opteron 246, 1994 MHz, 3971.48 bogomips + k = 0.49 usec; j = 0.76; r = 0.80; kr = 0.82; jr = 1.07 -ppc64: POWER5 (gr), 1656 MHz (SMT disabled, 1 virtual CPU per physical CPU) -k = 0.77 usec; j = 1.31; r = 1.26; kr = 1.45; jr = 1.99 + ppc64: POWER5 (gr), 1656 MHz (SMT disabled, 1 virtual CPU per physical CPU) + k = 0.77 usec; j = 1.31; r = 1.26; kr = 1.45; jr = 1.99 -6.1 Optimized Probe Overhead +Optimized Probe Overhead +------------------------ Typically, an optimized kprobe hit takes 0.07 to 0.1 microseconds to -process. Here are sample overhead figures (in usec) for x86 architectures. -k = unoptimized kprobe, b = boosted (single-step skipped), o = optimized kprobe, -r = unoptimized kretprobe, rb = boosted kretprobe, ro = optimized kretprobe. +process. Here are sample overhead figures (in usec) for x86 architectures:: -i386: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips -k = 0.80 usec; b = 0.33; o = 0.05; r = 1.10; rb = 0.61; ro = 0.33 + k = unoptimized kprobe, b = boosted (single-step skipped), o = optimized kprobe, + r = unoptimized kretprobe, rb = boosted kretprobe, ro = optimized kretprobe. -x86-64: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips -k = 0.99 usec; b = 0.43; o = 0.06; r = 1.24; rb = 0.68; ro = 0.30 + i386: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips + k = 0.80 usec; b = 0.33; o = 0.05; r = 1.10; rb = 0.61; ro = 0.33 -7. TODO + x86-64: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips + k = 0.99 usec; b = 0.43; o = 0.06; r = 1.24; rb = 0.68; ro = 0.30 + +TODO +==== a. SystemTap (http://sourceware.org/systemtap): Provides a simplified -programming interface for probe-based instrumentation. Try it out. + programming interface for probe-based instrumentation. Try it out. b. Kernel return probes for sparc64. c. Support for other architectures. d. User-space probes. e. Watchpoint probes (which fire on data references). -8. Kprobes Example +Kprobes Example +=============== See samples/kprobes/kprobe_example.c -9. Jprobes Example +Jprobes Example +=============== See samples/kprobes/jprobe_example.c -10. Kretprobes Example +Kretprobes Example +================== See samples/kprobes/kretprobe_example.c For additional information on Kprobes, refer to the following URLs: -http://www-106.ibm.com/developerworks/library/l-kprobes.html?ca=dgr-lnxw42Kprobe -http://www.redhat.com/magazine/005mar05/features/kprobes/ -http://www-users.cs.umn.edu/~boutcher/kprobes/ -http://www.linuxsymposium.org/2006/linuxsymposium_procv2.pdf (pages 101-115) + +- http://www-106.ibm.com/developerworks/library/l-kprobes.html?ca=dgr-lnxw42Kprobe +- http://www.redhat.com/magazine/005mar05/features/kprobes/ +- http://www-users.cs.umn.edu/~boutcher/kprobes/ +- http://www.linuxsymposium.org/2006/linuxsymposium_procv2.pdf (pages 101-115) -Appendix A: The kprobes debugfs interface +The kprobes debugfs interface +============================= + With recent kernels (> 2.6.20) the list of registered kprobes is visible under the /sys/kernel/debug/kprobes/ directory (assuming debugfs is mounted at //sys/kernel/debug). -/sys/kernel/debug/kprobes/list: Lists all registered probes on the system +/sys/kernel/debug/kprobes/list: Lists all registered probes on the system:: -c015d71a k vfs_read+0x0 -c011a316 j do_fork+0x0 -c03dedc5 r tcp_v4_rcv+0x0 + c015d71a k vfs_read+0x0 + c011a316 j do_fork+0x0 + c03dedc5 r tcp_v4_rcv+0x0 The first column provides the kernel address where the probe is inserted. The second column identifies the type of probe (k - kprobe, r - kretprobe @@ -725,17 +808,19 @@ change each probe's disabling state. This means that disabled kprobes (marked [DISABLED]) will be not enabled if you turn ON all kprobes by this knob. -Appendix B: The kprobes sysctl interface +The kprobes sysctl interface +============================ /proc/sys/debug/kprobes-optimization: Turn kprobes optimization ON/OFF. When CONFIG_OPTPROBES=y, this sysctl interface appears and it provides a knob to globally and forcibly turn jump optimization (see section -1.4) ON or OFF. By default, jump optimization is allowed (ON). -If you echo "0" to this file or set "debug.kprobes_optimization" to -0 via sysctl, all optimized probes will be unoptimized, and any new -probes registered after that will not be optimized. Note that this -knob *changes* the optimized state. This means that optimized probes -(marked [OPTIMIZED]) will be unoptimized ([OPTIMIZED] tag will be +:ref:`kprobes_jump_optimization`) ON or OFF. By default, jump optimization +is allowed (ON). If you echo "0" to this file or set +"debug.kprobes_optimization" to 0 via sysctl, all optimized probes will be +unoptimized, and any new probes registered after that will not be optimized. + +Note that this knob *changes* the optimized state. This means that optimized +probes (marked [OPTIMIZED]) will be unoptimized ([OPTIMIZED] tag will be removed). If the knob is turned on, they will be optimized again. diff --git a/Documentation/kref.txt b/Documentation/kref.txt index d26a27ca964d..3af384156d7e 100644 --- a/Documentation/kref.txt +++ b/Documentation/kref.txt @@ -1,24 +1,42 @@ +=================================================== +Adding reference counters (krefs) to kernel objects +=================================================== + +:Author: Corey Minyard <minyard@acm.org> +:Author: Thomas Hellstrom <thellstrom@vmware.com> + +A lot of this was lifted from Greg Kroah-Hartman's 2004 OLS paper and +presentation on krefs, which can be found at: + + - http://www.kroah.com/linux/talks/ols_2004_kref_paper/Reprint-Kroah-Hartman-OLS2004.pdf + - http://www.kroah.com/linux/talks/ols_2004_kref_talk/ + +Introduction +============ krefs allow you to add reference counters to your objects. If you have objects that are used in multiple places and passed around, and you don't have refcounts, your code is almost certainly broken. If you want refcounts, krefs are the way to go. -To use a kref, add one to your data structures like: +To use a kref, add one to your data structures like:: -struct my_data -{ + struct my_data + { . . struct kref refcount; . . -}; + }; The kref can occur anywhere within the data structure. +Initialization +============== + You must initialize the kref after you allocate it. To do this, call -kref_init as so: +kref_init as so:: struct my_data *data; @@ -29,18 +47,25 @@ kref_init as so: This sets the refcount in the kref to 1. +Kref rules +========== + Once you have an initialized kref, you must follow the following rules: 1) If you make a non-temporary copy of a pointer, especially if it can be passed to another thread of execution, you must - increment the refcount with kref_get() before passing it off: + increment the refcount with kref_get() before passing it off:: + kref_get(&data->refcount); + If you already have a valid pointer to a kref-ed structure (the refcount cannot go to zero) you may do this without a lock. -2) When you are done with a pointer, you must call kref_put(): +2) When you are done with a pointer, you must call kref_put():: + kref_put(&data->refcount, data_release); + If this is the last reference to the pointer, the release routine will be called. If the code never tries to get a valid pointer to a kref-ed structure without already @@ -53,25 +78,25 @@ rules: structure must remain valid during the kref_get(). For example, if you allocate some data and then pass it to another -thread to process: +thread to process:: -void data_release(struct kref *ref) -{ + void data_release(struct kref *ref) + { struct my_data *data = container_of(ref, struct my_data, refcount); kfree(data); -} + } -void more_data_handling(void *cb_data) -{ + void more_data_handling(void *cb_data) + { struct my_data *data = cb_data; . . do stuff with data here . kref_put(&data->refcount, data_release); -} + } -int my_data_handler(void) -{ + int my_data_handler(void) + { int rv = 0; struct my_data *data; struct task_struct *task; @@ -91,10 +116,10 @@ int my_data_handler(void) . . do stuff with data here . - out: + out: kref_put(&data->refcount, data_release); return rv; -} + } This way, it doesn't matter what order the two threads handle the data, the kref_put() handles knowing when the data is not referenced @@ -104,7 +129,7 @@ put needs no lock because nothing tries to get the data without already holding a pointer. Note that the "before" in rule 1 is very important. You should never -do something like: +do something like:: task = kthread_run(more_data_handling, data, "more_data_handling"); if (task == ERR_PTR(-ENOMEM)) { @@ -124,14 +149,14 @@ bad style. Don't do it. There are some situations where you can optimize the gets and puts. For instance, if you are done with an object and enqueuing it for something else or passing it off to something else, there is no reason -to do a get then a put: +to do a get then a put:: /* Silly extra get and put */ kref_get(&obj->ref); enqueue(obj); kref_put(&obj->ref, obj_cleanup); -Just do the enqueue. A comment about this is always welcome: +Just do the enqueue. A comment about this is always welcome:: enqueue(obj); /* We are done with obj, so we pass our refcount off @@ -142,109 +167,99 @@ instance, you have a list of items that are each kref-ed, and you wish to get the first one. You can't just pull the first item off the list and kref_get() it. That violates rule 3 because you are not already holding a valid pointer. You must add a mutex (or some other lock). -For instance: - -static DEFINE_MUTEX(mutex); -static LIST_HEAD(q); -struct my_data -{ - struct kref refcount; - struct list_head link; -}; - -static struct my_data *get_entry() -{ - struct my_data *entry = NULL; - mutex_lock(&mutex); - if (!list_empty(&q)) { - entry = container_of(q.next, struct my_data, link); - kref_get(&entry->refcount); +For instance:: + + static DEFINE_MUTEX(mutex); + static LIST_HEAD(q); + struct my_data + { + struct kref refcount; + struct list_head link; + }; + + static struct my_data *get_entry() + { + struct my_data *entry = NULL; + mutex_lock(&mutex); + if (!list_empty(&q)) { + entry = container_of(q.next, struct my_data, link); + kref_get(&entry->refcount); + } + mutex_unlock(&mutex); + return entry; } - mutex_unlock(&mutex); - return entry; -} -static void release_entry(struct kref *ref) -{ - struct my_data *entry = container_of(ref, struct my_data, refcount); + static void release_entry(struct kref *ref) + { + struct my_data *entry = container_of(ref, struct my_data, refcount); - list_del(&entry->link); - kfree(entry); -} + list_del(&entry->link); + kfree(entry); + } -static void put_entry(struct my_data *entry) -{ - mutex_lock(&mutex); - kref_put(&entry->refcount, release_entry); - mutex_unlock(&mutex); -} + static void put_entry(struct my_data *entry) + { + mutex_lock(&mutex); + kref_put(&entry->refcount, release_entry); + mutex_unlock(&mutex); + } The kref_put() return value is useful if you do not want to hold the lock during the whole release operation. Say you didn't want to call kfree() with the lock held in the example above (since it is kind of -pointless to do so). You could use kref_put() as follows: +pointless to do so). You could use kref_put() as follows:: -static void release_entry(struct kref *ref) -{ - /* All work is done after the return from kref_put(). */ -} + static void release_entry(struct kref *ref) + { + /* All work is done after the return from kref_put(). */ + } -static void put_entry(struct my_data *entry) -{ - mutex_lock(&mutex); - if (kref_put(&entry->refcount, release_entry)) { - list_del(&entry->link); - mutex_unlock(&mutex); - kfree(entry); - } else - mutex_unlock(&mutex); -} + static void put_entry(struct my_data *entry) + { + mutex_lock(&mutex); + if (kref_put(&entry->refcount, release_entry)) { + list_del(&entry->link); + mutex_unlock(&mutex); + kfree(entry); + } else + mutex_unlock(&mutex); + } This is really more useful if you have to call other routines as part of the free operations that could take a long time or might claim the same lock. Note that doing everything in the release routine is still preferred as it is a little neater. - -Corey Minyard <minyard@acm.org> - -A lot of this was lifted from Greg Kroah-Hartman's 2004 OLS paper and -presentation on krefs, which can be found at: - http://www.kroah.com/linux/talks/ols_2004_kref_paper/Reprint-Kroah-Hartman-OLS2004.pdf -and: - http://www.kroah.com/linux/talks/ols_2004_kref_talk/ - - The above example could also be optimized using kref_get_unless_zero() in -the following way: - -static struct my_data *get_entry() -{ - struct my_data *entry = NULL; - mutex_lock(&mutex); - if (!list_empty(&q)) { - entry = container_of(q.next, struct my_data, link); - if (!kref_get_unless_zero(&entry->refcount)) - entry = NULL; +the following way:: + + static struct my_data *get_entry() + { + struct my_data *entry = NULL; + mutex_lock(&mutex); + if (!list_empty(&q)) { + entry = container_of(q.next, struct my_data, link); + if (!kref_get_unless_zero(&entry->refcount)) + entry = NULL; + } + mutex_unlock(&mutex); + return entry; } - mutex_unlock(&mutex); - return entry; -} -static void release_entry(struct kref *ref) -{ - struct my_data *entry = container_of(ref, struct my_data, refcount); + static void release_entry(struct kref *ref) + { + struct my_data *entry = container_of(ref, struct my_data, refcount); - mutex_lock(&mutex); - list_del(&entry->link); - mutex_unlock(&mutex); - kfree(entry); -} + mutex_lock(&mutex); + list_del(&entry->link); + mutex_unlock(&mutex); + kfree(entry); + } -static void put_entry(struct my_data *entry) -{ - kref_put(&entry->refcount, release_entry); -} + static void put_entry(struct my_data *entry) + { + kref_put(&entry->refcount, release_entry); + } Which is useful to remove the mutex lock around kref_put() in put_entry(), but it's important that kref_get_unless_zero is enclosed in the same critical @@ -254,51 +269,51 @@ Note that it is illegal to use kref_get_unless_zero without checking its return value. If you are sure (by already having a valid pointer) that kref_get_unless_zero() will return true, then use kref_get() instead. -The function kref_get_unless_zero also makes it possible to use rcu -locking for lookups in the above example: +Krefs and RCU +============= -struct my_data -{ - struct rcu_head rhead; - . - struct kref refcount; - . - . -}; - -static struct my_data *get_entry_rcu() -{ - struct my_data *entry = NULL; - rcu_read_lock(); - if (!list_empty(&q)) { - entry = container_of(q.next, struct my_data, link); - if (!kref_get_unless_zero(&entry->refcount)) - entry = NULL; +The function kref_get_unless_zero also makes it possible to use rcu +locking for lookups in the above example:: + + struct my_data + { + struct rcu_head rhead; + . + struct kref refcount; + . + . + }; + + static struct my_data *get_entry_rcu() + { + struct my_data *entry = NULL; + rcu_read_lock(); + if (!list_empty(&q)) { + entry = container_of(q.next, struct my_data, link); + if (!kref_get_unless_zero(&entry->refcount)) + entry = NULL; + } + rcu_read_unlock(); + return entry; } - rcu_read_unlock(); - return entry; -} -static void release_entry_rcu(struct kref *ref) -{ - struct my_data *entry = container_of(ref, struct my_data, refcount); + static void release_entry_rcu(struct kref *ref) + { + struct my_data *entry = container_of(ref, struct my_data, refcount); - mutex_lock(&mutex); - list_del_rcu(&entry->link); - mutex_unlock(&mutex); - kfree_rcu(entry, rhead); -} + mutex_lock(&mutex); + list_del_rcu(&entry->link); + mutex_unlock(&mutex); + kfree_rcu(entry, rhead); + } -static void put_entry(struct my_data *entry) -{ - kref_put(&entry->refcount, release_entry_rcu); -} + static void put_entry(struct my_data *entry) + { + kref_put(&entry->refcount, release_entry_rcu); + } But note that the struct kref member needs to remain in valid memory for a rcu grace period after release_entry_rcu was called. That can be accomplished by using kfree_rcu(entry, rhead) as done above, or by calling synchronize_rcu() before using kfree, but note that synchronize_rcu() may sleep for a substantial amount of time. - - -Thomas Hellstrom <thellstrom@vmware.com> diff --git a/Documentation/ldm.txt b/Documentation/ldm.txt index 4f80edd14d0a..12c571368e73 100644 --- a/Documentation/ldm.txt +++ b/Documentation/ldm.txt @@ -1,9 +1,9 @@ +========================================== +LDM - Logical Disk Manager (Dynamic Disks) +========================================== - LDM - Logical Disk Manager (Dynamic Disks) - ------------------------------------------ - -Originally Written by FlatCap - Richard Russon <ldm@flatcap.org>. -Last Updated by Anton Altaparmakov on 30 March 2007 for Windows Vista. +:Author: Originally Written by FlatCap - Richard Russon <ldm@flatcap.org>. +:Last Updated: Anton Altaparmakov on 30 March 2007 for Windows Vista. Overview -------- @@ -37,24 +37,36 @@ Example ------- Below we have a 50MiB disk, divided into seven partitions. -N.B. The missing 1MiB at the end of the disk is where the LDM database is - stored. - - Device | Offset Bytes Sectors MiB | Size Bytes Sectors MiB - -------+----------------------------+--------------------------- - hda | 0 0 0 | 52428800 102400 50 - hda1 | 51380224 100352 49 | 1048576 2048 1 - hda2 | 16384 32 0 | 6979584 13632 6 - hda3 | 6995968 13664 6 | 10485760 20480 10 - hda4 | 17481728 34144 16 | 4194304 8192 4 - hda5 | 21676032 42336 20 | 5242880 10240 5 - hda6 | 26918912 52576 25 | 10485760 20480 10 - hda7 | 37404672 73056 35 | 13959168 27264 13 + +.. note:: + + The missing 1MiB at the end of the disk is where the LDM database is + stored. + ++-------++--------------+---------+-----++--------------+---------+----+ +|Device || Offset Bytes | Sectors | MiB || Size Bytes | Sectors | MiB| ++=======++==============+=========+=====++==============+=========+====+ +|hda || 0 | 0 | 0 || 52428800 | 102400 | 50| ++-------++--------------+---------+-----++--------------+---------+----+ +|hda1 || 51380224 | 100352 | 49 || 1048576 | 2048 | 1| ++-------++--------------+---------+-----++--------------+---------+----+ +|hda2 || 16384 | 32 | 0 || 6979584 | 13632 | 6| ++-------++--------------+---------+-----++--------------+---------+----+ +|hda3 || 6995968 | 13664 | 6 || 10485760 | 20480 | 10| ++-------++--------------+---------+-----++--------------+---------+----+ +|hda4 || 17481728 | 34144 | 16 || 4194304 | 8192 | 4| ++-------++--------------+---------+-----++--------------+---------+----+ +|hda5 || 21676032 | 42336 | 20 || 5242880 | 10240 | 5| ++-------++--------------+---------+-----++--------------+---------+----+ +|hda6 || 26918912 | 52576 | 25 || 10485760 | 20480 | 10| ++-------++--------------+---------+-----++--------------+---------+----+ +|hda7 || 37404672 | 73056 | 35 || 13959168 | 27264 | 13| ++-------++--------------+---------+-----++--------------+---------+----+ The LDM Database may not store the partitions in the order that they appear on disk, but the driver will sort them. -When Linux boots, you will see something like: +When Linux boots, you will see something like:: hda: 102400 sectors w/32KiB Cache, CHS=50/64/32 hda: [LDM] hda1 hda2 hda3 hda4 hda5 hda6 hda7 @@ -65,13 +77,13 @@ Compiling LDM Support To enable LDM, choose the following two options: - "Advanced partition selection" CONFIG_PARTITION_ADVANCED - "Windows Logical Disk Manager (Dynamic Disk) support" CONFIG_LDM_PARTITION + - "Advanced partition selection" CONFIG_PARTITION_ADVANCED + - "Windows Logical Disk Manager (Dynamic Disk) support" CONFIG_LDM_PARTITION If you believe the driver isn't working as it should, you can enable the extra debugging code. This will produce a LOT of output. The option is: - "Windows LDM extra logging" CONFIG_LDM_DEBUG + - "Windows LDM extra logging" CONFIG_LDM_DEBUG N.B. The partition code cannot be compiled as a module. diff --git a/Documentation/lockup-watchdogs.txt b/Documentation/lockup-watchdogs.txt index c8b8378513d6..290840c160af 100644 --- a/Documentation/lockup-watchdogs.txt +++ b/Documentation/lockup-watchdogs.txt @@ -30,7 +30,8 @@ timeout is set through the confusingly named "kernel.panic" sysctl), to cause the system to reboot automatically after a specified amount of time. -=== Implementation === +Implementation +============== The soft and hard lockup detectors are built on top of the hrtimer and perf subsystems, respectively. A direct consequence of this is that, diff --git a/Documentation/lsm.txt b/Documentation/lsm.txt new file mode 100644 index 000000000000..ad4dfd020e0d --- /dev/null +++ b/Documentation/lsm.txt @@ -0,0 +1,201 @@ +======================================================== +Linux Security Modules: General Security Hooks for Linux +======================================================== + +:Author: Stephen Smalley +:Author: Timothy Fraser +:Author: Chris Vance + +.. note:: + + The APIs described in this book are outdated. + +Introduction +============ + +In March 2001, the National Security Agency (NSA) gave a presentation +about Security-Enhanced Linux (SELinux) at the 2.5 Linux Kernel Summit. +SELinux is an implementation of flexible and fine-grained +nondiscretionary access controls in the Linux kernel, originally +implemented as its own particular kernel patch. Several other security +projects (e.g. RSBAC, Medusa) have also developed flexible access +control architectures for the Linux kernel, and various projects have +developed particular access control models for Linux (e.g. LIDS, DTE, +SubDomain). Each project has developed and maintained its own kernel +patch to support its security needs. + +In response to the NSA presentation, Linus Torvalds made a set of +remarks that described a security framework he would be willing to +consider for inclusion in the mainstream Linux kernel. He described a +general framework that would provide a set of security hooks to control +operations on kernel objects and a set of opaque security fields in +kernel data structures for maintaining security attributes. This +framework could then be used by loadable kernel modules to implement any +desired model of security. Linus also suggested the possibility of +migrating the Linux capabilities code into such a module. + +The Linux Security Modules (LSM) project was started by WireX to develop +such a framework. LSM is a joint development effort by several security +projects, including Immunix, SELinux, SGI and Janus, and several +individuals, including Greg Kroah-Hartman and James Morris, to develop a +Linux kernel patch that implements this framework. The patch is +currently tracking the 2.4 series and is targeted for integration into +the 2.5 development series. This technical report provides an overview +of the framework and the example capabilities security module provided +by the LSM kernel patch. + +LSM Framework +============= + +The LSM kernel patch provides a general kernel framework to support +security modules. In particular, the LSM framework is primarily focused +on supporting access control modules, although future development is +likely to address other security needs such as auditing. By itself, the +framework does not provide any additional security; it merely provides +the infrastructure to support security modules. The LSM kernel patch +also moves most of the capabilities logic into an optional security +module, with the system defaulting to the traditional superuser logic. +This capabilities module is discussed further in +`LSM Capabilities Module <#cap>`__. + +The LSM kernel patch adds security fields to kernel data structures and +inserts calls to hook functions at critical points in the kernel code to +manage the security fields and to perform access control. It also adds +functions for registering and unregistering security modules, and adds a +general :c:func:`security()` system call to support new system calls +for security-aware applications. + +The LSM security fields are simply ``void*`` pointers. For process and +program execution security information, security fields were added to +:c:type:`struct task_struct <task_struct>` and +:c:type:`struct linux_binprm <linux_binprm>`. For filesystem +security information, a security field was added to :c:type:`struct +super_block <super_block>`. For pipe, file, and socket security +information, security fields were added to :c:type:`struct inode +<inode>` and :c:type:`struct file <file>`. For packet and +network device security information, security fields were added to +:c:type:`struct sk_buff <sk_buff>` and :c:type:`struct +net_device <net_device>`. For System V IPC security information, +security fields were added to :c:type:`struct kern_ipc_perm +<kern_ipc_perm>` and :c:type:`struct msg_msg +<msg_msg>`; additionally, the definitions for :c:type:`struct +msg_msg <msg_msg>`, struct msg_queue, and struct shmid_kernel +were moved to header files (``include/linux/msg.h`` and +``include/linux/shm.h`` as appropriate) to allow the security modules to +use these definitions. + +Each LSM hook is a function pointer in a global table, security_ops. +This table is a :c:type:`struct security_operations +<security_operations>` structure as defined by +``include/linux/security.h``. Detailed documentation for each hook is +included in this header file. At present, this structure consists of a +collection of substructures that group related hooks based on the kernel +object (e.g. task, inode, file, sk_buff, etc) as well as some top-level +hook function pointers for system operations. This structure is likely +to be flattened in the future for performance. The placement of the hook +calls in the kernel code is described by the "called:" lines in the +per-hook documentation in the header file. The hook calls can also be +easily found in the kernel code by looking for the string +"security_ops->". + +Linus mentioned per-process security hooks in his original remarks as a +possible alternative to global security hooks. However, if LSM were to +start from the perspective of per-process hooks, then the base framework +would have to deal with how to handle operations that involve multiple +processes (e.g. kill), since each process might have its own hook for +controlling the operation. This would require a general mechanism for +composing hooks in the base framework. Additionally, LSM would still +need global hooks for operations that have no process context (e.g. +network input operations). Consequently, LSM provides global security +hooks, but a security module is free to implement per-process hooks +(where that makes sense) by storing a security_ops table in each +process' security field and then invoking these per-process hooks from +the global hooks. The problem of composition is thus deferred to the +module. + +The global security_ops table is initialized to a set of hook functions +provided by a dummy security module that provides traditional superuser +logic. A :c:func:`register_security()` function (in +``security/security.c``) is provided to allow a security module to set +security_ops to refer to its own hook functions, and an +:c:func:`unregister_security()` function is provided to revert +security_ops to the dummy module hooks. This mechanism is used to set +the primary security module, which is responsible for making the final +decision for each hook. + +LSM also provides a simple mechanism for stacking additional security +modules with the primary security module. It defines +:c:func:`register_security()` and +:c:func:`unregister_security()` hooks in the :c:type:`struct +security_operations <security_operations>` structure and +provides :c:func:`mod_reg_security()` and +:c:func:`mod_unreg_security()` functions that invoke these hooks +after performing some sanity checking. A security module can call these +functions in order to stack with other modules. However, the actual +details of how this stacking is handled are deferred to the module, +which can implement these hooks in any way it wishes (including always +returning an error if it does not wish to support stacking). In this +manner, LSM again defers the problem of composition to the module. + +Although the LSM hooks are organized into substructures based on kernel +object, all of the hooks can be viewed as falling into two major +categories: hooks that are used to manage the security fields and hooks +that are used to perform access control. Examples of the first category +of hooks include the :c:func:`alloc_security()` and +:c:func:`free_security()` hooks defined for each kernel data +structure that has a security field. These hooks are used to allocate +and free security structures for kernel objects. The first category of +hooks also includes hooks that set information in the security field +after allocation, such as the :c:func:`post_lookup()` hook in +:c:type:`struct inode_security_ops <inode_security_ops>`. +This hook is used to set security information for inodes after +successful lookup operations. An example of the second category of hooks +is the :c:func:`permission()` hook in :c:type:`struct +inode_security_ops <inode_security_ops>`. This hook checks +permission when accessing an inode. + +LSM Capabilities Module +======================= + +The LSM kernel patch moves most of the existing POSIX.1e capabilities +logic into an optional security module stored in the file +``security/capability.c``. This change allows users who do not want to +use capabilities to omit this code entirely from their kernel, instead +using the dummy module for traditional superuser logic or any other +module that they desire. This change also allows the developers of the +capabilities logic to maintain and enhance their code more freely, +without needing to integrate patches back into the base kernel. + +In addition to moving the capabilities logic, the LSM kernel patch could +move the capability-related fields from the kernel data structures into +the new security fields managed by the security modules. However, at +present, the LSM kernel patch leaves the capability fields in the kernel +data structures. In his original remarks, Linus suggested that this +might be preferable so that other security modules can be easily stacked +with the capabilities module without needing to chain multiple security +structures on the security field. It also avoids imposing extra overhead +on the capabilities module to manage the security fields. However, the +LSM framework could certainly support such a move if it is determined to +be desirable, with only a few additional changes described below. + +At present, the capabilities logic for computing process capabilities on +:c:func:`execve()` and :c:func:`set\*uid()`, checking +capabilities for a particular process, saving and checking capabilities +for netlink messages, and handling the :c:func:`capget()` and +:c:func:`capset()` system calls have been moved into the +capabilities module. There are still a few locations in the base kernel +where capability-related fields are directly examined or modified, but +the current version of the LSM patch does allow a security module to +completely replace the assignment and testing of capabilities. These few +locations would need to be changed if the capability-related fields were +moved into the security field. The following is a list of known +locations that still perform such direct examination or modification of +capability-related fields: + +- ``fs/open.c``::c:func:`sys_access()` + +- ``fs/lockd/host.c``::c:func:`nlm_bind_host()` + +- ``fs/nfsd/auth.c``::c:func:`nfsd_setuser()` + +- ``fs/proc/array.c``::c:func:`task_cap()` diff --git a/Documentation/lzo.txt b/Documentation/lzo.txt index 285c54f66779..6fa6a93d0949 100644 --- a/Documentation/lzo.txt +++ b/Documentation/lzo.txt @@ -1,8 +1,9 @@ - +=========================================================== LZO stream format as understood by Linux's LZO decompressor =========================================================== Introduction +============ This is not a specification. No specification seems to be publicly available for the LZO stream format. This document describes what input format the LZO @@ -14,12 +15,13 @@ Introduction for future bug reports. Description +=========== The stream is composed of a series of instructions, operands, and data. The instructions consist in a few bits representing an opcode, and bits forming the operands for the instruction, whose size and position depend on the opcode and on the number of literals copied by previous instruction. The - operands are used to indicate : + operands are used to indicate: - a distance when copying data from the dictionary (past output buffer) - a length (number of bytes to copy from dictionary) @@ -38,7 +40,7 @@ Description of bits in the operand. If the number of bits isn't enough to represent the length, up to 255 may be added in increments by consuming more bytes with a rate of at most 255 per extra byte (thus the compression ratio cannot exceed - around 255:1). The variable length encoding using #bits is always the same : + around 255:1). The variable length encoding using #bits is always the same:: length = byte & ((1 << #bits) - 1) if (!length) { @@ -67,15 +69,19 @@ Description instruction may encode this distance (0001HLLL), it takes one LE16 operand for the distance, thus requiring 3 bytes. - IMPORTANT NOTE : in the code some length checks are missing because certain - instructions are called under the assumption that a certain number of bytes - follow because it has already been guaranteed before parsing the instructions. - They just have to "refill" this credit if they consume extra bytes. This is - an implementation design choice independent on the algorithm or encoding. + .. important:: + + In the code some length checks are missing because certain instructions + are called under the assumption that a certain number of bytes follow + because it has already been guaranteed before parsing the instructions. + They just have to "refill" this credit if they consume extra bytes. This + is an implementation design choice independent on the algorithm or + encoding. Byte sequences +============== - First byte encoding : + First byte encoding:: 0..17 : follow regular instruction encoding, see below. It is worth noting that codes 16 and 17 will represent a block copy from @@ -91,7 +97,7 @@ Byte sequences state = 4 [ don't copy extra literals ] skip byte - Instruction encoding : + Instruction encoding:: 0 0 0 0 X X X X (0..15) Depends on the number of literals copied by the last instruction. @@ -156,6 +162,7 @@ Byte sequences distance = (H << 3) + D + 1 Authors +======= This document was written by Willy Tarreau <w@1wt.eu> on 2014/07/19 during an analysis of the decompression code available in Linux 3.16-rc5. The code is diff --git a/Documentation/mailbox.txt b/Documentation/mailbox.txt index 7ed371c85204..0ed95009cc30 100644 --- a/Documentation/mailbox.txt +++ b/Documentation/mailbox.txt @@ -1,7 +1,10 @@ - The Common Mailbox Framework - Jassi Brar <jaswinder.singh@linaro.org> +============================ +The Common Mailbox Framework +============================ - This document aims to help developers write client and controller +:Author: Jassi Brar <jaswinder.singh@linaro.org> + +This document aims to help developers write client and controller drivers for the API. But before we start, let us note that the client (especially) and controller drivers are likely going to be very platform specific because the remote firmware is likely to be @@ -13,14 +16,17 @@ similar copies of code written for each platform. Having said that, nothing prevents the remote f/w to also be Linux based and use the same api there. However none of that helps us locally because we only ever deal at client's protocol level. - Some of the choices made during implementation are the result of this + +Some of the choices made during implementation are the result of this peculiarity of this "common" framework. - Part 1 - Controller Driver (See include/linux/mailbox_controller.h) +Controller Driver (See include/linux/mailbox_controller.h) +========================================================== + - Allocate mbox_controller and the array of mbox_chan. +Allocate mbox_controller and the array of mbox_chan. Populate mbox_chan_ops, except peek_data() all are mandatory. The controller driver might know a message has been consumed by the remote by getting an IRQ or polling some hardware flag @@ -30,91 +36,94 @@ the controller driver should set via 'txdone_irq' or 'txdone_poll' or neither. - Part 2 - Client Driver (See include/linux/mailbox_client.h) +Client Driver (See include/linux/mailbox_client.h) +================================================== - The client might want to operate in blocking mode (synchronously + +The client might want to operate in blocking mode (synchronously send a message through before returning) or non-blocking/async mode (submit a message and a callback function to the API and return immediately). - -struct demo_client { - struct mbox_client cl; - struct mbox_chan *mbox; - struct completion c; - bool async; - /* ... */ -}; - -/* - * This is the handler for data received from remote. The behaviour is purely - * dependent upon the protocol. This is just an example. - */ -static void message_from_remote(struct mbox_client *cl, void *mssg) -{ - struct demo_client *dc = container_of(cl, struct demo_client, cl); - if (dc->async) { - if (is_an_ack(mssg)) { - /* An ACK to our last sample sent */ - return; /* Or do something else here */ - } else { /* A new message from remote */ - queue_req(mssg); +:: + + struct demo_client { + struct mbox_client cl; + struct mbox_chan *mbox; + struct completion c; + bool async; + /* ... */ + }; + + /* + * This is the handler for data received from remote. The behaviour is purely + * dependent upon the protocol. This is just an example. + */ + static void message_from_remote(struct mbox_client *cl, void *mssg) + { + struct demo_client *dc = container_of(cl, struct demo_client, cl); + if (dc->async) { + if (is_an_ack(mssg)) { + /* An ACK to our last sample sent */ + return; /* Or do something else here */ + } else { /* A new message from remote */ + queue_req(mssg); + } + } else { + /* Remote f/w sends only ACK packets on this channel */ + return; } - } else { - /* Remote f/w sends only ACK packets on this channel */ - return; } -} - -static void sample_sent(struct mbox_client *cl, void *mssg, int r) -{ - struct demo_client *dc = container_of(cl, struct demo_client, cl); - complete(&dc->c); -} - -static void client_demo(struct platform_device *pdev) -{ - struct demo_client *dc_sync, *dc_async; - /* The controller already knows async_pkt and sync_pkt */ - struct async_pkt ap; - struct sync_pkt sp; - - dc_sync = kzalloc(sizeof(*dc_sync), GFP_KERNEL); - dc_async = kzalloc(sizeof(*dc_async), GFP_KERNEL); - - /* Populate non-blocking mode client */ - dc_async->cl.dev = &pdev->dev; - dc_async->cl.rx_callback = message_from_remote; - dc_async->cl.tx_done = sample_sent; - dc_async->cl.tx_block = false; - dc_async->cl.tx_tout = 0; /* doesn't matter here */ - dc_async->cl.knows_txdone = false; /* depending upon protocol */ - dc_async->async = true; - init_completion(&dc_async->c); - - /* Populate blocking mode client */ - dc_sync->cl.dev = &pdev->dev; - dc_sync->cl.rx_callback = message_from_remote; - dc_sync->cl.tx_done = NULL; /* operate in blocking mode */ - dc_sync->cl.tx_block = true; - dc_sync->cl.tx_tout = 500; /* by half a second */ - dc_sync->cl.knows_txdone = false; /* depending upon protocol */ - dc_sync->async = false; - - /* ASync mailbox is listed second in 'mboxes' property */ - dc_async->mbox = mbox_request_channel(&dc_async->cl, 1); - /* Populate data packet */ - /* ap.xxx = 123; etc */ - /* Send async message to remote */ - mbox_send_message(dc_async->mbox, &ap); - - /* Sync mailbox is listed first in 'mboxes' property */ - dc_sync->mbox = mbox_request_channel(&dc_sync->cl, 0); - /* Populate data packet */ - /* sp.abc = 123; etc */ - /* Send message to remote in blocking mode */ - mbox_send_message(dc_sync->mbox, &sp); - /* At this point 'sp' has been sent */ - - /* Now wait for async chan to be done */ - wait_for_completion(&dc_async->c); -} + + static void sample_sent(struct mbox_client *cl, void *mssg, int r) + { + struct demo_client *dc = container_of(cl, struct demo_client, cl); + complete(&dc->c); + } + + static void client_demo(struct platform_device *pdev) + { + struct demo_client *dc_sync, *dc_async; + /* The controller already knows async_pkt and sync_pkt */ + struct async_pkt ap; + struct sync_pkt sp; + + dc_sync = kzalloc(sizeof(*dc_sync), GFP_KERNEL); + dc_async = kzalloc(sizeof(*dc_async), GFP_KERNEL); + + /* Populate non-blocking mode client */ + dc_async->cl.dev = &pdev->dev; + dc_async->cl.rx_callback = message_from_remote; + dc_async->cl.tx_done = sample_sent; + dc_async->cl.tx_block = false; + dc_async->cl.tx_tout = 0; /* doesn't matter here */ + dc_async->cl.knows_txdone = false; /* depending upon protocol */ + dc_async->async = true; + init_completion(&dc_async->c); + + /* Populate blocking mode client */ + dc_sync->cl.dev = &pdev->dev; + dc_sync->cl.rx_callback = message_from_remote; + dc_sync->cl.tx_done = NULL; /* operate in blocking mode */ + dc_sync->cl.tx_block = true; + dc_sync->cl.tx_tout = 500; /* by half a second */ + dc_sync->cl.knows_txdone = false; /* depending upon protocol */ + dc_sync->async = false; + + /* ASync mailbox is listed second in 'mboxes' property */ + dc_async->mbox = mbox_request_channel(&dc_async->cl, 1); + /* Populate data packet */ + /* ap.xxx = 123; etc */ + /* Send async message to remote */ + mbox_send_message(dc_async->mbox, &ap); + + /* Sync mailbox is listed first in 'mboxes' property */ + dc_sync->mbox = mbox_request_channel(&dc_sync->cl, 0); + /* Populate data packet */ + /* sp.abc = 123; etc */ + /* Send message to remote in blocking mode */ + mbox_send_message(dc_sync->mbox, &sp); + /* At this point 'sp' has been sent */ + + /* Now wait for async chan to be done */ + wait_for_completion(&dc_async->c); + } diff --git a/Documentation/media/uapi/v4l/vidioc-g-selection.rst b/Documentation/media/uapi/v4l/vidioc-g-selection.rst index 09923b86ec5d..c1ee86472918 100644 --- a/Documentation/media/uapi/v4l/vidioc-g-selection.rst +++ b/Documentation/media/uapi/v4l/vidioc-g-selection.rst @@ -121,8 +121,8 @@ Selection targets and flags are documented in .. _sel-const-adjust: -.. figure:: constraints.* - :alt: constraints.pdf / constraints.svg +.. kernel-figure:: constraints.svg + :alt: constraints.svg :align: center Size adjustments with constraint flags. diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 732f10ea382e..c4ddfcd5ee32 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -27,7 +27,7 @@ The purpose of this document is twofold: (2) to provide a guide as to how to use the barriers that are available. Note that an architecture can provide more than the minimum requirement -for any particular barrier, but if the architecure provides less than +for any particular barrier, but if the architecture provides less than that, that architecture is incorrect. Note also that it is possible that a barrier may be a no-op for an @@ -498,11 +498,11 @@ And a couple of implicit varieties: This means that ACQUIRE acts as a minimal "acquire" operation and RELEASE acts as a minimal "release" operation. -A subset of the atomic operations described in atomic_ops.txt have ACQUIRE -and RELEASE variants in addition to fully-ordered and relaxed (no barrier -semantics) definitions. For compound atomics performing both a load and a -store, ACQUIRE semantics apply only to the load and RELEASE semantics apply -only to the store portion of the operation. +A subset of the atomic operations described in core-api/atomic_ops.rst have +ACQUIRE and RELEASE variants in addition to fully-ordered and relaxed (no +barrier semantics) definitions. For compound atomics performing both a load +and a store, ACQUIRE semantics apply only to the load and RELEASE semantics +apply only to the store portion of the operation. Memory barriers are only required where there's a possibility of interaction between two CPUs or between a CPU and a device. If it can be guaranteed that @@ -1876,8 +1876,8 @@ There are some more advanced barrier functions: This makes sure that the death mark on the object is perceived to be set *before* the reference counter is decremented. - See Documentation/atomic_ops.txt for more information. See the "Atomic - operations" subsection for information on where to use these. + See Documentation/core-api/atomic_ops.rst for more information. See the + "Atomic operations" subsection for information on where to use these. (*) lockless_dereference(); @@ -2584,7 +2584,7 @@ situations because on some CPUs the atomic instructions used imply full memory barriers, and so barrier instructions are superfluous in conjunction with them, and in such cases the special barrier primitives will be no-ops. -See Documentation/atomic_ops.txt for more information. +See Documentation/core-api/atomic_ops.rst for more information. ACCESSING DEVICES diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt index 670f3ded0802..7f49ebf3ddb2 100644 --- a/Documentation/memory-hotplug.txt +++ b/Documentation/memory-hotplug.txt @@ -2,43 +2,48 @@ Memory Hotplug ============== -Created: Jul 28 2007 -Add description of notifier of memory hotplug Oct 11 2007 +:Created: Jul 28 2007 +:Updated: Add description of notifier of memory hotplug: Oct 11 2007 This document is about memory hotplug including how-to-use and current status. Because Memory Hotplug is still under development, contents of this text will be changed often. -1. Introduction - 1.1 purpose of memory hotplug - 1.2. Phases of memory hotplug - 1.3. Unit of Memory online/offline operation -2. Kernel Configuration -3. sysfs files for memory hotplug -4. Physical memory hot-add phase - 4.1 Hardware(Firmware) Support - 4.2 Notify memory hot-add event by hand -5. Logical Memory hot-add phase - 5.1. State of memory - 5.2. How to online memory -6. Logical memory remove - 6.1 Memory offline and ZONE_MOVABLE - 6.2. How to offline memory -7. Physical memory remove -8. Memory hotplug event notifier -9. Future Work List - -Note(1): x86_64's has special implementation for memory hotplug. - This text does not describe it. -Note(2): This text assumes that sysfs is mounted at /sys. +.. CONTENTS + 1. Introduction + 1.1 purpose of memory hotplug + 1.2. Phases of memory hotplug + 1.3. Unit of Memory online/offline operation + 2. Kernel Configuration + 3. sysfs files for memory hotplug + 4. Physical memory hot-add phase + 4.1 Hardware(Firmware) Support + 4.2 Notify memory hot-add event by hand + 5. Logical Memory hot-add phase + 5.1. State of memory + 5.2. How to online memory + 6. Logical memory remove + 6.1 Memory offline and ZONE_MOVABLE + 6.2. How to offline memory + 7. Physical memory remove + 8. Memory hotplug event notifier + 9. Future Work List ---------------- -1. Introduction ---------------- -1.1 purpose of memory hotplug ------------- +.. note:: + + (1) x86_64's has special implementation for memory hotplug. + This text does not describe it. + (2) This text assumes that sysfs is mounted at /sys. + + +Introduction +============ + +purpose of memory hotplug +------------------------- + Memory Hotplug allows users to increase/decrease the amount of memory. Generally, there are two purposes. @@ -53,9 +58,11 @@ hardware which supports memory power management. Linux memory hotplug is designed for both purpose. -1.2. Phases of memory hotplug ---------------- -There are 2 phases in Memory Hotplug. +Phases of memory hotplug +------------------------ + +There are 2 phases in Memory Hotplug: + 1) Physical Memory Hotplug phase 2) Logical Memory Hotplug phase. @@ -70,7 +77,7 @@ management tables, and makes sysfs files for new memory's operation. If firmware supports notification of connection of new memory to OS, this phase is triggered automatically. ACPI can notify this event. If not, "probe" operation by system administration is used instead. -(see Section 4.). +(see :ref:`memory_hotplug_physical_mem`). Logical Memory Hotplug phase is to change memory state into available/unavailable for users. Amount of memory from user's view is @@ -83,11 +90,12 @@ Logical Memory Hotplug phase is triggered by write of sysfs file by system administrator. For the hot-add case, it must be executed after Physical Hotplug phase by hand. (However, if you writes udev's hotplug scripts for memory hotplug, these - phases can be execute in seamless way.) +phases can be execute in seamless way.) -1.3. Unit of Memory online/offline operation ------------- +Unit of Memory online/offline operation +--------------------------------------- + Memory hotplug uses SPARSEMEM memory model which allows memory to be divided into chunks of the same size. These chunks are called "sections". The size of a memory section is architecture dependent. For example, power uses 16MiB, ia64 @@ -97,46 +105,50 @@ Memory sections are combined into chunks referred to as "memory blocks". The size of a memory block is architecture dependent and represents the logical unit upon which memory online/offline operations are to be performed. The default size of a memory block is the same as memory section size unless an -architecture specifies otherwise. (see Section 3.) +architecture specifies otherwise. (see :ref:`memory_hotplug_sysfs_files`.) To determine the size (in bytes) of a memory block please read this file: /sys/devices/system/memory/block_size_bytes ------------------------ -2. Kernel Configuration ------------------------ +Kernel Configuration +==================== + To use memory hotplug feature, kernel must be compiled with following config options. -- For all memory hotplug - Memory model -> Sparse Memory (CONFIG_SPARSEMEM) - Allow for memory hot-add (CONFIG_MEMORY_HOTPLUG) +- For all memory hotplug: + - Memory model -> Sparse Memory (CONFIG_SPARSEMEM) + - Allow for memory hot-add (CONFIG_MEMORY_HOTPLUG) -- To enable memory removal, the following are also necessary - Allow for memory hot remove (CONFIG_MEMORY_HOTREMOVE) - Page Migration (CONFIG_MIGRATION) +- To enable memory removal, the following are also necessary: + - Allow for memory hot remove (CONFIG_MEMORY_HOTREMOVE) + - Page Migration (CONFIG_MIGRATION) -- For ACPI memory hotplug, the following are also necessary - Memory hotplug (under ACPI Support menu) (CONFIG_ACPI_HOTPLUG_MEMORY) - This option can be kernel module. +- For ACPI memory hotplug, the following are also necessary: + - Memory hotplug (under ACPI Support menu) (CONFIG_ACPI_HOTPLUG_MEMORY) + - This option can be kernel module. - As a related configuration, if your box has a feature of NUMA-node hotplug via ACPI, then this option is necessary too. - ACPI0004,PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu) - (CONFIG_ACPI_CONTAINER). - This option can be kernel module too. + - ACPI0004,PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu) + (CONFIG_ACPI_CONTAINER). + + This option can be kernel module too. + + +.. _memory_hotplug_sysfs_files: + +sysfs files for memory hotplug +============================== --------------------------------- -3 sysfs files for memory hotplug --------------------------------- All memory blocks have their device information in sysfs. Each memory block -is described under /sys/devices/system/memory as +is described under /sys/devices/system/memory as: -/sys/devices/system/memory/memoryXXX -(XXX is the memory block id.) + /sys/devices/system/memory/memoryXXX + (XXX is the memory block id.) For the memory block covered by the sysfs directory. It is expected that all memory sections in this range are present and no memory holes exist in the @@ -145,43 +157,53 @@ the existence of one should not affect the hotplug capabilities of the memory block. For example, assume 1GiB memory block size. A device for a memory starting at -0x100000000 is /sys/device/system/memory/memory4 -(0x100000000 / 1Gib = 4) +0x100000000 is /sys/device/system/memory/memory4:: + + (0x100000000 / 1Gib = 4) + This device covers address range [0x100000000 ... 0x140000000) Under each memory block, you can see 5 files: -/sys/devices/system/memory/memoryXXX/phys_index -/sys/devices/system/memory/memoryXXX/phys_device -/sys/devices/system/memory/memoryXXX/state -/sys/devices/system/memory/memoryXXX/removable -/sys/devices/system/memory/memoryXXX/valid_zones +- /sys/devices/system/memory/memoryXXX/phys_index +- /sys/devices/system/memory/memoryXXX/phys_device +- /sys/devices/system/memory/memoryXXX/state +- /sys/devices/system/memory/memoryXXX/removable +- /sys/devices/system/memory/memoryXXX/valid_zones + +=================== ============================================================ +``phys_index`` read-only and contains memory block id, same as XXX. +``state`` read-write + + - at read: contains online/offline state of memory. + - at write: user can specify "online_kernel", -'phys_index' : read-only and contains memory block id, same as XXX. -'state' : read-write - at read: contains online/offline state of memory. - at write: user can specify "online_kernel", "online_movable", "online", "offline" command which will be performed on all sections in the block. -'phys_device' : read-only: designed to show the name of physical memory +``phys_device`` read-only: designed to show the name of physical memory device. This is not well implemented now. -'removable' : read-only: contains an integer value indicating +``removable`` read-only: contains an integer value indicating whether the memory block is removable or not removable. A value of 1 indicates that the memory block is removable and a value of 0 indicates that it is not removable. A memory block is removable only if every section in the block is removable. -'valid_zones' : read-only: designed to show which zones this memory block +``valid_zones`` read-only: designed to show which zones this memory block can be onlined to. - The first column shows it's default zone. + + The first column shows it`s default zone. + "memory6/valid_zones: Normal Movable" shows this memoryblock can be onlined to ZONE_NORMAL by default and to ZONE_MOVABLE by online_movable. + "memory7/valid_zones: Movable Normal" shows this memoryblock can be onlined to ZONE_MOVABLE by default and to ZONE_NORMAL by online_kernel. +=================== ============================================================ + +.. note:: -NOTE: These directories/files appear after physical memory hotplug phase. If CONFIG_NUMA is enabled the memoryXXX/ directories can also be accessed @@ -193,13 +215,14 @@ For example: A backlink will also be created: /sys/devices/system/memory/memory9/node0 -> ../../node/node0 +.. _memory_hotplug_physical_mem: --------------------------------- -4. Physical memory hot-add phase --------------------------------- +Physical memory hot-add phase +============================= + +Hardware(Firmware) Support +-------------------------- -4.1 Hardware(Firmware) Support ------------- On x86_64/ia64 platform, memory hotplug by ACPI is supported. In general, the firmware (ACPI) which supports memory hotplug defines @@ -209,7 +232,8 @@ script. This will be done automatically. But scripts for memory hotplug are not contained in generic udev package(now). You may have to write it by yourself or online/offline memory by hand. -Please see "How to online memory", "How to offline memory" in this text. +Please see :ref:`memory_hotplug_how_to_online_memory` and +:ref:`memory_hotplug_how_to_offline_memory`. If firmware supports NUMA-node hotplug, and defines an object _HID "ACPI0004", "PNP0A05", or "PNP0A06", notification is asserted to it, and ACPI handler @@ -217,8 +241,9 @@ calls hotplug code for all of objects which are defined in it. If memory device is found, memory hotplug code will be called. -4.2 Notify memory hot-add event by hand ------------- +Notify memory hot-add event by hand +----------------------------------- + On some architectures, the firmware may not notify the kernel of a memory hotplug event. Therefore, the memory "probe" interface is supported to explicitly notify the kernel. This interface depends on @@ -229,45 +254,48 @@ notification. Probe interface is located at /sys/devices/system/memory/probe -You can tell the physical address of new memory to the kernel by +You can tell the physical address of new memory to the kernel by:: -% echo start_address_of_new_memory > /sys/devices/system/memory/probe + % echo start_address_of_new_memory > /sys/devices/system/memory/probe Then, [start_address_of_new_memory, start_address_of_new_memory + memory_block_size] memory range is hot-added. In this case, hotplug script is not called (in current implementation). You'll have to online memory by -yourself. Please see "How to online memory" in this text. +yourself. Please see :ref:`memory_hotplug_how_to_online_memory`. + +Logical Memory hot-add phase +============================ ------------------------------- -5. Logical Memory hot-add phase ------------------------------- +State of memory +--------------- + +To see (online/offline) state of a memory block, read 'state' file:: + + % cat /sys/device/system/memory/memoryXXX/state -5.1. State of memory ------------- -To see (online/offline) state of a memory block, read 'state' file. -% cat /sys/device/system/memory/memoryXXX/state +- If the memory block is online, you'll read "online". +- If the memory block is offline, you'll read "offline". -If the memory block is online, you'll read "online". -If the memory block is offline, you'll read "offline". +.. _memory_hotplug_how_to_online_memory: +How to online memory +-------------------- -5.2. How to online memory ------------- When the memory is hot-added, the kernel decides whether or not to "online" -it according to the policy which can be read from "auto_online_blocks" file: +it according to the policy which can be read from "auto_online_blocks" file:: -% cat /sys/devices/system/memory/auto_online_blocks + % cat /sys/devices/system/memory/auto_online_blocks The default depends on the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config option. If it is disabled the default is "offline" which means the newly added memory is not in a ready-to-use state and you have to "online" the newly added memory blocks manually. Automatic onlining can be requested by writing "online" -to "auto_online_blocks" file: +to "auto_online_blocks" file:: -% echo online > /sys/devices/system/memory/auto_online_blocks + % echo online > /sys/devices/system/memory/auto_online_blocks This sets a global policy and impacts all memory blocks that will subsequently be hotplugged. Currently offline blocks keep their state. It is possible, under @@ -277,35 +305,43 @@ online. User space tools can check their "state" files If the automatic onlining wasn't requested, failed, or some memory block was offlined it is possible to change the individual block's state by writing to the -"state" file: +"state" file:: -% echo online > /sys/devices/system/memory/memoryXXX/state + % echo online > /sys/devices/system/memory/memoryXXX/state This onlining will not change the ZONE type of the target memory block, -If the memory block is in ZONE_NORMAL, you can change it to ZONE_MOVABLE: +If the memory block doesn't belong to any zone an appropriate kernel zone +(usually ZONE_NORMAL) will be used unless movable_node kernel command line +option is specified when ZONE_MOVABLE will be used. + +You can explicitly request to associate it with ZONE_MOVABLE by:: + + % echo online_movable > /sys/devices/system/memory/memoryXXX/state + +.. note:: current limit: this memory block must be adjacent to ZONE_MOVABLE -% echo online_movable > /sys/devices/system/memory/memoryXXX/state -(NOTE: current limit: this memory block must be adjacent to ZONE_MOVABLE) +Or you can explicitly request a kernel zone (usually ZONE_NORMAL) by:: -And if the memory block is in ZONE_MOVABLE, you can change it to ZONE_NORMAL: + % echo online_kernel > /sys/devices/system/memory/memoryXXX/state -% echo online_kernel > /sys/devices/system/memory/memoryXXX/state -(NOTE: current limit: this memory block must be adjacent to ZONE_NORMAL) +.. note:: current limit: this memory block must be adjacent to ZONE_NORMAL + +An explicit zone onlining can fail (e.g. when the range is already within +and existing and incompatible zone already). After this, memory block XXX's state will be 'online' and the amount of available memory will be increased. -Currently, newly added memory is added as ZONE_NORMAL (for powerpc, ZONE_DMA). This may be changed in future. ------------------------- -6. Logical memory remove ------------------------- +Logical memory remove +===================== + +Memory offline and ZONE_MOVABLE +------------------------------- -6.1 Memory offline and ZONE_MOVABLE ------------- Memory offlining is more complicated than memory online. Because memory offline has to make the whole memory block be unused, memory offline can fail if the memory block includes memory which cannot be freed. @@ -330,24 +366,27 @@ Assume the system has "TOTAL" amount of memory at boot time, this boot option creates ZONE_MOVABLE as following. 1) When kernelcore=YYYY boot option is used, - Size of memory not for movable pages (not for offline) is YYYY. - Size of memory for movable pages (for offline) is TOTAL-YYYY. + Size of memory not for movable pages (not for offline) is YYYY. + Size of memory for movable pages (for offline) is TOTAL-YYYY. 2) When movablecore=ZZZZ boot option is used, - Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ. - Size of memory for movable pages (for offline) is ZZZZ. + Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ. + Size of memory for movable pages (for offline) is ZZZZ. + +.. note:: + Unfortunately, there is no information to show which memory block belongs + to ZONE_MOVABLE. This is TBD. -Note: Unfortunately, there is no information to show which memory block belongs -to ZONE_MOVABLE. This is TBD. +.. _memory_hotplug_how_to_offline_memory: +How to offline memory +--------------------- -6.2. How to offline memory ------------- You can offline a memory block by using the same sysfs interface that was used -in memory onlining. +in memory onlining:: -% echo offline > /sys/devices/system/memory/memoryXXX/state + % echo offline > /sys/devices/system/memory/memoryXXX/state If offline succeeds, the state of the memory block is changed to be "offline". If it fails, some error core (like -EBUSY) will be returned by the kernel. @@ -361,22 +400,22 @@ able to offline it (or not). (For example, a page is referred to by some kernel internal call and released soon.) Consideration: -Memory hotplug's design direction is to make the possibility of memory offlining -higher and to guarantee unplugging memory under any situation. But it needs -more work. Returning -EBUSY under some situation may be good because the user -can decide to retry more or not by himself. Currently, memory offlining code -does some amount of retry with 120 seconds timeout. + Memory hotplug's design direction is to make the possibility of memory + offlining higher and to guarantee unplugging memory under any situation. But + it needs more work. Returning -EBUSY under some situation may be good because + the user can decide to retry more or not by himself. Currently, memory + offlining code does some amount of retry with 120 seconds timeout. + +Physical memory remove +====================== -------------------------- -7. Physical memory remove -------------------------- Need more implementation yet.... - Notification completion of remove works by OS to firmware. - Guard from remove if not yet. --------------------------------- -8. Memory hotplug event notifier --------------------------------- +Memory hotplug event notifier +============================= + Hotplugging events are sent to a notification queue. There are six types of notification defined in include/linux/memory.h: @@ -406,14 +445,14 @@ MEM_CANCEL_OFFLINE MEM_OFFLINE Generated after offlining memory is complete. -A callback routine can be registered by calling +A callback routine can be registered by calling:: hotplug_memory_notifier(callback_func, priority) Callback functions with higher values of priority are called before callback functions with lower values. -A callback function must have the following prototype: +A callback function must have the following prototype:: int callback_func( struct notifier_block *self, unsigned long action, void *arg); @@ -421,27 +460,28 @@ A callback function must have the following prototype: The first argument of the callback function (self) is a pointer to the block of the notifier chain that points to the callback function itself. The second argument (action) is one of the event types described above. -The third argument (arg) passes a pointer of struct memory_notify. - -struct memory_notify { - unsigned long start_pfn; - unsigned long nr_pages; - int status_change_nid_normal; - int status_change_nid_high; - int status_change_nid; -} - -start_pfn is start_pfn of online/offline memory. -nr_pages is # of pages of online/offline memory. -status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask -is (will be) set/clear, if this is -1, then nodemask status is not changed. -status_change_nid_high is set node id when N_HIGH_MEMORY of nodemask -is (will be) set/clear, if this is -1, then nodemask status is not changed. -status_change_nid is set node id when N_MEMORY of nodemask is (will be) -set/clear. It means a new(memoryless) node gets new memory by online and a -node loses all memory. If this is -1, then nodemask status is not changed. -If status_changed_nid* >= 0, callback should create/discard structures for the -node if necessary. +The third argument (arg) passes a pointer of struct memory_notify:: + + struct memory_notify { + unsigned long start_pfn; + unsigned long nr_pages; + int status_change_nid_normal; + int status_change_nid_high; + int status_change_nid; + } + +- start_pfn is start_pfn of online/offline memory. +- nr_pages is # of pages of online/offline memory. +- status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask + is (will be) set/clear, if this is -1, then nodemask status is not changed. +- status_change_nid_high is set node id when N_HIGH_MEMORY of nodemask + is (will be) set/clear, if this is -1, then nodemask status is not changed. +- status_change_nid is set node id when N_MEMORY of nodemask is (will be) + set/clear. It means a new(memoryless) node gets new memory by online and a + node loses all memory. If this is -1, then nodemask status is not changed. + + If status_changed_nid* >= 0, callback should create/discard structures for the + node if necessary. The callback routine shall return one of the values NOTIFY_DONE, NOTIFY_OK, NOTIFY_BAD, NOTIFY_STOP @@ -455,9 +495,9 @@ further processing of the notification queue. NOTIFY_STOP stops further processing of the notification queue. --------------- -9. Future Work --------------- +Future Work +=========== + - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like sysctl or new control file. - showing memory block and physical device relationship. @@ -465,4 +505,3 @@ NOTIFY_STOP stops further processing of the notification queue. - support HugeTLB page migration and offlining. - memmap removing at memory offline. - physical remove memory. - diff --git a/Documentation/men-chameleon-bus.txt b/Documentation/men-chameleon-bus.txt index 30ded732027e..1b1f048aa748 100644 --- a/Documentation/men-chameleon-bus.txt +++ b/Documentation/men-chameleon-bus.txt @@ -1,163 +1,175 @@ - MEN Chameleon Bus - ================= - -Table of Contents ================= -1 Introduction - 1.1 Scope of this Document - 1.2 Limitations of the current implementation -2 Architecture - 2.1 MEN Chameleon Bus - 2.2 Carrier Devices - 2.3 Parser -3 Resource handling - 3.1 Memory Resources - 3.2 IRQs -4 Writing an MCB driver - 4.1 The driver structure - 4.2 Probing and attaching - 4.3 Initializing the driver - - -1 Introduction -=============== - This document describes the architecture and implementation of the MEN - Chameleon Bus (called MCB throughout this document). - -1.1 Scope of this Document ---------------------------- - This document is intended to be a short overview of the current - implementation and does by no means describe the complete possibilities of MCB - based devices. - -1.2 Limitations of the current implementation ----------------------------------------------- - The current implementation is limited to PCI and PCIe based carrier devices - that only use a single memory resource and share the PCI legacy IRQ. Not - implemented are: - - Multi-resource MCB devices like the VME Controller or M-Module carrier. - - MCB devices that need another MCB device, like SRAM for a DMA Controller's - buffer descriptors or a video controller's video memory. - - A per-carrier IRQ domain for carrier devices that have one (or more) IRQs - per MCB device like PCIe based carriers with MSI or MSI-X support. - -2 Architecture -=============== - MCB is divided into 3 functional blocks: - - The MEN Chameleon Bus itself, - - drivers for MCB Carrier Devices and - - the parser for the Chameleon table. - -2.1 MEN Chameleon Bus +MEN Chameleon Bus +================= + +.. Table of Contents + ================= + 1 Introduction + 1.1 Scope of this Document + 1.2 Limitations of the current implementation + 2 Architecture + 2.1 MEN Chameleon Bus + 2.2 Carrier Devices + 2.3 Parser + 3 Resource handling + 3.1 Memory Resources + 3.2 IRQs + 4 Writing an MCB driver + 4.1 The driver structure + 4.2 Probing and attaching + 4.3 Initializing the driver + + +Introduction +============ + +This document describes the architecture and implementation of the MEN +Chameleon Bus (called MCB throughout this document). + +Scope of this Document ---------------------- - The MEN Chameleon Bus is an artificial bus system that attaches to a so - called Chameleon FPGA device found on some hardware produced my MEN Mikro - Elektronik GmbH. These devices are multi-function devices implemented in a - single FPGA and usually attached via some sort of PCI or PCIe link. Each - FPGA contains a header section describing the content of the FPGA. The - header lists the device id, PCI BAR, offset from the beginning of the PCI - BAR, size in the FPGA, interrupt number and some other properties currently - not handled by the MCB implementation. - -2.2 Carrier Devices + +This document is intended to be a short overview of the current +implementation and does by no means describe the complete possibilities of MCB +based devices. + +Limitations of the current implementation +----------------------------------------- + +The current implementation is limited to PCI and PCIe based carrier devices +that only use a single memory resource and share the PCI legacy IRQ. Not +implemented are: + +- Multi-resource MCB devices like the VME Controller or M-Module carrier. +- MCB devices that need another MCB device, like SRAM for a DMA Controller's + buffer descriptors or a video controller's video memory. +- A per-carrier IRQ domain for carrier devices that have one (or more) IRQs + per MCB device like PCIe based carriers with MSI or MSI-X support. + +Architecture +============ + +MCB is divided into 3 functional blocks: + +- The MEN Chameleon Bus itself, +- drivers for MCB Carrier Devices and +- the parser for the Chameleon table. + +MEN Chameleon Bus +----------------- + +The MEN Chameleon Bus is an artificial bus system that attaches to a so +called Chameleon FPGA device found on some hardware produced my MEN Mikro +Elektronik GmbH. These devices are multi-function devices implemented in a +single FPGA and usually attached via some sort of PCI or PCIe link. Each +FPGA contains a header section describing the content of the FPGA. The +header lists the device id, PCI BAR, offset from the beginning of the PCI +BAR, size in the FPGA, interrupt number and some other properties currently +not handled by the MCB implementation. + +Carrier Devices +--------------- + +A carrier device is just an abstraction for the real world physical bus the +Chameleon FPGA is attached to. Some IP Core drivers may need to interact with +properties of the carrier device (like querying the IRQ number of a PCI +device). To provide abstraction from the real hardware bus, an MCB carrier +device provides callback methods to translate the driver's MCB function calls +to hardware related function calls. For example a carrier device may +implement the get_irq() method which can be translated into a hardware bus +query for the IRQ number the device should use. + +Parser +------ + +The parser reads the first 512 bytes of a Chameleon device and parses the +Chameleon table. Currently the parser only supports the Chameleon v2 variant +of the Chameleon table but can easily be adopted to support an older or +possible future variant. While parsing the table's entries new MCB devices +are allocated and their resources are assigned according to the resource +assignment in the Chameleon table. After resource assignment is finished, the +MCB devices are registered at the MCB and thus at the driver core of the +Linux kernel. + +Resource handling +================= + +The current implementation assigns exactly one memory and one IRQ resource +per MCB device. But this is likely going to change in the future. + +Memory Resources +---------------- + +Each MCB device has exactly one memory resource, which can be requested from +the MCB bus. This memory resource is the physical address of the MCB device +inside the carrier and is intended to be passed to ioremap() and friends. It +is already requested from the kernel by calling request_mem_region(). + +IRQs +---- + +Each MCB device has exactly one IRQ resource, which can be requested from the +MCB bus. If a carrier device driver implements the ->get_irq() callback +method, the IRQ number assigned by the carrier device will be returned, +otherwise the IRQ number inside the Chameleon table will be returned. This +number is suitable to be passed to request_irq(). + +Writing an MCB driver +===================== + +The driver structure -------------------- - A carrier device is just an abstraction for the real world physical bus the - Chameleon FPGA is attached to. Some IP Core drivers may need to interact with - properties of the carrier device (like querying the IRQ number of a PCI - device). To provide abstraction from the real hardware bus, an MCB carrier - device provides callback methods to translate the driver's MCB function calls - to hardware related function calls. For example a carrier device may - implement the get_irq() method which can be translated into a hardware bus - query for the IRQ number the device should use. - -2.3 Parser ------------ - The parser reads the first 512 bytes of a Chameleon device and parses the - Chameleon table. Currently the parser only supports the Chameleon v2 variant - of the Chameleon table but can easily be adopted to support an older or - possible future variant. While parsing the table's entries new MCB devices - are allocated and their resources are assigned according to the resource - assignment in the Chameleon table. After resource assignment is finished, the - MCB devices are registered at the MCB and thus at the driver core of the - Linux kernel. - -3 Resource handling -==================== - The current implementation assigns exactly one memory and one IRQ resource - per MCB device. But this is likely going to change in the future. - -3.1 Memory Resources + +Each MCB driver has a structure to identify the device driver as well as +device ids which identify the IP Core inside the FPGA. The driver structure +also contains callback methods which get executed on driver probe and +removal from the system:: + + static const struct mcb_device_id foo_ids[] = { + { .device = 0x123 }, + { } + }; + MODULE_DEVICE_TABLE(mcb, foo_ids); + + static struct mcb_driver foo_driver = { + driver = { + .name = "foo-bar", + .owner = THIS_MODULE, + }, + .probe = foo_probe, + .remove = foo_remove, + .id_table = foo_ids, + }; + +Probing and attaching --------------------- - Each MCB device has exactly one memory resource, which can be requested from - the MCB bus. This memory resource is the physical address of the MCB device - inside the carrier and is intended to be passed to ioremap() and friends. It - is already requested from the kernel by calling request_mem_region(). - -3.2 IRQs ---------- - Each MCB device has exactly one IRQ resource, which can be requested from the - MCB bus. If a carrier device driver implements the ->get_irq() callback - method, the IRQ number assigned by the carrier device will be returned, - otherwise the IRQ number inside the Chameleon table will be returned. This - number is suitable to be passed to request_irq(). - -4 Writing an MCB driver -======================= - -4.1 The driver structure -------------------------- - Each MCB driver has a structure to identify the device driver as well as - device ids which identify the IP Core inside the FPGA. The driver structure - also contains callback methods which get executed on driver probe and - removal from the system. - - - static const struct mcb_device_id foo_ids[] = { - { .device = 0x123 }, - { } - }; - MODULE_DEVICE_TABLE(mcb, foo_ids); - - static struct mcb_driver foo_driver = { - driver = { - .name = "foo-bar", - .owner = THIS_MODULE, - }, - .probe = foo_probe, - .remove = foo_remove, - .id_table = foo_ids, - }; - -4.2 Probing and attaching --------------------------- - When a driver is loaded and the MCB devices it services are found, the MCB - core will call the driver's probe callback method. When the driver is removed - from the system, the MCB core will call the driver's remove callback method. - - - static init foo_probe(struct mcb_device *mdev, const struct mcb_device_id *id); - static void foo_remove(struct mcb_device *mdev); - -4.3 Initializing the driver ----------------------------- - When the kernel is booted or your foo driver module is inserted, you have to - perform driver initialization. Usually it is enough to register your driver - module at the MCB core. - - - static int __init foo_init(void) - { - return mcb_register_driver(&foo_driver); - } - module_init(foo_init); - - static void __exit foo_exit(void) - { - mcb_unregister_driver(&foo_driver); - } - module_exit(foo_exit); - - The module_mcb_driver() macro can be used to reduce the above code. - - - module_mcb_driver(foo_driver); + +When a driver is loaded and the MCB devices it services are found, the MCB +core will call the driver's probe callback method. When the driver is removed +from the system, the MCB core will call the driver's remove callback method:: + + static init foo_probe(struct mcb_device *mdev, const struct mcb_device_id *id); + static void foo_remove(struct mcb_device *mdev); + +Initializing the driver +----------------------- + +When the kernel is booted or your foo driver module is inserted, you have to +perform driver initialization. Usually it is enough to register your driver +module at the MCB core:: + + static int __init foo_init(void) + { + return mcb_register_driver(&foo_driver); + } + module_init(foo_init); + + static void __exit foo_exit(void) + { + mcb_unregister_driver(&foo_driver); + } + module_exit(foo_exit); + +The module_mcb_driver() macro can be used to reduce the above code:: + + module_mcb_driver(foo_driver); diff --git a/Documentation/networking/checksum-offloads.txt b/Documentation/networking/checksum-offloads.txt index 56e36861245f..d52d191bbb0c 100644 --- a/Documentation/networking/checksum-offloads.txt +++ b/Documentation/networking/checksum-offloads.txt @@ -35,6 +35,9 @@ This interface only allows a single checksum to be offloaded. Where encapsulation is used, the packet may have multiple checksum fields in different header layers, and the rest will have to be handled by another mechanism such as LCO or RCO. +CRC32c can also be offloaded using this interface, by means of filling + skb->csum_start and skb->csum_offset as described above, and setting + skb->csum_not_inet: see skbuff.h comment (section 'D') for more details. No offloading of the IP header checksum is performed; it is always done in software. This is OK because when we build the IP header, we obviously have it in cache, so summing it isn't expensive. It's also rather short. @@ -49,9 +52,9 @@ A driver declares its offload capabilities in netdev->hw_features; see and csum_offset given in the SKB; if it tries to deduce these itself in hardware (as some NICs do) the driver should check that the values in the SKB match those which the hardware will deduce, and if not, fall back to - checksumming in software instead (with skb_checksum_help or one of the - skb_csum_off_chk* functions as mentioned in include/linux/skbuff.h). This - is a pain, but that's what you get when hardware tries to be clever. + checksumming in software instead (with skb_csum_hwoffload_help() or one of + the skb_checksum_help() / skb_crc32c_csum_help functions, as mentioned in + include/linux/skbuff.h). The stack should, for the most part, assume that checksum offload is supported by the underlying device. The only place that should check is @@ -60,7 +63,7 @@ The stack should, for the most part, assume that checksum offload is may include other offloads besides TX Checksum Offload) and, if they are not supported or enabled on the device (determined by netdev->features), performs the corresponding offload in software. In the case of TX - Checksum Offload, that means calling skb_checksum_help(skb). + Checksum Offload, that means calling skb_csum_hwoffload_help(skb, features). LCO: Local Checksum Offload diff --git a/Documentation/networking/conf.py b/Documentation/networking/conf.py new file mode 100644 index 000000000000..40f69e67a883 --- /dev/null +++ b/Documentation/networking/conf.py @@ -0,0 +1,10 @@ +# -*- coding: utf-8; mode: python -*- + +project = "Linux Networking Documentation" + +tags.add("subproject") + +latex_documents = [ + ('index', 'networking.tex', project, + 'The kernel development community', 'manual'), +] diff --git a/Documentation/networking/dns_resolver.txt b/Documentation/networking/dns_resolver.txt index d86adcdae420..eaa8f9a6fd5d 100644 --- a/Documentation/networking/dns_resolver.txt +++ b/Documentation/networking/dns_resolver.txt @@ -143,7 +143,7 @@ the key will be discarded and recreated when the data it holds has expired. dns_query() returns a copy of the value attached to the key, or an error if that is indicated instead. -See <file:Documentation/security/keys-request-key.txt> for further +See <file:Documentation/security/keys/request-key.rst> for further information about request-key function. diff --git a/Documentation/networking/i40evf.txt b/Documentation/networking/i40evf.txt index 21e41271af79..e9b3035b95d0 100644 --- a/Documentation/networking/i40evf.txt +++ b/Documentation/networking/i40evf.txt @@ -1,8 +1,8 @@ Linux* Base Driver for Intel(R) Network Connection ================================================== -Intel XL710 X710 Virtual Function Linux driver. -Copyright(c) 2013 Intel Corporation. +Intel Ethernet Adaptive Virtual Function Linux driver. +Copyright(c) 2013-2017 Intel Corporation. Contents ======== @@ -11,19 +11,26 @@ Contents - Known Issues/Troubleshooting - Support -This file describes the i40evf Linux* Base Driver for the Intel(R) XL710 -X710 Virtual Function. +This file describes the i40evf Linux* Base Driver. -The i40evf driver supports XL710 and X710 virtual function devices that -can only be activated on kernels with CONFIG_PCI_IOV enabled. +The i40evf driver supports the below mentioned virtual function +devices and can only be activated on kernels running the i40e or +newer Physical Function (PF) driver compiled with CONFIG_PCI_IOV. +The i40evf driver requires CONFIG_PCI_MSI to be enabled. The guest OS loading the i40evf driver must support MSI-X interrupts. +Supported Hardware +================== +Intel XL710 X710 Virtual Function +Intel Ethernet Adaptive Virtual Function +Intel X722 Virtual Function + Identifying Your Adapter ======================== -For more information on how to identify your adapter, go to the Adapter & -Driver ID Guide at: +For more information on how to identify your adapter, go to the +Adapter & Driver ID Guide at: http://support.intel.com/support/go/network/adapter/idguide.htm diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst new file mode 100644 index 000000000000..b5bd87e01f52 --- /dev/null +++ b/Documentation/networking/index.rst @@ -0,0 +1,18 @@ +Linux Networking Documentation +============================== + +Contents: + +.. toctree:: + :maxdepth: 2 + + kapi + z8530book + +.. only:: subproject + + Indices + ======= + + * :ref:`genindex` + diff --git a/Documentation/networking/ipvlan.txt b/Documentation/networking/ipvlan.txt index 24196cef7c91..1fe42a874aae 100644 --- a/Documentation/networking/ipvlan.txt +++ b/Documentation/networking/ipvlan.txt @@ -22,9 +22,9 @@ The driver can be built into the kernel (CONFIG_IPVLAN=y) or as a module There are no module parameters for this driver and it can be configured using IProute2/ip utility. - ip link add link <master-dev> <slave-dev> type ipvlan mode { l2 | l3 | l3s } + ip link add link <master-dev> name <slave-dev> type ipvlan mode { l2 | l3 | l3s } - e.g. ip link add link ipvl0 eth0 type ipvlan mode l2 + e.g. ip link add link eth0 name ipvl0 type ipvlan mode l2 4. Operating modes: diff --git a/Documentation/networking/kapi.rst b/Documentation/networking/kapi.rst new file mode 100644 index 000000000000..580289f345da --- /dev/null +++ b/Documentation/networking/kapi.rst @@ -0,0 +1,147 @@ +========================================= +Linux Networking and Network Devices APIs +========================================= + +Linux Networking +================ + +Networking Base Types +--------------------- + +.. kernel-doc:: include/linux/net.h + :internal: + +Socket Buffer Functions +----------------------- + +.. kernel-doc:: include/linux/skbuff.h + :internal: + +.. kernel-doc:: include/net/sock.h + :internal: + +.. kernel-doc:: net/socket.c + :export: + +.. kernel-doc:: net/core/skbuff.c + :export: + +.. kernel-doc:: net/core/sock.c + :export: + +.. kernel-doc:: net/core/datagram.c + :export: + +.. kernel-doc:: net/core/stream.c + :export: + +Socket Filter +------------- + +.. kernel-doc:: net/core/filter.c + :export: + +Generic Network Statistics +-------------------------- + +.. kernel-doc:: include/uapi/linux/gen_stats.h + :internal: + +.. kernel-doc:: net/core/gen_stats.c + :export: + +.. kernel-doc:: net/core/gen_estimator.c + :export: + +SUN RPC subsystem +----------------- + +.. kernel-doc:: net/sunrpc/xdr.c + :export: + +.. kernel-doc:: net/sunrpc/svc_xprt.c + :export: + +.. kernel-doc:: net/sunrpc/xprt.c + :export: + +.. kernel-doc:: net/sunrpc/sched.c + :export: + +.. kernel-doc:: net/sunrpc/socklib.c + :export: + +.. kernel-doc:: net/sunrpc/stats.c + :export: + +.. kernel-doc:: net/sunrpc/rpc_pipe.c + :export: + +.. kernel-doc:: net/sunrpc/rpcb_clnt.c + :export: + +.. kernel-doc:: net/sunrpc/clnt.c + :export: + +WiMAX +----- + +.. kernel-doc:: net/wimax/op-msg.c + :export: + +.. kernel-doc:: net/wimax/op-reset.c + :export: + +.. kernel-doc:: net/wimax/op-rfkill.c + :export: + +.. kernel-doc:: net/wimax/stack.c + :export: + +.. kernel-doc:: include/net/wimax.h + :internal: + +.. kernel-doc:: include/uapi/linux/wimax.h + :internal: + +Network device support +====================== + +Driver Support +-------------- + +.. kernel-doc:: net/core/dev.c + :export: + +.. kernel-doc:: net/ethernet/eth.c + :export: + +.. kernel-doc:: net/sched/sch_generic.c + :export: + +.. kernel-doc:: include/linux/etherdevice.h + :internal: + +.. kernel-doc:: include/linux/netdevice.h + :internal: + +PHY Support +----------- + +.. kernel-doc:: drivers/net/phy/phy.c + :export: + +.. kernel-doc:: drivers/net/phy/phy.c + :internal: + +.. kernel-doc:: drivers/net/phy/phy_device.c + :export: + +.. kernel-doc:: drivers/net/phy/phy_device.c + :internal: + +.. kernel-doc:: drivers/net/phy/mdio_bus.c + :export: + +.. kernel-doc:: drivers/net/phy/mdio_bus.c + :internal: diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt index 16f90d817224..bdec0f700bc1 100644 --- a/Documentation/networking/phy.txt +++ b/Documentation/networking/phy.txt @@ -295,7 +295,6 @@ Doing it all yourself settings in the PHY. int phy_ethtool_sset(struct phy_device *phydev, struct ethtool_cmd *cmd); - int phy_ethtool_gset(struct phy_device *phydev, struct ethtool_cmd *cmd); Ethtool convenience functions. diff --git a/Documentation/networking/policy-routing.txt b/Documentation/networking/policy-routing.txt deleted file mode 100644 index 36f6936d7f21..000000000000 --- a/Documentation/networking/policy-routing.txt +++ /dev/null @@ -1,150 +0,0 @@ -Classes -------- - - "Class" is a complete routing table in common sense. - I.e. it is tree of nodes (destination prefix, tos, metric) - with attached information: gateway, device etc. - This tree is looked up as specified in RFC1812 5.2.4.3 - 1. Basic match - 2. Longest match - 3. Weak TOS. - 4. Metric. (should not be in kernel space, but they are) - 5. Additional pruning rules. (not in kernel space). - - We have two special type of nodes: - REJECT - abort route lookup and return an error value. - THROW - abort route lookup in this class. - - - Currently the number of classes is limited to 255 - (0 is reserved for "not specified class") - - Three classes are builtin: - - RT_CLASS_LOCAL=255 - local interface addresses, - broadcasts, nat addresses. - - RT_CLASS_MAIN=254 - all normal routes are put there - by default. - - RT_CLASS_DEFAULT=253 - if ip_fib_model==1, then - normal default routes are put there, if ip_fib_model==2 - all gateway routes are put there. - - -Rules ------ - Rule is a record of (src prefix, src interface, tos, dst prefix) - with attached information. - - Rule types: - RTP_ROUTE - lookup in attached class - RTP_NAT - lookup in attached class and if a match is found, - translate packet source address. - RTP_MASQUERADE - lookup in attached class and if a match is found, - masquerade packet as sourced by us. - RTP_DROP - silently drop the packet. - RTP_REJECT - drop the packet and send ICMP NET UNREACHABLE. - RTP_PROHIBIT - drop the packet and send ICMP COMM. ADM. PROHIBITED. - - Rule flags: - RTRF_LOG - log route creations. - RTRF_VALVE - One way route (used with masquerading) - -Default setup: - -root@amber:/pub/ip-routing # iproute -r -Kernel routing policy rules -Pref Source Destination TOS Iface Cl - 0 default default 00 * 255 - 254 default default 00 * 254 - 255 default default 00 * 253 - - -Lookup algorithm ----------------- - - We scan rules list, and if a rule is matched, apply it. - If a route is found, return it. - If it is not found or a THROW node was matched, continue - to scan rules. - -Applications ------------- - -1. Just ignore classes. All the routes are put into MAIN class - (and/or into DEFAULT class). - - HOWTO: iproute add PREFIX [ tos TOS ] [ gw GW ] [ dev DEV ] - [ metric METRIC ] [ reject ] ... (look at iproute utility) - - or use route utility from current net-tools. - -2. Opposite case. Just forget all that you know about routing - tables. Every rule is supplied with its own gateway, device - info. record. This approach is not appropriate for automated - route maintenance, but it is ideal for manual configuration. - - HOWTO: iproute addrule [ from PREFIX ] [ to PREFIX ] [ tos TOS ] - [ dev INPUTDEV] [ pref PREFERENCE ] route [ gw GATEWAY ] - [ dev OUTDEV ] ..... - - Warning: As of now the size of the routing table in this - approach is limited to 256. If someone likes this model, I'll - relax this limitation. - -3. OSPF classes (see RFC1583, RFC1812 E.3.3) - Very clean, stable and robust algorithm for OSPF routing - domains. Unfortunately, it is not widely used in the Internet. - - Proposed setup: - 255 local addresses - 254 interface routes - 253 ASE routes with external metric - 252 ASE routes with internal metric - 251 inter-area routes - 250 intra-area routes for 1st area - 249 intra-area routes for 2nd area - etc. - - Rules: - iproute addrule class 253 - iproute addrule class 252 - iproute addrule class 251 - iproute addrule to a-prefix-for-1st-area class 250 - iproute addrule to another-prefix-for-1st-area class 250 - ... - iproute addrule to a-prefix-for-2nd-area class 249 - ... - - Area classes must be terminated with reject record. - iproute add default reject class 250 - iproute add default reject class 249 - ... - -4. The Variant Router Requirements Algorithm (RFC1812 E.3.2) - Create 16 classes for different TOS values. - It is a funny, but pretty useless algorithm. - I listed it just to show the power of new routing code. - -5. All the variety of combinations...... - - -GATED ------ - - Gated does not understand classes, but it will work - happily in MAIN+DEFAULT. All policy routes can be set - and maintained manually. - -IMPORTANT NOTE --------------- - route.c has a compilation time switch CONFIG_IP_LOCAL_RT_POLICY. - If it is set, locally originated packets are routed - using all the policy list. This is not very convenient and - pretty ambiguous when used with NAT and masquerading. - I set it to FALSE by default. - - -Alexey Kuznetov -kuznet@ms2.inr.ac.ru diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt index 1b63bbc6b94f..8c70ba5dee4d 100644 --- a/Documentation/networking/rxrpc.txt +++ b/Documentation/networking/rxrpc.txt @@ -325,6 +325,9 @@ calls, to invoke certain actions and to report certain conditions. These are: RXRPC_LOCAL_ERROR -rt error num Local error encountered RXRPC_NEW_CALL -r- n/a New call received RXRPC_ACCEPT s-- n/a Accept new call + RXRPC_EXCLUSIVE_CALL s-- n/a Make an exclusive client call + RXRPC_UPGRADE_SERVICE s-- n/a Client call can be upgraded + RXRPC_TX_LENGTH s-- data len Total length of Tx data (SRT = usable in Sendmsg / delivered by Recvmsg / Terminal message) @@ -387,6 +390,40 @@ calls, to invoke certain actions and to report certain conditions. These are: return error ENODATA. If the user ID is already in use by another call, then error EBADSLT will be returned. + (*) RXRPC_EXCLUSIVE_CALL + + This is used to indicate that a client call should be made on a one-off + connection. The connection is discarded once the call has terminated. + + (*) RXRPC_UPGRADE_SERVICE + + This is used to make a client call to probe if the specified service ID + may be upgraded by the server. The caller must check msg_name returned to + recvmsg() for the service ID actually in use. The operation probed must + be one that takes the same arguments in both services. + + Once this has been used to establish the upgrade capability (or lack + thereof) of the server, the service ID returned should be used for all + future communication to that server and RXRPC_UPGRADE_SERVICE should no + longer be set. + + (*) RXRPC_TX_LENGTH + + This is used to inform the kernel of the total amount of data that is + going to be transmitted by a call (whether in a client request or a + service response). If given, it allows the kernel to encrypt from the + userspace buffer directly to the packet buffers, rather than copying into + the buffer and then encrypting in place. This may only be given with the + first sendmsg() providing data for a call. EMSGSIZE will be generated if + the amount of data actually given is different. + + This takes a parameter of __s64 type that indicates how much will be + transmitted. This may not be less than zero. + +The symbol RXRPC__SUPPORTED is defined as one more than the highest control +message type supported. At run time this can be queried by means of the +RXRPC_SUPPORTED_CMSG socket option (see below). + ============== SOCKET OPTIONS @@ -433,6 +470,18 @@ AF_RXRPC sockets support a few socket options at the SOL_RXRPC level: Encrypted checksum plus entire packet padded and encrypted, including actual packet length. + (*) RXRPC_UPGRADEABLE_SERVICE + + This is used to indicate that a service socket with two bindings may + upgrade one bound service to the other if requested by the client. optval + must point to an array of two unsigned short ints. The first is the + service ID to upgrade from and the second the service ID to upgrade to. + + (*) RXRPC_SUPPORTED_CMSG + + This is a read-only option that writes an int into the buffer indicating + the highest control message type supported. + ======== SECURITY @@ -542,6 +591,9 @@ A client would issue an operation by: MSG_MORE should be set in msghdr::msg_flags on all but the last part of the request. Multiple requests may be made simultaneously. + An RXRPC_TX_LENGTH control message can also be specified on the first + sendmsg() call. + If a call is intended to go to a destination other than the default specified through connect(), then msghdr::msg_name should be set on the first request message of that call. @@ -559,6 +611,17 @@ A client would issue an operation by: buffer instead, and MSG_EOR will be flagged to indicate the end of that call. +A client may ask for a service ID it knows and ask that this be upgraded to a +better service if one is available by supplying RXRPC_UPGRADE_SERVICE on the +first sendmsg() of a call. The client should then check srx_service in the +msg_name filled in by recvmsg() when collecting the result. srx_service will +hold the same value as given to sendmsg() if the upgrade request was ignored by +the service - otherwise it will be altered to indicate the service ID the +server upgraded to. Note that the upgraded service ID is chosen by the server. +The caller has to wait until it sees the service ID in the reply before sending +any more calls (further calls to the same destination will be blocked until the +probe is concluded). + ==================== EXAMPLE SERVER USAGE @@ -588,7 +651,7 @@ A server would be set up to accept operations in the following manner: The keyring can be manipulated after it has been given to the socket. This permits the server to add more keys, replace keys, etc. whilst it is live. - (2) A local address must then be bound: + (3) A local address must then be bound: struct sockaddr_rxrpc srx = { .srx_family = AF_RXRPC, @@ -600,11 +663,26 @@ A server would be set up to accept operations in the following manner: }; bind(server, &srx, sizeof(srx)); - (3) The server is then set to listen out for incoming calls: + More than one service ID may be bound to a socket, provided the transport + parameters are the same. The limit is currently two. To do this, bind() + should be called twice. + + (4) If service upgrading is required, first two service IDs must have been + bound and then the following option must be set: + + unsigned short service_ids[2] = { from_ID, to_ID }; + setsockopt(server, SOL_RXRPC, RXRPC_UPGRADEABLE_SERVICE, + service_ids, sizeof(service_ids)); + + This will automatically upgrade connections on service from_ID to service + to_ID if they request it. This will be reflected in msg_name obtained + through recvmsg() when the request data is delivered to userspace. + + (5) The server is then set to listen out for incoming calls: listen(server, 100); - (4) The kernel notifies the server of pending incoming connections by sending + (6) The kernel notifies the server of pending incoming connections by sending it a message for each. This is received with recvmsg() on the server socket. It has no data, and has a single dataless control message attached: @@ -616,13 +694,13 @@ A server would be set up to accept operations in the following manner: the time it is accepted - in which case the first call still on the queue will be accepted. - (5) The server then accepts the new call by issuing a sendmsg() with two + (7) The server then accepts the new call by issuing a sendmsg() with two pieces of control data and no actual data: RXRPC_ACCEPT - indicate connection acceptance RXRPC_USER_CALL_ID - specify user ID for this call - (6) The first request data packet will then be posted to the server socket for + (8) The first request data packet will then be posted to the server socket for recvmsg() to pick up. At that point, the RxRPC address for the call can be read from the address fields in the msghdr struct. @@ -634,7 +712,7 @@ A server would be set up to accept operations in the following manner: RXRPC_USER_CALL_ID - specifies the user ID for this call - (8) The reply data should then be posted to the server socket using a series + (9) The reply data should then be posted to the server socket using a series of sendmsg() calls, each with the following control messages attached: RXRPC_USER_CALL_ID - specifies the user ID for this call @@ -642,7 +720,7 @@ A server would be set up to accept operations in the following manner: MSG_MORE should be set in msghdr::msg_flags on all but the last message for a particular call. - (9) The final ACK from the client will be posted for retrieval by recvmsg() +(10) The final ACK from the client will be posted for retrieval by recvmsg() when it is received. It will take the form of a dataless message with two control messages attached: @@ -652,7 +730,7 @@ A server would be set up to accept operations in the following manner: MSG_EOR will be flagged to indicate that this is the final message for this call. -(10) Up to the point the final packet of reply data is sent, the call can be +(11) Up to the point the final packet of reply data is sent, the call can be aborted by calling sendmsg() with a dataless message with the following control messages attached: @@ -703,6 +781,7 @@ The kernel interface functions are as follows: struct sockaddr_rxrpc *srx, struct key *key, unsigned long user_call_ID, + s64 tx_total_len, gfp_t gfp); This allocates the infrastructure to make a new RxRPC call and assigns @@ -719,6 +798,11 @@ The kernel interface functions are as follows: control data buffer. It is entirely feasible to use this to point to a kernel data structure. + tx_total_len is the amount of data the caller is intending to transmit + with this call (or -1 if unknown at this point). Setting the data size + allows the kernel to encrypt directly to the packet buffers, thereby + saving a copy. The value may not be less than -1. + If this function is successful, an opaque reference to the RxRPC call is returned. The caller now holds a reference on this and it must be properly ended. @@ -870,6 +954,17 @@ The kernel interface functions are as follows: This is used to find the remote peer address of a call. + (*) Set the total transmit data size on a call. + + void rxrpc_kernel_set_tx_length(struct socket *sock, + struct rxrpc_call *call, + s64 tx_total_len); + + This sets the amount of data that the caller is intending to transmit on a + call. It's intended to be used for setting the reply size as the request + size should be set when the call is begun. tx_total_len may not be less + than zero. + ======================= CONFIGURABLE PARAMETERS diff --git a/Documentation/networking/segmentation-offloads.txt b/Documentation/networking/segmentation-offloads.txt index f200467ade38..2f09455a993a 100644 --- a/Documentation/networking/segmentation-offloads.txt +++ b/Documentation/networking/segmentation-offloads.txt @@ -55,7 +55,7 @@ IPIP, SIT, GRE, UDP Tunnel, and Remote Checksum Offloads In addition to the offloads described above it is possible for a frame to contain additional headers such as an outer tunnel. In order to account for such instances an additional set of segmentation offload types were -introduced including SKB_GSO_IPIP, SKB_GSO_SIT, SKB_GSO_GRE, and +introduced including SKB_GSO_IPXIP4, SKB_GSO_IPXIP6, SKB_GSO_GRE, and SKB_GSO_UDP_TUNNEL. These extra segmentation types are used to identify cases where there are more than just 1 set of headers. For example in the case of IPIP and SIT we should have the network and transport headers moved diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt index 96f50694a748..1be0b6f9e0cb 100644 --- a/Documentation/networking/timestamping.txt +++ b/Documentation/networking/timestamping.txt @@ -44,8 +44,7 @@ timeval of SO_TIMESTAMP (ms). Supports multiple types of timestamp requests. As a result, this socket option takes a bitmap of flags, not a boolean. In - err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) val, - sizeof(val)); + err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)); val is an integer with any of the following bits set. Setting other bit returns EINVAL and does not change the current state. @@ -193,6 +192,24 @@ SOF_TIMESTAMPING_OPT_STATS: the transmit timestamps, such as how long a certain block of data was limited by peer's receiver window. +SOF_TIMESTAMPING_OPT_PKTINFO: + + Enable the SCM_TIMESTAMPING_PKTINFO control message for incoming + packets with hardware timestamps. The message contains struct + scm_ts_pktinfo, which supplies the index of the real interface which + received the packet and its length at layer 2. A valid (non-zero) + interface index will be returned only if CONFIG_NET_RX_BUSY_POLL is + enabled and the driver is using NAPI. The struct contains also two + other fields, but they are reserved and undefined. + +SOF_TIMESTAMPING_OPT_TX_SWHW: + + Request both hardware and software timestamps for outgoing packets + when SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE + are enabled at the same time. If both timestamps are generated, + two separate messages will be looped to the socket's error queue, + each containing just one timestamp. + New applications are encouraged to pass SOF_TIMESTAMPING_OPT_ID to disambiguate timestamps and SOF_TIMESTAMPING_OPT_TSONLY to operate regardless of the setting of sysctl net.core.tstamp_allow_data. @@ -231,8 +248,7 @@ setsockopt to receive timestamps: __u32 val = SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID /* or any other flag */; - err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) val, - sizeof(val)); + err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)); 1.4 Bytestream Timestamps @@ -312,7 +328,7 @@ struct scm_timestamping { }; The structure can return up to three timestamps. This is a legacy -feature. Only one field is non-zero at any time. Most timestamps +feature. At least one field is non-zero at any time. Most timestamps are passed in ts[0]. Hardware timestamps are passed in ts[2]. ts[1] used to hold hardware timestamps converted to system time. @@ -321,6 +337,12 @@ a HW PTP clock source, to allow time conversion in userspace and optionally synchronize system time with a userspace PTP stack such as linuxptp. For the PTP clock API, see Documentation/ptp/ptp.txt. +Note that if the SO_TIMESTAMP or SO_TIMESTAMPNS option is enabled +together with SO_TIMESTAMPING using SOF_TIMESTAMPING_SOFTWARE, a false +software timestamp will be generated in the recvmsg() call and passed +in ts[0] when a real software timestamp is missing. This happens also +on hardware transmit timestamps. + 2.1.1 Transmit timestamps with MSG_ERRQUEUE For transmit timestamps the outgoing packet is looped back to the diff --git a/Documentation/networking/tls.txt b/Documentation/networking/tls.txt new file mode 100644 index 000000000000..77ed00631c12 --- /dev/null +++ b/Documentation/networking/tls.txt @@ -0,0 +1,135 @@ +Overview +======== + +Transport Layer Security (TLS) is a Upper Layer Protocol (ULP) that runs over +TCP. TLS provides end-to-end data integrity and confidentiality. + +User interface +============== + +Creating a TLS connection +------------------------- + +First create a new TCP socket and set the TLS ULP. + + sock = socket(AF_INET, SOCK_STREAM, 0); + setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")); + +Setting the TLS ULP allows us to set/get TLS socket options. Currently +only the symmetric encryption is handled in the kernel. After the TLS +handshake is complete, we have all the parameters required to move the +data-path to the kernel. There is a separate socket option for moving +the transmit and the receive into the kernel. + + /* From linux/tls.h */ + struct tls_crypto_info { + unsigned short version; + unsigned short cipher_type; + }; + + struct tls12_crypto_info_aes_gcm_128 { + struct tls_crypto_info info; + unsigned char iv[TLS_CIPHER_AES_GCM_128_IV_SIZE]; + unsigned char key[TLS_CIPHER_AES_GCM_128_KEY_SIZE]; + unsigned char salt[TLS_CIPHER_AES_GCM_128_SALT_SIZE]; + unsigned char rec_seq[TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE]; + }; + + + struct tls12_crypto_info_aes_gcm_128 crypto_info; + + crypto_info.info.version = TLS_1_2_VERSION; + crypto_info.info.cipher_type = TLS_CIPHER_AES_GCM_128; + memcpy(crypto_info.iv, iv_write, TLS_CIPHER_AES_GCM_128_IV_SIZE); + memcpy(crypto_info.rec_seq, seq_number_write, + TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE); + memcpy(crypto_info.key, cipher_key_write, TLS_CIPHER_AES_GCM_128_KEY_SIZE); + memcpy(crypto_info.salt, implicit_iv_write, TLS_CIPHER_AES_GCM_128_SALT_SIZE); + + setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info)); + +Sending TLS application data +---------------------------- + +After setting the TLS_TX socket option all application data sent over this +socket is encrypted using TLS and the parameters provided in the socket option. +For example, we can send an encrypted hello world record as follows: + + const char *msg = "hello world\n"; + send(sock, msg, strlen(msg)); + +send() data is directly encrypted from the userspace buffer provided +to the encrypted kernel send buffer if possible. + +The sendfile system call will send the file's data over TLS records of maximum +length (2^14). + + file = open(filename, O_RDONLY); + fstat(file, &stat); + sendfile(sock, file, &offset, stat.st_size); + +TLS records are created and sent after each send() call, unless +MSG_MORE is passed. MSG_MORE will delay creation of a record until +MSG_MORE is not passed, or the maximum record size is reached. + +The kernel will need to allocate a buffer for the encrypted data. +This buffer is allocated at the time send() is called, such that +either the entire send() call will return -ENOMEM (or block waiting +for memory), or the encryption will always succeed. If send() returns +-ENOMEM and some data was left on the socket buffer from a previous +call using MSG_MORE, the MSG_MORE data is left on the socket buffer. + +Send TLS control messages +------------------------- + +Other than application data, TLS has control messages such as alert +messages (record type 21) and handshake messages (record type 22), etc. +These messages can be sent over the socket by providing the TLS record type +via a CMSG. For example the following function sends @data of @length bytes +using a record of type @record_type. + +/* send TLS control message using record_type */ + static int klts_send_ctrl_message(int sock, unsigned char record_type, + void *data, size_t length) + { + struct msghdr msg = {0}; + int cmsg_len = sizeof(record_type); + struct cmsghdr *cmsg; + char buf[CMSG_SPACE(cmsg_len)]; + struct iovec msg_iov; /* Vector of data to send/receive into. */ + + msg.msg_control = buf; + msg.msg_controllen = sizeof(buf); + cmsg = CMSG_FIRSTHDR(&msg); + cmsg->cmsg_level = SOL_TLS; + cmsg->cmsg_type = TLS_SET_RECORD_TYPE; + cmsg->cmsg_len = CMSG_LEN(cmsg_len); + *CMSG_DATA(cmsg) = record_type; + msg.msg_controllen = cmsg->cmsg_len; + + msg_iov.iov_base = data; + msg_iov.iov_len = length; + msg.msg_iov = &msg_iov; + msg.msg_iovlen = 1; + + return sendmsg(sock, &msg, 0); + } + +Control message data should be provided unencrypted, and will be +encrypted by the kernel. + +Integrating in to userspace TLS library +--------------------------------------- + +At a high level, the kernel TLS ULP is a replacement for the record +layer of a userspace TLS library. + +A patchset to OpenSSL to use ktls as the record layer is here: + +https://github.com/Mellanox/tls-openssl + +An example of calling send directly after a handshake using +gnutls. Since it doesn't implement a full record layer, control +messages are not supported: + +https://github.com/Mellanox/tls-af_ktls_tool diff --git a/Documentation/networking/z8530book.rst b/Documentation/networking/z8530book.rst new file mode 100644 index 000000000000..fea2c40e7973 --- /dev/null +++ b/Documentation/networking/z8530book.rst @@ -0,0 +1,256 @@ +======================= +Z8530 Programming Guide +======================= + +:Author: Alan Cox + +Introduction +============ + +The Z85x30 family synchronous/asynchronous controller chips are used on +a large number of cheap network interface cards. The kernel provides a +core interface layer that is designed to make it easy to provide WAN +services using this chip. + +The current driver only support synchronous operation. Merging the +asynchronous driver support into this code to allow any Z85x30 device to +be used as both a tty interface and as a synchronous controller is a +project for Linux post the 2.4 release + +Driver Modes +============ + +The Z85230 driver layer can drive Z8530, Z85C30 and Z85230 devices in +three different modes. Each mode can be applied to an individual channel +on the chip (each chip has two channels). + +The PIO synchronous mode supports the most common Z8530 wiring. Here the +chip is interface to the I/O and interrupt facilities of the host +machine but not to the DMA subsystem. When running PIO the Z8530 has +extremely tight timing requirements. Doing high speeds, even with a +Z85230 will be tricky. Typically you should expect to achieve at best +9600 baud with a Z8C530 and 64Kbits with a Z85230. + +The DMA mode supports the chip when it is configured to use dual DMA +channels on an ISA bus. The better cards tend to support this mode of +operation for a single channel. With DMA running the Z85230 tops out +when it starts to hit ISA DMA constraints at about 512Kbits. It is worth +noting here that many PC machines hang or crash when the chip is driven +fast enough to hold the ISA bus solid. + +Transmit DMA mode uses a single DMA channel. The DMA channel is used for +transmission as the transmit FIFO is smaller than the receive FIFO. it +gives better performance than pure PIO mode but is nowhere near as ideal +as pure DMA mode. + +Using the Z85230 driver +======================= + +The Z85230 driver provides the back end interface to your board. To +configure a Z8530 interface you need to detect the board and to identify +its ports and interrupt resources. It is also your problem to verify the +resources are available. + +Having identified the chip you need to fill in a struct z8530_dev, +which describes each chip. This object must exist until you finally +shutdown the board. Firstly zero the active field. This ensures nothing +goes off without you intending it. The irq field should be set to the +interrupt number of the chip. (Each chip has a single interrupt source +rather than each channel). You are responsible for allocating the +interrupt line. The interrupt handler should be set to +:c:func:`z8530_interrupt()`. The device id should be set to the +z8530_dev structure pointer. Whether the interrupt can be shared or not +is board dependent, and up to you to initialise. + +The structure holds two channel structures. Initialise chanA.ctrlio and +chanA.dataio with the address of the control and data ports. You can or +this with Z8530_PORT_SLEEP to indicate your interface needs the 5uS +delay for chip settling done in software. The PORT_SLEEP option is +architecture specific. Other flags may become available on future +platforms, eg for MMIO. Initialise the chanA.irqs to &z8530_nop to +start the chip up as disabled and discarding interrupt events. This +ensures that stray interrupts will be mopped up and not hang the bus. +Set chanA.dev to point to the device structure itself. The private and +name field you may use as you wish. The private field is unused by the +Z85230 layer. The name is used for error reporting and it may thus make +sense to make it match the network name. + +Repeat the same operation with the B channel if your chip has both +channels wired to something useful. This isn't always the case. If it is +not wired then the I/O values do not matter, but you must initialise +chanB.dev. + +If your board has DMA facilities then initialise the txdma and rxdma +fields for the relevant channels. You must also allocate the ISA DMA +channels and do any necessary board level initialisation to configure +them. The low level driver will do the Z8530 and DMA controller +programming but not board specific magic. + +Having initialised the device you can then call +:c:func:`z8530_init()`. This will probe the chip and reset it into +a known state. An identification sequence is then run to identify the +chip type. If the checks fail to pass the function returns a non zero +error code. Typically this indicates that the port given is not valid. +After this call the type field of the z8530_dev structure is +initialised to either Z8530, Z85C30 or Z85230 according to the chip +found. + +Once you have called z8530_init you can also make use of the utility +function :c:func:`z8530_describe()`. This provides a consistent +reporting format for the Z8530 devices, and allows all the drivers to +provide consistent reporting. + +Attaching Network Interfaces +============================ + +If you wish to use the network interface facilities of the driver, then +you need to attach a network device to each channel that is present and +in use. In addition to use the generic HDLC you need to follow some +additional plumbing rules. They may seem complex but a look at the +example hostess_sv11 driver should reassure you. + +The network device used for each channel should be pointed to by the +netdevice field of each channel. The hdlc-> priv field of the network +device points to your private data - you will need to be able to find +your private data from this. + +The way most drivers approach this particular problem is to create a +structure holding the Z8530 device definition and put that into the +private field of the network device. The network device fields of the +channels then point back to the network devices. + +If you wish to use the generic HDLC then you need to register the HDLC +device. + +Before you register your network device you will also need to provide +suitable handlers for most of the network device callbacks. See the +network device documentation for more details on this. + +Configuring And Activating The Port +=================================== + +The Z85230 driver provides helper functions and tables to load the port +registers on the Z8530 chips. When programming the register settings for +a channel be aware that the documentation recommends initialisation +orders. Strange things happen when these are not followed. + +:c:func:`z8530_channel_load()` takes an array of pairs of +initialisation values in an array of u8 type. The first value is the +Z8530 register number. Add 16 to indicate the alternate register bank on +the later chips. The array is terminated by a 255. + +The driver provides a pair of public tables. The z8530_hdlc_kilostream +table is for the UK 'Kilostream' service and also happens to cover most +other end host configurations. The z8530_hdlc_kilostream_85230 table +is the same configuration using the enhancements of the 85230 chip. The +configuration loaded is standard NRZ encoded synchronous data with HDLC +bitstuffing. All of the timing is taken from the other end of the link. + +When writing your own tables be aware that the driver internally tracks +register values. It may need to reload values. You should therefore be +sure to set registers 1-7, 9-11, 14 and 15 in all configurations. Where +the register settings depend on DMA selection the driver will update the +bits itself when you open or close. Loading a new table with the +interface open is not recommended. + +There are three standard configurations supported by the core code. In +PIO mode the interface is programmed up to use interrupt driven PIO. +This places high demands on the host processor to avoid latency. The +driver is written to take account of latency issues but it cannot avoid +latencies caused by other drivers, notably IDE in PIO mode. Because the +drivers allocate buffers you must also prevent MTU changes while the +port is open. + +Once the port is open it will call the rx_function of each channel +whenever a completed packet arrived. This is invoked from interrupt +context and passes you the channel and a network buffer (struct +sk_buff) holding the data. The data includes the CRC bytes so most +users will want to trim the last two bytes before processing the data. +This function is very timing critical. When you wish to simply discard +data the support code provides the function +:c:func:`z8530_null_rx()` to discard the data. + +To active PIO mode sending and receiving the ``z8530_sync_open`` is called. +This expects to be passed the network device and the channel. Typically +this is called from your network device open callback. On a failure a +non zero error status is returned. +The :c:func:`z8530_sync_close()` function shuts down a PIO +channel. This must be done before the channel is opened again and before +the driver shuts down and unloads. + +The ideal mode of operation is dual channel DMA mode. Here the kernel +driver will configure the board for DMA in both directions. The driver +also handles ISA DMA issues such as controller programming and the +memory range limit for you. This mode is activated by calling the +:c:func:`z8530_sync_dma_open()` function. On failure a non zero +error value is returned. Once this mode is activated it can be shut down +by calling the :c:func:`z8530_sync_dma_close()`. You must call +the close function matching the open mode you used. + +The final supported mode uses a single DMA channel to drive the transmit +side. As the Z85C30 has a larger FIFO on the receive channel this tends +to increase the maximum speed a little. This is activated by calling the +``z8530_sync_txdma_open``. This returns a non zero error code on failure. The +:c:func:`z8530_sync_txdma_close()` function closes down the Z8530 +interface from this mode. + +Network Layer Functions +======================= + +The Z8530 layer provides functions to queue packets for transmission. +The driver internally buffers the frame currently being transmitted and +one further frame (in order to keep back to back transmission running). +Any further buffering is up to the caller. + +The function :c:func:`z8530_queue_xmit()` takes a network buffer +in sk_buff format and queues it for transmission. The caller must +provide the entire packet with the exception of the bitstuffing and CRC. +This is normally done by the caller via the generic HDLC interface +layer. It returns 0 if the buffer has been queued and non zero values +for queue full. If the function accepts the buffer it becomes property +of the Z8530 layer and the caller should not free it. + +The function :c:func:`z8530_get_stats()` returns a pointer to an +internally maintained per interface statistics block. This provides most +of the interface code needed to implement the network layer get_stats +callback. + +Porting The Z8530 Driver +======================== + +The Z8530 driver is written to be portable. In DMA mode it makes +assumptions about the use of ISA DMA. These are probably warranted in +most cases as the Z85230 in particular was designed to glue to PC type +machines. The PIO mode makes no real assumptions. + +Should you need to retarget the Z8530 driver to another architecture the +only code that should need changing are the port I/O functions. At the +moment these assume PC I/O port accesses. This may not be appropriate +for all platforms. Replacing :c:func:`z8530_read_port()` and +``z8530_write_port`` is intended to be all that is required to port +this driver layer. + +Known Bugs And Assumptions +========================== + +Interrupt Locking + The locking in the driver is done via the global cli/sti lock. This + makes for relatively poor SMP performance. Switching this to use a + per device spin lock would probably materially improve performance. + +Occasional Failures + We have reports of occasional failures when run for very long + periods of time and the driver starts to receive junk frames. At the + moment the cause of this is not clear. + +Public Functions Provided +========================= + +.. kernel-doc:: drivers/net/wan/z85230.c + :export: + +Internal Functions +================== + +.. kernel-doc:: drivers/net/wan/z85230.c + :internal: diff --git a/Documentation/nommu-mmap.txt b/Documentation/nommu-mmap.txt index ae57b9ea0d41..69556f0d494b 100644 --- a/Documentation/nommu-mmap.txt +++ b/Documentation/nommu-mmap.txt @@ -1,6 +1,6 @@ - ============================= - NO-MMU MEMORY MAPPING SUPPORT - ============================= +============================= +No-MMU memory mapping support +============================= The kernel has limited support for memory mapping under no-MMU conditions, such as are used in uClinux environments. From the userspace point of view, memory @@ -16,7 +16,7 @@ the CLONE_VM flag. The behaviour is similar between the MMU and no-MMU cases, but not identical; and it's also much more restricted in the latter case: - (*) Anonymous mapping, MAP_PRIVATE + (#) Anonymous mapping, MAP_PRIVATE In the MMU case: VM regions backed by arbitrary pages; copy-on-write across fork. @@ -24,14 +24,14 @@ and it's also much more restricted in the latter case: In the no-MMU case: VM regions backed by arbitrary contiguous runs of pages. - (*) Anonymous mapping, MAP_SHARED + (#) Anonymous mapping, MAP_SHARED These behave very much like private mappings, except that they're shared across fork() or clone() without CLONE_VM in the MMU case. Since the no-MMU case doesn't support these, behaviour is identical to MAP_PRIVATE there. - (*) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, !PROT_WRITE + (#) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, !PROT_WRITE In the MMU case: VM regions backed by pages read from file; changes to the underlying file are reflected in the mapping; copied across fork. @@ -56,7 +56,7 @@ and it's also much more restricted in the latter case: are visible in other processes (no MMU protection), but should not happen. - (*) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, PROT_WRITE + (#) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, PROT_WRITE In the MMU case: like the non-PROT_WRITE case, except that the pages in question get copied before the write actually happens. From that point @@ -66,7 +66,7 @@ and it's also much more restricted in the latter case: In the no-MMU case: works much like the non-PROT_WRITE case, except that a copy is always taken and never shared. - (*) Regular file / blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE + (#) Regular file / blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE In the MMU case: VM regions backed by pages read from file; changes to pages written back to file; writes to file reflected into pages backing @@ -74,7 +74,7 @@ and it's also much more restricted in the latter case: In the no-MMU case: not supported. - (*) Memory backed regular file, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE + (#) Memory backed regular file, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE In the MMU case: As for ordinary regular files. @@ -85,7 +85,7 @@ and it's also much more restricted in the latter case: as for the MMU case. If the filesystem does not provide any such support, then the mapping request will be denied. - (*) Memory backed blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE + (#) Memory backed blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE In the MMU case: As for ordinary regular files. @@ -94,7 +94,7 @@ and it's also much more restricted in the latter case: truncate being called. The ramdisk driver could do this if it allocated all its memory as a contiguous array upfront. - (*) Memory backed chardev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE + (#) Memory backed chardev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE In the MMU case: As for ordinary regular files. @@ -105,21 +105,20 @@ and it's also much more restricted in the latter case: provide any such support, then the mapping request will be denied. -============================ -FURTHER NOTES ON NO-MMU MMAP +Further notes on no-MMU MMAP ============================ - (*) A request for a private mapping of a file may return a buffer that is not + (#) A request for a private mapping of a file may return a buffer that is not page-aligned. This is because XIP may take place, and the data may not be paged aligned in the backing store. - (*) A request for an anonymous mapping will always be page aligned. If + (#) A request for an anonymous mapping will always be page aligned. If possible the size of the request should be a power of two otherwise some of the space may be wasted as the kernel must allocate a power-of-2 granule but will only discard the excess if appropriately configured as this has an effect on fragmentation. - (*) The memory allocated by a request for an anonymous mapping will normally + (#) The memory allocated by a request for an anonymous mapping will normally be cleared by the kernel before being returned in accordance with the Linux man pages (ver 2.22 or later). @@ -145,24 +144,23 @@ FURTHER NOTES ON NO-MMU MMAP uClibc uses this to speed up malloc(), and the ELF-FDPIC binfmt uses this to allocate the brk and stack region. - (*) A list of all the private copy and anonymous mappings on the system is + (#) A list of all the private copy and anonymous mappings on the system is visible through /proc/maps in no-MMU mode. - (*) A list of all the mappings in use by a process is visible through + (#) A list of all the mappings in use by a process is visible through /proc/<pid>/maps in no-MMU mode. - (*) Supplying MAP_FIXED or a requesting a particular mapping address will + (#) Supplying MAP_FIXED or a requesting a particular mapping address will result in an error. - (*) Files mapped privately usually have to have a read method provided by the + (#) Files mapped privately usually have to have a read method provided by the driver or filesystem so that the contents can be read into the memory allocated if mmap() chooses not to map the backing device directly. An error will result if they don't. This is most likely to be encountered with character device files, pipes, fifos and sockets. -========================== -INTERPROCESS SHARED MEMORY +Interprocess shared memory ========================== Both SYSV IPC SHM shared memory and POSIX shared memory is supported in NOMMU @@ -170,8 +168,7 @@ mode. The former through the usual mechanism, the latter through files created on ramfs or tmpfs mounts. -======= -FUTEXES +Futexes ======= Futexes are supported in NOMMU mode if the arch supports them. An error will @@ -180,12 +177,11 @@ mappings made by a process or if the mapping in which the address lies does not support futexes (such as an I/O chardev mapping). -============= -NO-MMU MREMAP +No-MMU mremap ============= The mremap() function is partially supported. It may change the size of a -mapping, and may move it[*] if MREMAP_MAYMOVE is specified and if the new size +mapping, and may move it [#]_ if MREMAP_MAYMOVE is specified and if the new size of the mapping exceeds the size of the slab object currently occupied by the memory to which the mapping refers, or if a smaller slab object could be used. @@ -200,11 +196,10 @@ a previously mapped object. It may not be used to create holes in existing mappings, move parts of existing mappings or resize parts of mappings. It must act on a complete mapping. -[*] Not currently supported. +.. [#] Not currently supported. -============================================ -PROVIDING SHAREABLE CHARACTER DEVICE SUPPORT +Providing shareable character device support ============================================ To provide shareable character device support, a driver must provide a @@ -235,7 +230,7 @@ direct the call to the device-specific driver. Under such circumstances, the mapping request will be rejected if NOMMU_MAP_COPY is not specified, and a copy mapped otherwise. -IMPORTANT NOTE: +.. important:: Some types of device may present a different appearance to anyone looking at them in certain modes. Flash chips can be like this; for @@ -249,8 +244,7 @@ IMPORTANT NOTE: circumstances! -============================================== -PROVIDING SHAREABLE MEMORY-BACKED FILE SUPPORT +Providing shareable memory-backed file support ============================================== Provision of shared mappings on memory backed files is similar to the provision @@ -267,8 +261,7 @@ Memory backed devices are indicated by the mapping's backing device info having the memory_backed flag set. -======================================== -PROVIDING SHAREABLE BLOCK DEVICE SUPPORT +Providing shareable block device support ======================================== Provision of shared mappings on block device files is exactly the same as for @@ -276,8 +269,7 @@ character devices. If there isn't a real device underneath, then the driver should allocate sufficient contiguous memory to honour any supported mapping. -================================= -ADJUSTING PAGE TRIMMING BEHAVIOUR +Adjusting page trimming behaviour ================================= NOMMU mmap automatically rounds up to the nearest power-of-2 number of pages @@ -288,4 +280,4 @@ allocator. In order to retain finer-grained control over fragmentation, this behaviour can either be disabled completely, or bumped up to a higher page watermark where trimming begins. -Page trimming behaviour is configurable via the sysctl `vm.nr_trim_pages'. +Page trimming behaviour is configurable via the sysctl ``vm.nr_trim_pages``. diff --git a/Documentation/ntb.txt b/Documentation/ntb.txt index 1d9bbabb6c79..a043854d28df 100644 --- a/Documentation/ntb.txt +++ b/Documentation/ntb.txt @@ -1,16 +1,21 @@ -# NTB Drivers +=========== +NTB Drivers +=========== NTB (Non-Transparent Bridge) is a type of PCI-Express bridge chip that connects -the separate memory systems of two computers to the same PCI-Express fabric. -Existing NTB hardware supports a common feature set, including scratchpad -registers, doorbell registers, and memory translation windows. Scratchpad -registers are read-and-writable registers that are accessible from either side -of the device, so that peers can exchange a small amount of information at a -fixed address. Doorbell registers provide a way for peers to send interrupt -events. Memory windows allow translated read and write access to the peer -memory. - -## NTB Core Driver (ntb) +the separate memory systems of two or more computers to the same PCI-Express +fabric. Existing NTB hardware supports a common feature set: doorbell +registers and memory translation windows, as well as non common features like +scratchpad and message registers. Scratchpad registers are read-and-writable +registers that are accessible from either side of the device, so that peers can +exchange a small amount of information at a fixed address. Message registers can +be utilized for the same purpose. Additionally they are provided with with +special status bits to make sure the information isn't rewritten by another +peer. Doorbell registers provide a way for peers to send interrupt events. +Memory windows allow translated read and write access to the peer memory. + +NTB Core Driver (ntb) +===================== The NTB core driver defines an api wrapping the common feature set, and allows clients interested in NTB features to discover NTB the devices supported by @@ -18,7 +23,8 @@ hardware drivers. The term "client" is used here to mean an upper layer component making use of the NTB api. The term "driver," or "hardware driver," is used here to mean a driver for a specific vendor and model of NTB hardware. -## NTB Client Drivers +NTB Client Drivers +================== NTB client drivers should register with the NTB core driver. After registering, the client probe and remove functions will be called appropriately @@ -26,7 +32,90 @@ as ntb hardware, or hardware drivers, are inserted and removed. The registration uses the Linux Device framework, so it should feel familiar to anyone who has written a pci driver. -### NTB Transport Client (ntb\_transport) and NTB Netdev (ntb\_netdev) +NTB Typical client driver implementation +---------------------------------------- + +Primary purpose of NTB is to share some peace of memory between at least two +systems. So the NTB device features like Scratchpad/Message registers are +mainly used to perform the proper memory window initialization. Typically +there are two types of memory window interfaces supported by the NTB API: +inbound translation configured on the local ntb port and outbound translation +configured by the peer, on the peer ntb port. The first type is +depicted on the next figure + +Inbound translation: + Memory: Local NTB Port: Peer NTB Port: Peer MMIO: + ____________ + | dma-mapped |-ntb_mw_set_trans(addr) | + | memory | _v____________ | ______________ + | (addr) |<======| MW xlat addr |<====| MW base addr |<== memory-mapped IO + |------------| |--------------| | |--------------| + +So typical scenario of the first type memory window initialization looks: +1) allocate a memory region, 2) put translated address to NTB config, +3) somehow notify a peer device of performed initialization, 4) peer device +maps corresponding outbound memory window so to have access to the shared +memory region. + +The second type of interface, that implies the shared windows being +initialized by a peer device, is depicted on the figure: + +Outbound translation: + Memory: Local NTB Port: Peer NTB Port: Peer MMIO: + ____________ ______________ + | dma-mapped | | | MW base addr |<== memory-mapped IO + | memory | | |--------------| + | (addr) |<===================| MW xlat addr |<-ntb_peer_mw_set_trans(addr) + |------------| | |--------------| + +Typical scenario of the second type interface initialization would be: +1) allocate a memory region, 2) somehow deliver a translated address to a peer +device, 3) peer puts the translated address to NTB config, 4) peer device maps +outbound memory window so to have access to the shared memory region. + +As one can see the described scenarios can be combined in one portable +algorithm. + Local device: + 1) Allocate memory for a shared window + 2) Initialize memory window by translated address of the allocated region + (it may fail if local memory window initialization is unsupported) + 3) Send the translated address and memory window index to a peer device + Peer device: + 1) Initialize memory window with retrieved address of the allocated + by another device memory region (it may fail if peer memory window + initialization is unsupported) + 2) Map outbound memory window + +In accordance with this scenario, the NTB Memory Window API can be used as +follows: + Local device: + 1) ntb_mw_count(pidx) - retrieve number of memory ranges, which can + be allocated for memory windows between local device and peer device + of port with specified index. + 2) ntb_get_align(pidx, midx) - retrieve parameters restricting the + shared memory region alignment and size. Then memory can be properly + allocated. + 3) Allocate physically contiguous memory region in compliance with + restrictions retrieved in 2). + 4) ntb_mw_set_trans(pidx, midx) - try to set translation address of + the memory window with specified index for the defined peer device + (it may fail if local translated address setting is not supported) + 5) Send translated base address (usually together with memory window + number) to the peer device using, for instance, scratchpad or message + registers. + Peer device: + 1) ntb_peer_mw_set_trans(pidx, midx) - try to set received from other + device (related to pidx) translated address for specified memory + window. It may fail if retrieved address, for instance, exceeds + maximum possible address or isn't properly aligned. + 2) ntb_peer_mw_get_addr(widx) - retrieve MMIO address to map the memory + window so to have an access to the shared memory. + +Also it is worth to note, that method ntb_mw_count(pidx) should return the +same value as ntb_peer_mw_count() on the peer with port index - pidx. + +NTB Transport Client (ntb\_transport) and NTB Netdev (ntb\_netdev) +------------------------------------------------------------------ The primary client for NTB is the Transport client, used in tandem with NTB Netdev. These drivers function together to create a logical link to the peer, @@ -37,7 +126,8 @@ Transport queue pair. Network data is copied between socket buffers and the Transport queue pair buffer. The Transport client may be used for other things besides Netdev, however no other applications have yet been written. -### NTB Ping Pong Test Client (ntb\_pingpong) +NTB Ping Pong Test Client (ntb\_pingpong) +----------------------------------------- The Ping Pong test client serves as a demonstration to exercise the doorbell and scratchpad registers of NTB hardware, and as an example simple NTB client. @@ -64,7 +154,8 @@ Module Parameters: * dyndbg - It is suggested to specify dyndbg=+p when loading this module, and then to observe debugging output on the console. -### NTB Tool Test Client (ntb\_tool) +NTB Tool Test Client (ntb\_tool) +-------------------------------- The Tool test client serves for debugging, primarily, ntb hardware and drivers. The Tool provides access through debugfs for reading, setting, and clearing the @@ -74,48 +165,60 @@ The Tool does not currently have any module parameters. Debugfs Files: -* *debugfs*/ntb\_tool/*hw*/ - A directory in debugfs will be created for each +* *debugfs*/ntb\_tool/*hw*/ + A directory in debugfs will be created for each NTB device probed by the tool. This directory is shortened to *hw* below. -* *hw*/db - This file is used to read, set, and clear the local doorbell. Not +* *hw*/db + This file is used to read, set, and clear the local doorbell. Not all operations may be supported by all hardware. To read the doorbell, read the file. To set the doorbell, write `s` followed by the bits to set (eg: `echo 's 0x0101' > db`). To clear the doorbell, write `c` followed by the bits to clear. -* *hw*/mask - This file is used to read, set, and clear the local doorbell mask. +* *hw*/mask + This file is used to read, set, and clear the local doorbell mask. See *db* for details. -* *hw*/peer\_db - This file is used to read, set, and clear the peer doorbell. +* *hw*/peer\_db + This file is used to read, set, and clear the peer doorbell. See *db* for details. -* *hw*/peer\_mask - This file is used to read, set, and clear the peer doorbell +* *hw*/peer\_mask + This file is used to read, set, and clear the peer doorbell mask. See *db* for details. -* *hw*/spad - This file is used to read and write local scratchpads. To read +* *hw*/spad + This file is used to read and write local scratchpads. To read the values of all scratchpads, read the file. To write values, write a series of pairs of scratchpad number and value (eg: `echo '4 0x123 7 0xabc' > spad` # to set scratchpads `4` and `7` to `0x123` and `0xabc`, respectively). -* *hw*/peer\_spad - This file is used to read and write peer scratchpads. See +* *hw*/peer\_spad + This file is used to read and write peer scratchpads. See *spad* for details. -## NTB Hardware Drivers +NTB Hardware Drivers +==================== NTB hardware drivers should register devices with the NTB core driver. After registering, clients probe and remove functions will be called. -### NTB Intel Hardware Driver (ntb\_hw\_intel) +NTB Intel Hardware Driver (ntb\_hw\_intel) +------------------------------------------ The Intel hardware driver supports NTB on Xeon and Atom CPUs. Module Parameters: -* b2b\_mw\_idx - If the peer ntb is to be accessed via a memory window, then use +* b2b\_mw\_idx + If the peer ntb is to be accessed via a memory window, then use this memory window to access the peer ntb. A value of zero or positive starts from the first mw idx, and a negative value starts from the last mw idx. Both sides MUST set the same value here! The default value is `-1`. -* b2b\_mw\_share - If the peer ntb is to be accessed via a memory window, and if +* b2b\_mw\_share + If the peer ntb is to be accessed via a memory window, and if the memory window is large enough, still allow the client to use the second half of the memory window for address translation to the peer. -* xeon\_b2b\_usd\_bar2\_addr64 - If using B2B topology on Xeon hardware, use +* xeon\_b2b\_usd\_bar2\_addr64 + If using B2B topology on Xeon hardware, use this 64 bit address on the bus between the NTB devices for the window at BAR2, on the upstream side of the link. * xeon\_b2b\_usd\_bar4\_addr64 - See *xeon\_b2b\_bar2\_addr64*. diff --git a/Documentation/numastat.txt b/Documentation/numastat.txt index 520327790d54..aaf1667489f8 100644 --- a/Documentation/numastat.txt +++ b/Documentation/numastat.txt @@ -1,10 +1,12 @@ - +=============================== Numa policy hit/miss statistics +=============================== /sys/devices/system/node/node*/numastat All units are pages. Hugepages have separate counters. +=============== ============================================================ numa_hit A process wanted to allocate memory from this node, and succeeded. @@ -20,6 +22,7 @@ other_node A process ran on this node and got memory from another node. interleave_hit Interleaving wanted to allocate from this node and succeeded. +=============== ============================================================ For easier reading you can use the numastat utility from the numactl package (http://oss.sgi.com/projects/libnuma/). Note that it only works diff --git a/Documentation/padata.txt b/Documentation/padata.txt index 7ddfe216a0aa..b103d0c82000 100644 --- a/Documentation/padata.txt +++ b/Documentation/padata.txt @@ -1,5 +1,8 @@ +======================================= The padata parallel execution mechanism -Last updated for 2.6.36 +======================================= + +:Last updated: for 2.6.36 Padata is a mechanism by which the kernel can farm work out to be done in parallel on multiple CPUs while retaining the ordering of tasks. It was @@ -9,7 +12,7 @@ those packets. The crypto developers made a point of writing padata in a sufficiently general fashion that it could be put to other uses as well. The first step in using padata is to set up a padata_instance structure for -overall control of how tasks are to be run: +overall control of how tasks are to be run:: #include <linux/padata.h> @@ -24,7 +27,7 @@ The workqueue wq is where the work will actually be done; it should be a multithreaded queue, naturally. To allocate a padata instance with the cpu_possible_mask for both -cpumasks this helper function can be used: +cpumasks this helper function can be used:: struct padata_instance *padata_alloc_possible(struct workqueue_struct *wq); @@ -36,7 +39,7 @@ it is legal to supply a cpumask to padata that contains offline CPUs. Once an offline CPU in the user supplied cpumask comes online, padata is going to use it. -There are functions for enabling and disabling the instance: +There are functions for enabling and disabling the instance:: int padata_start(struct padata_instance *pinst); void padata_stop(struct padata_instance *pinst); @@ -48,7 +51,7 @@ padata cpumask contains no active CPU (flag not set). padata_stop clears the flag and blocks until the padata instance is unused. -The list of CPUs to be used can be adjusted with these functions: +The list of CPUs to be used can be adjusted with these functions:: int padata_set_cpumasks(struct padata_instance *pinst, cpumask_var_t pcpumask, @@ -71,12 +74,12 @@ padata_add_cpu/padata_remove_cpu are used. cpu specifies the CPU to add or remove and mask is one of PADATA_CPU_SERIAL, PADATA_CPU_PARALLEL. If a user is interested in padata cpumask changes, he can register to -the padata cpumask change notifier: +the padata cpumask change notifier:: int padata_register_cpumask_notifier(struct padata_instance *pinst, struct notifier_block *nblock); -To unregister from that notifier: +To unregister from that notifier:: int padata_unregister_cpumask_notifier(struct padata_instance *pinst, struct notifier_block *nblock); @@ -84,7 +87,7 @@ To unregister from that notifier: The padata cpumask change notifier notifies about changes of the usable cpumasks, i.e. the subset of active CPUs in the user supplied cpumask. -Padata calls the notifier chain with: +Padata calls the notifier chain with:: blocking_notifier_call_chain(&pinst->cpumask_change_notifier, notification_mask, @@ -95,7 +98,7 @@ is one of PADATA_CPU_SERIAL, PADATA_CPU_PARALLEL and cpumask is a pointer to a struct padata_cpumask that contains the new cpumask information. Actually submitting work to the padata instance requires the creation of a -padata_priv structure: +padata_priv structure:: struct padata_priv { /* Other stuff here... */ @@ -110,7 +113,7 @@ parallel() and serial() functions should be provided. Those functions will be called in the process of getting the work done as we will see momentarily. -The submission of work is done with: +The submission of work is done with:: int padata_do_parallel(struct padata_instance *pinst, struct padata_priv *padata, int cb_cpu); @@ -138,7 +141,7 @@ need not be completed during this call, but, if parallel() leaves work outstanding, it should be prepared to be called again with a new job before the previous one completes. When a task does complete, parallel() (or whatever function actually finishes the job) should inform padata of the -fact with a call to: +fact with a call to:: void padata_do_serial(struct padata_priv *padata); @@ -151,7 +154,7 @@ pains to ensure that tasks are completed in the order in which they were submitted. The one remaining function in the padata API should be called to clean up -when a padata instance is no longer needed: +when a padata instance is no longer needed:: void padata_free(struct padata_instance *pinst); diff --git a/Documentation/parport-lowlevel.txt b/Documentation/parport-lowlevel.txt index 120eb20dbb09..0633d70ffda7 100644 --- a/Documentation/parport-lowlevel.txt +++ b/Documentation/parport-lowlevel.txt @@ -1,11 +1,12 @@ +=============================== PARPORT interface documentation -------------------------------- +=============================== -Time-stamp: <2000-02-24 13:30:20 twaugh> +:Time-stamp: <2000-02-24 13:30:20 twaugh> Described here are the following functions: -Global functions: +Global functions:: parport_register_driver parport_unregister_driver parport_enumerate @@ -31,7 +32,8 @@ Global functions: parport_set_timeout Port functions (can be overridden by low-level drivers): - SPP: + + SPP:: port->ops->read_data port->ops->write_data port->ops->read_status @@ -43,23 +45,23 @@ Port functions (can be overridden by low-level drivers): port->ops->data_forward port->ops->data_reverse - EPP: + EPP:: port->ops->epp_write_data port->ops->epp_read_data port->ops->epp_write_addr port->ops->epp_read_addr - ECP: + ECP:: port->ops->ecp_write_data port->ops->ecp_read_data port->ops->ecp_write_addr - Other: + Other:: port->ops->nibble_read_data port->ops->byte_read_data port->ops->compat_write_data -The parport subsystem comprises 'parport' (the core port-sharing +The parport subsystem comprises ``parport`` (the core port-sharing code), and a variety of low-level drivers that actually do the port accesses. Each low-level driver handles a particular style of port (PC, Amiga, and so on). @@ -70,14 +72,14 @@ into global functions and port functions. The global functions are mostly for communicating between the device driver and the parport subsystem: acquiring a list of available ports, claiming a port for exclusive use, and so on. They also include -'generic' functions for doing standard things that will work on any +``generic`` functions for doing standard things that will work on any IEEE 1284-capable architecture. The port functions are provided by the low-level drivers, although the -core parport module provides generic 'defaults' for some routines. +core parport module provides generic ``defaults`` for some routines. The port functions can be split into three groups: SPP, EPP, and ECP. -SPP (Standard Parallel Port) functions modify so-called 'SPP' +SPP (Standard Parallel Port) functions modify so-called ``SPP`` registers: data, status, and control. The hardware may not actually have registers exactly like that, but the PC does and this interface is modelled after common PC implementations. Other low-level drivers may @@ -95,58 +97,63 @@ to cope with peripherals that only tenuously support IEEE 1284, a low-level driver specific function is provided, for altering 'fudge factors'. -GLOBAL FUNCTIONS ----------------- +Global functions +================ parport_register_driver - register a device driver with parport ------------------------ +--------------------------------------------------------------- SYNOPSIS +^^^^^^^^ + +:: -#include <linux/parport.h> + #include <linux/parport.h> -struct parport_driver { - const char *name; - void (*attach) (struct parport *); - void (*detach) (struct parport *); - struct parport_driver *next; -}; -int parport_register_driver (struct parport_driver *driver); + struct parport_driver { + const char *name; + void (*attach) (struct parport *); + void (*detach) (struct parport *); + struct parport_driver *next; + }; + int parport_register_driver (struct parport_driver *driver); DESCRIPTION +^^^^^^^^^^^ In order to be notified about parallel ports when they are detected, parport_register_driver should be called. Your driver will immediately be notified of all ports that have already been detected, and of each new port as low-level drivers are loaded. -A 'struct parport_driver' contains the textual name of your driver, +A ``struct parport_driver`` contains the textual name of your driver, a pointer to a function to handle new ports, and a pointer to a function to handle ports going away due to a low-level driver unloading. Ports will only be detached if they are not being used (i.e. there are no devices registered on them). -The visible parts of the 'struct parport *' argument given to -attach/detach are: - -struct parport -{ - struct parport *next; /* next parport in list */ - const char *name; /* port's name */ - unsigned int modes; /* bitfield of hardware modes */ - struct parport_device_info probe_info; - /* IEEE1284 info */ - int number; /* parport index */ - struct parport_operations *ops; - ... -}; +The visible parts of the ``struct parport *`` argument given to +attach/detach are:: + + struct parport + { + struct parport *next; /* next parport in list */ + const char *name; /* port's name */ + unsigned int modes; /* bitfield of hardware modes */ + struct parport_device_info probe_info; + /* IEEE1284 info */ + int number; /* parport index */ + struct parport_operations *ops; + ... + }; There are other members of the structure, but they should not be touched. -The 'modes' member summarises the capabilities of the underlying +The ``modes`` member summarises the capabilities of the underlying hardware. It consists of flags which may be bitwise-ored together: + ============================= =============================================== PARPORT_MODE_PCSPP IBM PC registers are available, i.e. functions that act on data, control and status registers are @@ -169,297 +176,351 @@ hardware. It consists of flags which may be bitwise-ored together: GFP_DMA flag with kmalloc) to the low-level driver in order to take advantage of it. + ============================= =============================================== -There may be other flags in 'modes' as well. +There may be other flags in ``modes`` as well. -The contents of 'modes' is advisory only. For example, if the -hardware is capable of DMA, and PARPORT_MODE_DMA is in 'modes', it +The contents of ``modes`` is advisory only. For example, if the +hardware is capable of DMA, and PARPORT_MODE_DMA is in ``modes``, it doesn't necessarily mean that DMA will always be used when possible. Similarly, hardware that is capable of assisting ECP transfers won't necessarily be used. RETURN VALUE +^^^^^^^^^^^^ Zero on success, otherwise an error code. ERRORS +^^^^^^ None. (Can it fail? Why return int?) EXAMPLE +^^^^^^^ -static void lp_attach (struct parport *port) -{ - ... - private = kmalloc (...); - dev[count++] = parport_register_device (...); - ... -} +:: -static void lp_detach (struct parport *port) -{ - ... -} + static void lp_attach (struct parport *port) + { + ... + private = kmalloc (...); + dev[count++] = parport_register_device (...); + ... + } -static struct parport_driver lp_driver = { - "lp", - lp_attach, - lp_detach, - NULL /* always put NULL here */ -}; + static void lp_detach (struct parport *port) + { + ... + } -int lp_init (void) -{ - ... - if (parport_register_driver (&lp_driver)) { - /* Failed; nothing we can do. */ - return -EIO; + static struct parport_driver lp_driver = { + "lp", + lp_attach, + lp_detach, + NULL /* always put NULL here */ + }; + + int lp_init (void) + { + ... + if (parport_register_driver (&lp_driver)) { + /* Failed; nothing we can do. */ + return -EIO; + } + ... } - ... -} + SEE ALSO +^^^^^^^^ parport_unregister_driver, parport_register_device, parport_enumerate - + + + parport_unregister_driver - tell parport to forget about this driver -------------------------- +-------------------------------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_driver { - const char *name; - void (*attach) (struct parport *); - void (*detach) (struct parport *); - struct parport_driver *next; -}; -void parport_unregister_driver (struct parport_driver *driver); + #include <linux/parport.h> + + struct parport_driver { + const char *name; + void (*attach) (struct parport *); + void (*detach) (struct parport *); + struct parport_driver *next; + }; + void parport_unregister_driver (struct parport_driver *driver); DESCRIPTION +^^^^^^^^^^^ This tells parport not to notify the device driver of new ports or of ports going away. Registered devices belonging to that driver are NOT unregistered: parport_unregister_device must be used for each one. EXAMPLE +^^^^^^^ -void cleanup_module (void) -{ - ... - /* Stop notifications. */ - parport_unregister_driver (&lp_driver); +:: - /* Unregister devices. */ - for (i = 0; i < NUM_DEVS; i++) - parport_unregister_device (dev[i]); - ... -} + void cleanup_module (void) + { + ... + /* Stop notifications. */ + parport_unregister_driver (&lp_driver); + + /* Unregister devices. */ + for (i = 0; i < NUM_DEVS; i++) + parport_unregister_device (dev[i]); + ... + } SEE ALSO +^^^^^^^^ parport_register_driver, parport_enumerate - + + + parport_enumerate - retrieve a list of parallel ports (DEPRECATED) ------------------ +------------------------------------------------------------------ SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport *parport_enumerate (void); + #include <linux/parport.h> + + struct parport *parport_enumerate (void); DESCRIPTION +^^^^^^^^^^^ Retrieve the first of a list of valid parallel ports for this machine. -Successive parallel ports can be found using the 'struct parport -*next' element of the 'struct parport *' that is returned. If 'next' +Successive parallel ports can be found using the ``struct parport +*next`` element of the ``struct parport *`` that is returned. If ``next`` is NULL, there are no more parallel ports in the list. The number of ports in the list will not exceed PARPORT_MAX. RETURN VALUE +^^^^^^^^^^^^ -A 'struct parport *' describing a valid parallel port for the machine, +A ``struct parport *`` describing a valid parallel port for the machine, or NULL if there are none. ERRORS +^^^^^^ This function can return NULL to indicate that there are no parallel ports to use. EXAMPLE +^^^^^^^ + +:: -int detect_device (void) -{ - struct parport *port; + int detect_device (void) + { + struct parport *port; + + for (port = parport_enumerate (); + port != NULL; + port = port->next) { + /* Try to detect a device on the port... */ + ... + } + } - for (port = parport_enumerate (); - port != NULL; - port = port->next) { - /* Try to detect a device on the port... */ ... - } } - ... -} - NOTES +^^^^^ parport_enumerate is deprecated; parport_register_driver should be used instead. SEE ALSO +^^^^^^^^ parport_register_driver, parport_unregister_driver - + + + parport_register_device - register to use a port ------------------------ +------------------------------------------------ SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -typedef int (*preempt_func) (void *handle); -typedef void (*wakeup_func) (void *handle); -typedef int (*irq_func) (int irq, void *handle, struct pt_regs *); + #include <linux/parport.h> -struct pardevice *parport_register_device(struct parport *port, - const char *name, - preempt_func preempt, - wakeup_func wakeup, - irq_func irq, - int flags, - void *handle); + typedef int (*preempt_func) (void *handle); + typedef void (*wakeup_func) (void *handle); + typedef int (*irq_func) (int irq, void *handle, struct pt_regs *); + + struct pardevice *parport_register_device(struct parport *port, + const char *name, + preempt_func preempt, + wakeup_func wakeup, + irq_func irq, + int flags, + void *handle); DESCRIPTION +^^^^^^^^^^^ Use this function to register your device driver on a parallel port -('port'). Once you have done that, you will be able to use +(``port``). Once you have done that, you will be able to use parport_claim and parport_release in order to use the port. -The ('name') argument is the name of the device that appears in /proc +The (``name``) argument is the name of the device that appears in /proc filesystem. The string must be valid for the whole lifetime of the device (until parport_unregister_device is called). This function will register three callbacks into your driver: -'preempt', 'wakeup' and 'irq'. Each of these may be NULL in order to +``preempt``, ``wakeup`` and ``irq``. Each of these may be NULL in order to indicate that you do not want a callback. -When the 'preempt' function is called, it is because another driver -wishes to use the parallel port. The 'preempt' function should return +When the ``preempt`` function is called, it is because another driver +wishes to use the parallel port. The ``preempt`` function should return non-zero if the parallel port cannot be released yet -- if zero is returned, the port is lost to another driver and the port must be re-claimed before use. -The 'wakeup' function is called once another driver has released the +The ``wakeup`` function is called once another driver has released the port and no other driver has yet claimed it. You can claim the -parallel port from within the 'wakeup' function (in which case the +parallel port from within the ``wakeup`` function (in which case the claim is guaranteed to succeed), or choose not to if you don't need it now. If an interrupt occurs on the parallel port your driver has claimed, -the 'irq' function will be called. (Write something about shared +the ``irq`` function will be called. (Write something about shared interrupts here.) -The 'handle' is a pointer to driver-specific data, and is passed to +The ``handle`` is a pointer to driver-specific data, and is passed to the callback functions. -'flags' may be a bitwise combination of the following flags: +``flags`` may be a bitwise combination of the following flags: + ===================== ================================================= Flag Meaning + ===================== ================================================= PARPORT_DEV_EXCL The device cannot share the parallel port at all. Use this only when absolutely necessary. + ===================== ================================================= The typedefs are not actually defined -- they are only shown in order to make the function prototype more readable. -The visible parts of the returned 'struct pardevice' are: +The visible parts of the returned ``struct pardevice`` are:: -struct pardevice { - struct parport *port; /* Associated port */ - void *private; /* Device driver's 'handle' */ - ... -}; + struct pardevice { + struct parport *port; /* Associated port */ + void *private; /* Device driver's 'handle' */ + ... + }; RETURN VALUE +^^^^^^^^^^^^ -A 'struct pardevice *': a handle to the registered parallel port +A ``struct pardevice *``: a handle to the registered parallel port device that can be used for parport_claim, parport_release, etc. ERRORS +^^^^^^ A return value of NULL indicates that there was a problem registering a device on that port. EXAMPLE +^^^^^^^ + +:: + + static int preempt (void *handle) + { + if (busy_right_now) + return 1; + + must_reclaim_port = 1; + return 0; + } + + static void wakeup (void *handle) + { + struct toaster *private = handle; + struct pardevice *dev = private->dev; + if (!dev) return; /* avoid races */ + + if (want_port) + parport_claim (dev); + } + + static int toaster_detect (struct toaster *private, struct parport *port) + { + private->dev = parport_register_device (port, "toaster", preempt, + wakeup, NULL, 0, + private); + if (!private->dev) + /* Couldn't register with parport. */ + return -EIO; -static int preempt (void *handle) -{ - if (busy_right_now) - return 1; - - must_reclaim_port = 1; - return 0; -} - -static void wakeup (void *handle) -{ - struct toaster *private = handle; - struct pardevice *dev = private->dev; - if (!dev) return; /* avoid races */ - - if (want_port) - parport_claim (dev); -} - -static int toaster_detect (struct toaster *private, struct parport *port) -{ - private->dev = parport_register_device (port, "toaster", preempt, - wakeup, NULL, 0, - private); - if (!private->dev) - /* Couldn't register with parport. */ - return -EIO; - - must_reclaim_port = 0; - busy_right_now = 1; - parport_claim_or_block (private->dev); - ... - /* Don't need the port while the toaster warms up. */ - busy_right_now = 0; - ... - busy_right_now = 1; - if (must_reclaim_port) { - parport_claim_or_block (private->dev); must_reclaim_port = 0; + busy_right_now = 1; + parport_claim_or_block (private->dev); + ... + /* Don't need the port while the toaster warms up. */ + busy_right_now = 0; + ... + busy_right_now = 1; + if (must_reclaim_port) { + parport_claim_or_block (private->dev); + must_reclaim_port = 0; + } + ... } - ... -} SEE ALSO +^^^^^^^^ parport_unregister_device, parport_claim + + parport_unregister_device - finish using a port -------------------------- +----------------------------------------------- SYNPOPSIS -#include <linux/parport.h> +:: + + #include <linux/parport.h> -void parport_unregister_device (struct pardevice *dev); + void parport_unregister_device (struct pardevice *dev); DESCRIPTION +^^^^^^^^^^^ This function is the opposite of parport_register_device. After using -parport_unregister_device, 'dev' is no longer a valid device handle. +parport_unregister_device, ``dev`` is no longer a valid device handle. You should not unregister a device that is currently claimed, although if you do it will be released automatically. EXAMPLE +^^^^^^^ + +:: ... kfree (dev->private); /* before we lose the pointer */ @@ -467,460 +528,602 @@ EXAMPLE ... SEE ALSO +^^^^^^^^ + parport_unregister_driver parport_claim, parport_claim_or_block - claim the parallel port for a device -------------------------------------- +---------------------------------------------------------------------------- SYNOPSIS +^^^^^^^^ + +:: -#include <linux/parport.h> + #include <linux/parport.h> -int parport_claim (struct pardevice *dev); -int parport_claim_or_block (struct pardevice *dev); + int parport_claim (struct pardevice *dev); + int parport_claim_or_block (struct pardevice *dev); DESCRIPTION +^^^^^^^^^^^ These functions attempt to gain control of the parallel port on which -'dev' is registered. 'parport_claim' does not block, but -'parport_claim_or_block' may do. (Put something here about blocking +``dev`` is registered. ``parport_claim`` does not block, but +``parport_claim_or_block`` may do. (Put something here about blocking interruptibly or non-interruptibly.) You should not try to claim a port that you have already claimed. RETURN VALUE +^^^^^^^^^^^^ A return value of zero indicates that the port was successfully claimed, and the caller now has possession of the parallel port. -If 'parport_claim_or_block' blocks before returning successfully, the +If ``parport_claim_or_block`` blocks before returning successfully, the return value is positive. ERRORS +^^^^^^ +========== ========================================================== -EAGAIN The port is unavailable at the moment, but another attempt to claim it may succeed. +========== ========================================================== SEE ALSO +^^^^^^^^ + parport_release parport_release - release the parallel port ---------------- +------------------------------------------- SYNOPSIS +^^^^^^^^ + +:: -#include <linux/parport.h> + #include <linux/parport.h> -void parport_release (struct pardevice *dev); + void parport_release (struct pardevice *dev); DESCRIPTION +^^^^^^^^^^^ Once a parallel port device has been claimed, it can be released using -'parport_release'. It cannot fail, but you should not release a +``parport_release``. It cannot fail, but you should not release a device that you do not have possession of. EXAMPLE +^^^^^^^ -static size_t write (struct pardevice *dev, const void *buf, - size_t len) -{ - ... - written = dev->port->ops->write_ecp_data (dev->port, buf, - len); - parport_release (dev); - ... -} +:: + + static size_t write (struct pardevice *dev, const void *buf, + size_t len) + { + ... + written = dev->port->ops->write_ecp_data (dev->port, buf, + len); + parport_release (dev); + ... + } SEE ALSO +^^^^^^^^ change_mode, parport_claim, parport_claim_or_block, parport_yield - + + + parport_yield, parport_yield_blocking - temporarily release a parallel port -------------------------------------- +--------------------------------------------------------------------------- SYNOPSIS +^^^^^^^^ + +:: -#include <linux/parport.h> + #include <linux/parport.h> -int parport_yield (struct pardevice *dev) -int parport_yield_blocking (struct pardevice *dev); + int parport_yield (struct pardevice *dev) + int parport_yield_blocking (struct pardevice *dev); DESCRIPTION +^^^^^^^^^^^ When a driver has control of a parallel port, it may allow another -driver to temporarily 'borrow' it. 'parport_yield' does not block; -'parport_yield_blocking' may do. +driver to temporarily ``borrow`` it. ``parport_yield`` does not block; +``parport_yield_blocking`` may do. RETURN VALUE +^^^^^^^^^^^^ A return value of zero indicates that the caller still owns the port and the call did not block. -A positive return value from 'parport_yield_blocking' indicates that +A positive return value from ``parport_yield_blocking`` indicates that the caller still owns the port and the call blocked. A return value of -EAGAIN indicates that the caller no longer owns the port, and it must be re-claimed before use. ERRORS +^^^^^^ +========= ========================================================== -EAGAIN Ownership of the parallel port was given away. +========= ========================================================== SEE ALSO +^^^^^^^^ parport_release + + parport_wait_peripheral - wait for status lines, up to 35ms ------------------------ +----------------------------------------------------------- SYNOPSIS +^^^^^^^^ + +:: -#include <linux/parport.h> + #include <linux/parport.h> -int parport_wait_peripheral (struct parport *port, - unsigned char mask, - unsigned char val); + int parport_wait_peripheral (struct parport *port, + unsigned char mask, + unsigned char val); DESCRIPTION +^^^^^^^^^^^ Wait for the status lines in mask to match the values in val. RETURN VALUE +^^^^^^^^^^^^ +======== ========================================================== -EINTR a signal is pending 0 the status lines in mask have values in val 1 timed out while waiting (35ms elapsed) +======== ========================================================== SEE ALSO +^^^^^^^^ parport_poll_peripheral + + parport_poll_peripheral - wait for status lines, in usec ------------------------ +-------------------------------------------------------- SYNOPSIS +^^^^^^^^ + +:: -#include <linux/parport.h> + #include <linux/parport.h> -int parport_poll_peripheral (struct parport *port, - unsigned char mask, - unsigned char val, - int usec); + int parport_poll_peripheral (struct parport *port, + unsigned char mask, + unsigned char val, + int usec); DESCRIPTION +^^^^^^^^^^^ Wait for the status lines in mask to match the values in val. RETURN VALUE +^^^^^^^^^^^^ +======== ========================================================== -EINTR a signal is pending 0 the status lines in mask have values in val 1 timed out while waiting (usec microseconds have elapsed) +======== ========================================================== SEE ALSO +^^^^^^^^ parport_wait_peripheral - + + + parport_wait_event - wait for an event on a port ------------------- +------------------------------------------------ SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -int parport_wait_event (struct parport *port, signed long timeout) + #include <linux/parport.h> + + int parport_wait_event (struct parport *port, signed long timeout) DESCRIPTION +^^^^^^^^^^^ Wait for an event (e.g. interrupt) on a port. The timeout is in jiffies. RETURN VALUE +^^^^^^^^^^^^ +======= ========================================================== 0 success <0 error (exit as soon as possible) >0 timed out - +======= ========================================================== + parport_negotiate - perform IEEE 1284 negotiation ------------------ +------------------------------------------------- SYNOPSIS +^^^^^^^^ + +:: -#include <linux/parport.h> + #include <linux/parport.h> -int parport_negotiate (struct parport *, int mode); + int parport_negotiate (struct parport *, int mode); DESCRIPTION +^^^^^^^^^^^ Perform IEEE 1284 negotiation. RETURN VALUE +^^^^^^^^^^^^ +======= ========================================================== 0 handshake OK; IEEE 1284 peripheral and mode available -1 handshake failed; peripheral not compliant (or none present) 1 handshake OK; IEEE 1284 peripheral present but mode not available +======= ========================================================== SEE ALSO +^^^^^^^^ parport_read, parport_write - + + + parport_read - read data from device ------------- +------------------------------------ SYNOPSIS +^^^^^^^^ + +:: -#include <linux/parport.h> + #include <linux/parport.h> -ssize_t parport_read (struct parport *, void *buf, size_t len); + ssize_t parport_read (struct parport *, void *buf, size_t len); DESCRIPTION +^^^^^^^^^^^ Read data from device in current IEEE 1284 transfer mode. This only works for modes that support reverse data transfer. RETURN VALUE +^^^^^^^^^^^^ If negative, an error code; otherwise the number of bytes transferred. SEE ALSO +^^^^^^^^ parport_write, parport_negotiate - + + + parport_write - write data to device -------------- +------------------------------------ SYNOPSIS +^^^^^^^^ + +:: -#include <linux/parport.h> + #include <linux/parport.h> -ssize_t parport_write (struct parport *, const void *buf, size_t len); + ssize_t parport_write (struct parport *, const void *buf, size_t len); DESCRIPTION +^^^^^^^^^^^ Write data to device in current IEEE 1284 transfer mode. This only works for modes that support forward data transfer. RETURN VALUE +^^^^^^^^^^^^ If negative, an error code; otherwise the number of bytes transferred. SEE ALSO +^^^^^^^^ parport_read, parport_negotiate + + parport_open - register device for particular device number ------------- +----------------------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct pardevice *parport_open (int devnum, const char *name, - int (*pf) (void *), - void (*kf) (void *), - void (*irqf) (int, void *, - struct pt_regs *), - int flags, void *handle); + #include <linux/parport.h> + + struct pardevice *parport_open (int devnum, const char *name, + int (*pf) (void *), + void (*kf) (void *), + void (*irqf) (int, void *, + struct pt_regs *), + int flags, void *handle); DESCRIPTION +^^^^^^^^^^^ This is like parport_register_device but takes a device number instead of a pointer to a struct parport. RETURN VALUE +^^^^^^^^^^^^ See parport_register_device. If no device is associated with devnum, NULL is returned. SEE ALSO +^^^^^^^^ parport_register_device - + + + parport_close - unregister device for particular device number -------------- +-------------------------------------------------------------- SYNOPSIS +^^^^^^^^ + +:: -#include <linux/parport.h> + #include <linux/parport.h> -void parport_close (struct pardevice *dev); + void parport_close (struct pardevice *dev); DESCRIPTION +^^^^^^^^^^^ This is the equivalent of parport_unregister_device for parport_open. SEE ALSO +^^^^^^^^ parport_unregister_device, parport_open - + + + parport_device_id - obtain IEEE 1284 Device ID ------------------ +---------------------------------------------- SYNOPSIS +^^^^^^^^ + +:: -#include <linux/parport.h> + #include <linux/parport.h> -ssize_t parport_device_id (int devnum, char *buffer, size_t len); + ssize_t parport_device_id (int devnum, char *buffer, size_t len); DESCRIPTION +^^^^^^^^^^^ Obtains the IEEE 1284 Device ID associated with a given device. RETURN VALUE +^^^^^^^^^^^^ If negative, an error code; otherwise, the number of bytes of buffer that contain the device ID. The format of the device ID is as -follows: +follows:: -[length][ID] + [length][ID] The first two bytes indicate the inclusive length of the entire Device ID, and are in big-endian order. The ID is a sequence of pairs of the -form: +form:: -key:value; + key:value; NOTES +^^^^^ Many devices have ill-formed IEEE 1284 Device IDs. SEE ALSO +^^^^^^^^ parport_find_class, parport_find_device - + + + parport_device_coords - convert device number to device coordinates ------------------- +------------------------------------------------------------------- SYNOPSIS +^^^^^^^^ + +:: -#include <linux/parport.h> + #include <linux/parport.h> -int parport_device_coords (int devnum, int *parport, int *mux, - int *daisy); + int parport_device_coords (int devnum, int *parport, int *mux, + int *daisy); DESCRIPTION +^^^^^^^^^^^ Convert between device number (zero-based) and device coordinates (port, multiplexor, daisy chain address). RETURN VALUE +^^^^^^^^^^^^ -Zero on success, in which case the coordinates are (*parport, *mux, -*daisy). +Zero on success, in which case the coordinates are (``*parport``, ``*mux``, +``*daisy``). SEE ALSO +^^^^^^^^ parport_open, parport_device_id - + + + parport_find_class - find a device by its class ------------------- +----------------------------------------------- SYNOPSIS - -#include <linux/parport.h> - -typedef enum { - PARPORT_CLASS_LEGACY = 0, /* Non-IEEE1284 device */ - PARPORT_CLASS_PRINTER, - PARPORT_CLASS_MODEM, - PARPORT_CLASS_NET, - PARPORT_CLASS_HDC, /* Hard disk controller */ - PARPORT_CLASS_PCMCIA, - PARPORT_CLASS_MEDIA, /* Multimedia device */ - PARPORT_CLASS_FDC, /* Floppy disk controller */ - PARPORT_CLASS_PORTS, - PARPORT_CLASS_SCANNER, - PARPORT_CLASS_DIGCAM, - PARPORT_CLASS_OTHER, /* Anything else */ - PARPORT_CLASS_UNSPEC, /* No CLS field in ID */ - PARPORT_CLASS_SCSIADAPTER -} parport_device_class; - -int parport_find_class (parport_device_class cls, int from); +^^^^^^^^ + +:: + + #include <linux/parport.h> + + typedef enum { + PARPORT_CLASS_LEGACY = 0, /* Non-IEEE1284 device */ + PARPORT_CLASS_PRINTER, + PARPORT_CLASS_MODEM, + PARPORT_CLASS_NET, + PARPORT_CLASS_HDC, /* Hard disk controller */ + PARPORT_CLASS_PCMCIA, + PARPORT_CLASS_MEDIA, /* Multimedia device */ + PARPORT_CLASS_FDC, /* Floppy disk controller */ + PARPORT_CLASS_PORTS, + PARPORT_CLASS_SCANNER, + PARPORT_CLASS_DIGCAM, + PARPORT_CLASS_OTHER, /* Anything else */ + PARPORT_CLASS_UNSPEC, /* No CLS field in ID */ + PARPORT_CLASS_SCSIADAPTER + } parport_device_class; + + int parport_find_class (parport_device_class cls, int from); DESCRIPTION +^^^^^^^^^^^ Find a device by class. The search starts from device number from+1. RETURN VALUE +^^^^^^^^^^^^ The device number of the next device in that class, or -1 if no such device exists. NOTES +^^^^^ -Example usage: +Example usage:: -int devnum = -1; -while ((devnum = parport_find_class (PARPORT_CLASS_DIGCAM, devnum)) != -1) { - struct pardevice *dev = parport_open (devnum, ...); - ... -} + int devnum = -1; + while ((devnum = parport_find_class (PARPORT_CLASS_DIGCAM, devnum)) != -1) { + struct pardevice *dev = parport_open (devnum, ...); + ... + } SEE ALSO +^^^^^^^^ parport_find_device, parport_open, parport_device_id - + + + parport_find_device - find a device by its class ------------------- +------------------------------------------------ SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -int parport_find_device (const char *mfg, const char *mdl, int from); + #include <linux/parport.h> + + int parport_find_device (const char *mfg, const char *mdl, int from); DESCRIPTION +^^^^^^^^^^^ Find a device by vendor and model. The search starts from device number from+1. RETURN VALUE +^^^^^^^^^^^^ The device number of the next device matching the specifications, or -1 if no such device exists. NOTES +^^^^^ -Example usage: +Example usage:: -int devnum = -1; -while ((devnum = parport_find_device ("IOMEGA", "ZIP+", devnum)) != -1) { - struct pardevice *dev = parport_open (devnum, ...); - ... -} + int devnum = -1; + while ((devnum = parport_find_device ("IOMEGA", "ZIP+", devnum)) != -1) { + struct pardevice *dev = parport_open (devnum, ...); + ... + } SEE ALSO +^^^^^^^^ parport_find_class, parport_open, parport_device_id + + parport_set_timeout - set the inactivity timeout -------------------- +------------------------------------------------ SYNOPSIS +^^^^^^^^ + +:: -#include <linux/parport.h> + #include <linux/parport.h> -long parport_set_timeout (struct pardevice *dev, long inactivity); + long parport_set_timeout (struct pardevice *dev, long inactivity); DESCRIPTION +^^^^^^^^^^^ Set the inactivity timeout, in jiffies, for a registered device. The previous timeout is returned. RETURN VALUE +^^^^^^^^^^^^ The previous timeout, in jiffies. NOTES +^^^^^ Some of the port->ops functions for a parport may take time, owing to delays at the peripheral. After the peripheral has not responded for -'inactivity' jiffies, a timeout will occur and the blocking function +``inactivity`` jiffies, a timeout will occur and the blocking function will return. A timeout of 0 jiffies is a special case: the function must do as much @@ -932,29 +1135,37 @@ Once set for a registered device, the timeout will remain at the set value until set again. SEE ALSO +^^^^^^^^ port->ops->xxx_read/write_yyy - + + + + PORT FUNCTIONS --------------- +============== The functions in the port->ops structure (struct parport_operations) are provided by the low-level driver responsible for that port. port->ops->read_data - read the data register --------------------- +--------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - unsigned char (*read_data) (struct parport *port); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + unsigned char (*read_data) (struct parport *port); + ... + }; DESCRIPTION +^^^^^^^^^^^ If port->modes contains the PARPORT_MODE_TRISTATE flag and the PARPORT_CONTROL_DIRECTION bit in the control register is set, this @@ -964,45 +1175,59 @@ not set, the return value _may_ be the last value written to the data register. Otherwise the return value is undefined. SEE ALSO +^^^^^^^^ write_data, read_status, write_control + + port->ops->write_data - write the data register ---------------------- +----------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - void (*write_data) (struct parport *port, unsigned char d); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + void (*write_data) (struct parport *port, unsigned char d); + ... + }; DESCRIPTION +^^^^^^^^^^^ Writes to the data register. May have side-effects (a STROBE pulse, for instance). SEE ALSO +^^^^^^^^ read_data, read_status, write_control + + port->ops->read_status - read the status register ----------------------- +------------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - unsigned char (*read_status) (struct parport *port); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + unsigned char (*read_status) (struct parport *port); + ... + }; DESCRIPTION +^^^^^^^^^^^ Reads from the status register. This is a bitmask: @@ -1015,76 +1240,98 @@ Reads from the status register. This is a bitmask: There may be other bits set. SEE ALSO +^^^^^^^^ read_data, write_data, write_control + + port->ops->read_control - read the control register ------------------------ +--------------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - unsigned char (*read_control) (struct parport *port); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + unsigned char (*read_control) (struct parport *port); + ... + }; DESCRIPTION +^^^^^^^^^^^ Returns the last value written to the control register (either from write_control or frob_control). No port access is performed. SEE ALSO +^^^^^^^^ read_data, write_data, read_status, write_control + + port->ops->write_control - write the control register ------------------------- +----------------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - void (*write_control) (struct parport *port, unsigned char s); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + void (*write_control) (struct parport *port, unsigned char s); + ... + }; DESCRIPTION +^^^^^^^^^^^ -Writes to the control register. This is a bitmask: - _______ -- PARPORT_CONTROL_STROBE (nStrobe) - _______ -- PARPORT_CONTROL_AUTOFD (nAutoFd) - _____ -- PARPORT_CONTROL_INIT (nInit) - _________ -- PARPORT_CONTROL_SELECT (nSelectIn) +Writes to the control register. This is a bitmask:: + + _______ + - PARPORT_CONTROL_STROBE (nStrobe) + _______ + - PARPORT_CONTROL_AUTOFD (nAutoFd) + _____ + - PARPORT_CONTROL_INIT (nInit) + _________ + - PARPORT_CONTROL_SELECT (nSelectIn) SEE ALSO +^^^^^^^^ read_data, write_data, read_status, frob_control + + port->ops->frob_control - write control register bits ------------------------ +----------------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - unsigned char (*frob_control) (struct parport *port, - unsigned char mask, - unsigned char val); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + unsigned char (*frob_control) (struct parport *port, + unsigned char mask, + unsigned char val); + ... + }; DESCRIPTION +^^^^^^^^^^^ This is equivalent to reading from the control register, masking out the bits in mask, exclusive-or'ing with the bits in val, and writing @@ -1095,23 +1342,30 @@ of its contents is maintained, so frob_control is in fact only one port access. SEE ALSO +^^^^^^^^ read_data, write_data, read_status, write_control + + port->ops->enable_irq - enable interrupt generation ---------------------- +--------------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - void (*enable_irq) (struct parport *port); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + void (*enable_irq) (struct parport *port); + ... + }; DESCRIPTION +^^^^^^^^^^^ The parallel port hardware is instructed to generate interrupts at appropriate moments, although those moments are @@ -1119,353 +1373,460 @@ architecture-specific. For the PC architecture, interrupts are commonly generated on the rising edge of nAck. SEE ALSO +^^^^^^^^ disable_irq + + port->ops->disable_irq - disable interrupt generation ----------------------- +----------------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - void (*disable_irq) (struct parport *port); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + void (*disable_irq) (struct parport *port); + ... + }; DESCRIPTION +^^^^^^^^^^^ The parallel port hardware is instructed not to generate interrupts. The interrupt itself is not masked. SEE ALSO +^^^^^^^^ enable_irq + + port->ops->data_forward - enable data drivers ------------------------ +--------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - void (*data_forward) (struct parport *port); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + void (*data_forward) (struct parport *port); + ... + }; DESCRIPTION +^^^^^^^^^^^ Enables the data line drivers, for 8-bit host-to-peripheral communications. SEE ALSO +^^^^^^^^ data_reverse + + port->ops->data_reverse - tristate the buffer ------------------------ +--------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - void (*data_reverse) (struct parport *port); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + void (*data_reverse) (struct parport *port); + ... + }; DESCRIPTION +^^^^^^^^^^^ Places the data bus in a high impedance state, if port->modes has the PARPORT_MODE_TRISTATE bit set. SEE ALSO +^^^^^^^^ data_forward - + + + port->ops->epp_write_data - write EPP data -------------------------- +------------------------------------------ SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - size_t (*epp_write_data) (struct parport *port, const void *buf, - size_t len, int flags); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + size_t (*epp_write_data) (struct parport *port, const void *buf, + size_t len, int flags); + ... + }; DESCRIPTION +^^^^^^^^^^^ Writes data in EPP mode, and returns the number of bytes written. -The 'flags' parameter may be one or more of the following, +The ``flags`` parameter may be one or more of the following, bitwise-or'ed together: +======================= ================================================= PARPORT_EPP_FAST Use fast transfers. Some chips provide 16-bit and 32-bit registers. However, if a transfer times out, the return value may be unreliable. +======================= ================================================= SEE ALSO +^^^^^^^^ epp_read_data, epp_write_addr, epp_read_addr + + port->ops->epp_read_data - read EPP data ------------------------- +---------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - size_t (*epp_read_data) (struct parport *port, void *buf, - size_t len, int flags); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + size_t (*epp_read_data) (struct parport *port, void *buf, + size_t len, int flags); + ... + }; DESCRIPTION +^^^^^^^^^^^ Reads data in EPP mode, and returns the number of bytes read. -The 'flags' parameter may be one or more of the following, +The ``flags`` parameter may be one or more of the following, bitwise-or'ed together: +======================= ================================================= PARPORT_EPP_FAST Use fast transfers. Some chips provide 16-bit and 32-bit registers. However, if a transfer times out, the return value may be unreliable. +======================= ================================================= SEE ALSO +^^^^^^^^ epp_write_data, epp_write_addr, epp_read_addr - + + + port->ops->epp_write_addr - write EPP address -------------------------- +--------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - size_t (*epp_write_addr) (struct parport *port, - const void *buf, size_t len, int flags); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + size_t (*epp_write_addr) (struct parport *port, + const void *buf, size_t len, int flags); + ... + }; DESCRIPTION +^^^^^^^^^^^ Writes EPP addresses (8 bits each), and returns the number written. -The 'flags' parameter may be one or more of the following, +The ``flags`` parameter may be one or more of the following, bitwise-or'ed together: +======================= ================================================= PARPORT_EPP_FAST Use fast transfers. Some chips provide 16-bit and 32-bit registers. However, if a transfer times out, the return value may be unreliable. +======================= ================================================= (Does PARPORT_EPP_FAST make sense for this function?) SEE ALSO +^^^^^^^^ epp_write_data, epp_read_data, epp_read_addr + + port->ops->epp_read_addr - read EPP address ------------------------- +------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - size_t (*epp_read_addr) (struct parport *port, void *buf, - size_t len, int flags); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + size_t (*epp_read_addr) (struct parport *port, void *buf, + size_t len, int flags); + ... + }; DESCRIPTION +^^^^^^^^^^^ Reads EPP addresses (8 bits each), and returns the number read. -The 'flags' parameter may be one or more of the following, +The ``flags`` parameter may be one or more of the following, bitwise-or'ed together: +======================= ================================================= PARPORT_EPP_FAST Use fast transfers. Some chips provide 16-bit and 32-bit registers. However, if a transfer times out, the return value may be unreliable. +======================= ================================================= (Does PARPORT_EPP_FAST make sense for this function?) SEE ALSO +^^^^^^^^ epp_write_data, epp_read_data, epp_write_addr + + port->ops->ecp_write_data - write a block of ECP data -------------------------- +----------------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - size_t (*ecp_write_data) (struct parport *port, - const void *buf, size_t len, int flags); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + size_t (*ecp_write_data) (struct parport *port, + const void *buf, size_t len, int flags); + ... + }; DESCRIPTION +^^^^^^^^^^^ -Writes a block of ECP data. The 'flags' parameter is ignored. +Writes a block of ECP data. The ``flags`` parameter is ignored. RETURN VALUE +^^^^^^^^^^^^ The number of bytes written. SEE ALSO +^^^^^^^^ ecp_read_data, ecp_write_addr + + port->ops->ecp_read_data - read a block of ECP data ------------------------- +--------------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - size_t (*ecp_read_data) (struct parport *port, - void *buf, size_t len, int flags); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + size_t (*ecp_read_data) (struct parport *port, + void *buf, size_t len, int flags); + ... + }; DESCRIPTION +^^^^^^^^^^^ -Reads a block of ECP data. The 'flags' parameter is ignored. +Reads a block of ECP data. The ``flags`` parameter is ignored. RETURN VALUE +^^^^^^^^^^^^ The number of bytes read. NB. There may be more unread data in a FIFO. Is there a way of stunning the FIFO to prevent this? SEE ALSO +^^^^^^^^ ecp_write_block, ecp_write_addr - + + + port->ops->ecp_write_addr - write a block of ECP addresses -------------------------- +---------------------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - size_t (*ecp_write_addr) (struct parport *port, - const void *buf, size_t len, int flags); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + size_t (*ecp_write_addr) (struct parport *port, + const void *buf, size_t len, int flags); + ... + }; DESCRIPTION +^^^^^^^^^^^ -Writes a block of ECP addresses. The 'flags' parameter is ignored. +Writes a block of ECP addresses. The ``flags`` parameter is ignored. RETURN VALUE +^^^^^^^^^^^^ The number of bytes written. NOTES +^^^^^ This may use a FIFO, and if so shall not return until the FIFO is empty. SEE ALSO +^^^^^^^^ ecp_read_data, ecp_write_data - + + + port->ops->nibble_read_data - read a block of data in nibble mode ---------------------------- +----------------------------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - size_t (*nibble_read_data) (struct parport *port, - void *buf, size_t len, int flags); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + size_t (*nibble_read_data) (struct parport *port, + void *buf, size_t len, int flags); + ... + }; DESCRIPTION +^^^^^^^^^^^ -Reads a block of data in nibble mode. The 'flags' parameter is ignored. +Reads a block of data in nibble mode. The ``flags`` parameter is ignored. RETURN VALUE +^^^^^^^^^^^^ The number of whole bytes read. SEE ALSO +^^^^^^^^ byte_read_data, compat_write_data + + port->ops->byte_read_data - read a block of data in byte mode -------------------------- +------------------------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - size_t (*byte_read_data) (struct parport *port, - void *buf, size_t len, int flags); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + size_t (*byte_read_data) (struct parport *port, + void *buf, size_t len, int flags); + ... + }; DESCRIPTION +^^^^^^^^^^^ -Reads a block of data in byte mode. The 'flags' parameter is ignored. +Reads a block of data in byte mode. The ``flags`` parameter is ignored. RETURN VALUE +^^^^^^^^^^^^ The number of bytes read. SEE ALSO +^^^^^^^^ nibble_read_data, compat_write_data + + port->ops->compat_write_data - write a block of data in compatibility mode ----------------------------- +-------------------------------------------------------------------------- SYNOPSIS +^^^^^^^^ -#include <linux/parport.h> +:: -struct parport_operations { - ... - size_t (*compat_write_data) (struct parport *port, - const void *buf, size_t len, int flags); - ... -}; + #include <linux/parport.h> + + struct parport_operations { + ... + size_t (*compat_write_data) (struct parport *port, + const void *buf, size_t len, int flags); + ... + }; DESCRIPTION +^^^^^^^^^^^ -Writes a block of data in compatibility mode. The 'flags' parameter +Writes a block of data in compatibility mode. The ``flags`` parameter is ignored. RETURN VALUE +^^^^^^^^^^^^ The number of bytes written. SEE ALSO +^^^^^^^^ nibble_read_data, byte_read_data diff --git a/Documentation/percpu-rw-semaphore.txt b/Documentation/percpu-rw-semaphore.txt index 7d3c82431909..247de6410855 100644 --- a/Documentation/percpu-rw-semaphore.txt +++ b/Documentation/percpu-rw-semaphore.txt @@ -1,5 +1,6 @@ +==================== Percpu rw semaphores --------------------- +==================== Percpu rw semaphores is a new read-write semaphore design that is optimized for locking for reading. diff --git a/Documentation/phy.txt b/Documentation/phy.txt index 383cdd863f08..457c3e0f86d6 100644 --- a/Documentation/phy.txt +++ b/Documentation/phy.txt @@ -1,10 +1,14 @@ - PHY SUBSYSTEM - Kishon Vijay Abraham I <kishon@ti.com> +============= +PHY subsystem +============= + +:Author: Kishon Vijay Abraham I <kishon@ti.com> This document explains the Generic PHY Framework along with the APIs provided, and how-to-use. -1. Introduction +Introduction +============ *PHY* is the abbreviation for physical layer. It is used to connect a device to the physical medium e.g., the USB controller has a PHY to provide functions @@ -21,7 +25,8 @@ better code maintainability. This framework will be of use only to devices that use external PHY (PHY functionality is not embedded within the controller). -2. Registering/Unregistering the PHY provider +Registering/Unregistering the PHY provider +========================================== PHY provider refers to an entity that implements one or more PHY instances. For the simple case where the PHY provider implements only a single instance of @@ -30,11 +35,14 @@ of_phy_simple_xlate. If the PHY provider implements multiple instances, it should provide its own implementation of of_xlate. of_xlate is used only for dt boot case. -#define of_phy_provider_register(dev, xlate) \ - __of_phy_provider_register((dev), NULL, THIS_MODULE, (xlate)) +:: + + #define of_phy_provider_register(dev, xlate) \ + __of_phy_provider_register((dev), NULL, THIS_MODULE, (xlate)) -#define devm_of_phy_provider_register(dev, xlate) \ - __devm_of_phy_provider_register((dev), NULL, THIS_MODULE, (xlate)) + #define devm_of_phy_provider_register(dev, xlate) \ + __devm_of_phy_provider_register((dev), NULL, THIS_MODULE, + (xlate)) of_phy_provider_register and devm_of_phy_provider_register macros can be used to register the phy_provider and it takes device and of_xlate as @@ -47,28 +55,35 @@ nodes within extra levels for context and extensibility, in which case the low level of_phy_provider_register_full() and devm_of_phy_provider_register_full() macros can be used to override the node containing the children. -#define of_phy_provider_register_full(dev, children, xlate) \ - __of_phy_provider_register(dev, children, THIS_MODULE, xlate) +:: + + #define of_phy_provider_register_full(dev, children, xlate) \ + __of_phy_provider_register(dev, children, THIS_MODULE, xlate) -#define devm_of_phy_provider_register_full(dev, children, xlate) \ - __devm_of_phy_provider_register_full(dev, children, THIS_MODULE, xlate) + #define devm_of_phy_provider_register_full(dev, children, xlate) \ + __devm_of_phy_provider_register_full(dev, children, + THIS_MODULE, xlate) -void devm_of_phy_provider_unregister(struct device *dev, - struct phy_provider *phy_provider); -void of_phy_provider_unregister(struct phy_provider *phy_provider); + void devm_of_phy_provider_unregister(struct device *dev, + struct phy_provider *phy_provider); + void of_phy_provider_unregister(struct phy_provider *phy_provider); devm_of_phy_provider_unregister and of_phy_provider_unregister can be used to unregister the PHY. -3. Creating the PHY +Creating the PHY +================ The PHY driver should create the PHY in order for other peripheral controllers to make use of it. The PHY framework provides 2 APIs to create the PHY. -struct phy *phy_create(struct device *dev, struct device_node *node, - const struct phy_ops *ops); -struct phy *devm_phy_create(struct device *dev, struct device_node *node, - const struct phy_ops *ops); +:: + + struct phy *phy_create(struct device *dev, struct device_node *node, + const struct phy_ops *ops); + struct phy *devm_phy_create(struct device *dev, + struct device_node *node, + const struct phy_ops *ops); The PHY drivers can use one of the above 2 APIs to create the PHY by passing the device pointer and phy ops. @@ -84,12 +99,16 @@ phy_ops to get back the private data. Before the controller can make use of the PHY, it has to get a reference to it. This framework provides the following APIs to get a reference to the PHY. -struct phy *phy_get(struct device *dev, const char *string); -struct phy *phy_optional_get(struct device *dev, const char *string); -struct phy *devm_phy_get(struct device *dev, const char *string); -struct phy *devm_phy_optional_get(struct device *dev, const char *string); -struct phy *devm_of_phy_get_by_index(struct device *dev, struct device_node *np, - int index); +:: + + struct phy *phy_get(struct device *dev, const char *string); + struct phy *phy_optional_get(struct device *dev, const char *string); + struct phy *devm_phy_get(struct device *dev, const char *string); + struct phy *devm_phy_optional_get(struct device *dev, + const char *string); + struct phy *devm_of_phy_get_by_index(struct device *dev, + struct device_node *np, + int index); phy_get, phy_optional_get, devm_phy_get and devm_phy_optional_get can be used to get the PHY. In the case of dt boot, the string arguments @@ -111,30 +130,35 @@ the phy_init() and phy_exit() calls, and phy_power_on() and phy_power_off() calls are all NOP when applied to a NULL phy. The NULL phy is useful in devices for handling optional phy devices. -5. Releasing a reference to the PHY +Releasing a reference to the PHY +================================ When the controller no longer needs the PHY, it has to release the reference to the PHY it has obtained using the APIs mentioned in the above section. The PHY framework provides 2 APIs to release a reference to the PHY. -void phy_put(struct phy *phy); -void devm_phy_put(struct device *dev, struct phy *phy); +:: + + void phy_put(struct phy *phy); + void devm_phy_put(struct device *dev, struct phy *phy); Both these APIs are used to release a reference to the PHY and devm_phy_put destroys the devres associated with this PHY. -6. Destroying the PHY +Destroying the PHY +================== When the driver that created the PHY is unloaded, it should destroy the PHY it -created using one of the following 2 APIs. +created using one of the following 2 APIs:: -void phy_destroy(struct phy *phy); -void devm_phy_destroy(struct device *dev, struct phy *phy); + void phy_destroy(struct phy *phy); + void devm_phy_destroy(struct device *dev, struct phy *phy); Both these APIs destroy the PHY and devm_phy_destroy destroys the devres associated with this PHY. -7. PM Runtime +PM Runtime +========== This subsystem is pm runtime enabled. So while creating the PHY, pm_runtime_enable of the phy device created by this subsystem is called and @@ -150,7 +174,8 @@ There are exported APIs like phy_pm_runtime_get, phy_pm_runtime_get_sync, phy_pm_runtime_put, phy_pm_runtime_put_sync, phy_pm_runtime_allow and phy_pm_runtime_forbid for performing PM operations. -8. PHY Mappings +PHY Mappings +============ In order to get reference to a PHY without help from DeviceTree, the framework offers lookups which can be compared to clkdev that allow clk structures to be @@ -158,12 +183,15 @@ bound to devices. A lookup can be made be made during runtime when a handle to the struct phy already exists. The framework offers the following API for registering and unregistering the -lookups. +lookups:: -int phy_create_lookup(struct phy *phy, const char *con_id, const char *dev_id); -void phy_remove_lookup(struct phy *phy, const char *con_id, const char *dev_id); + int phy_create_lookup(struct phy *phy, const char *con_id, + const char *dev_id); + void phy_remove_lookup(struct phy *phy, const char *con_id, + const char *dev_id); -9. DeviceTree Binding +DeviceTree Binding +================== The documentation for PHY dt binding can be found @ Documentation/devicetree/bindings/phy/phy-bindings.txt diff --git a/Documentation/pi-futex.txt b/Documentation/pi-futex.txt index 9a5bc8651c29..aafddbee7377 100644 --- a/Documentation/pi-futex.txt +++ b/Documentation/pi-futex.txt @@ -1,5 +1,6 @@ +====================== Lightweight PI-futexes ----------------------- +====================== We are calling them lightweight for 3 reasons: @@ -25,8 +26,8 @@ determinism and well-bound latencies. Even in the worst-case, PI will improve the statistical distribution of locking related application delays. -The longer reply: ------------------ +The longer reply +---------------- Firstly, sharing locks between multiple tasks is a common programming technique that often cannot be replaced with lockless algorithms. As we @@ -71,8 +72,8 @@ deterministic execution of the high-prio task: any medium-priority task could preempt the low-prio task while it holds the shared lock and executes the critical section, and could delay it indefinitely. -Implementation: ---------------- +Implementation +-------------- As mentioned before, the userspace fastpath of PI-enabled pthread mutexes involves no kernel work at all - they behave quite similarly to @@ -83,8 +84,8 @@ entering the kernel. To handle the slowpath, we have added two new futex ops: - FUTEX_LOCK_PI - FUTEX_UNLOCK_PI + - FUTEX_LOCK_PI + - FUTEX_UNLOCK_PI If the lock-acquire fastpath fails, [i.e. an atomic transition from 0 to TID fails], then FUTEX_LOCK_PI is called. The kernel does all the diff --git a/Documentation/pnp.txt b/Documentation/pnp.txt index 763e4659bf18..bab2d10631f0 100644 --- a/Documentation/pnp.txt +++ b/Documentation/pnp.txt @@ -1,98 +1,118 @@ +================================= Linux Plug and Play Documentation -by Adam Belay <ambx1@neo.rr.com> -last updated: Oct. 16, 2002 ---------------------------------------------------------------------------------------- +================================= +:Author: Adam Belay <ambx1@neo.rr.com> +:Last updated: Oct. 16, 2002 Overview -------- - Plug and Play provides a means of detecting and setting resources for legacy or + +Plug and Play provides a means of detecting and setting resources for legacy or otherwise unconfigurable devices. The Linux Plug and Play Layer provides these services to compatible drivers. - The User Interface ------------------ - The Linux Plug and Play user interface provides a means to activate PnP devices + +The Linux Plug and Play user interface provides a means to activate PnP devices for legacy and user level drivers that do not support Linux Plug and Play. The user interface is integrated into sysfs. In addition to the standard sysfs file the following are created in each device's directory: -id - displays a list of support EISA IDs -options - displays possible resource configurations -resources - displays currently allocated resources and allows resource changes +- id - displays a list of support EISA IDs +- options - displays possible resource configurations +- resources - displays currently allocated resources and allows resource changes --activating a device +activating a device +^^^^^^^^^^^^^^^^^^^ -#echo "auto" > resources +:: + + # echo "auto" > resources this will invoke the automatic resource config system to activate the device --manually activating a device +manually activating a device +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +:: + + # echo "manual <depnum> <mode>" > resources -#echo "manual <depnum> <mode>" > resources -<depnum> - the configuration number -<mode> - static or dynamic - static = for next boot - dynamic = now + <depnum> - the configuration number + <mode> - static or dynamic + static = for next boot + dynamic = now --disabling a device +disabling a device +^^^^^^^^^^^^^^^^^^ -#echo "disable" > resources +:: + + # echo "disable" > resources EXAMPLE: Suppose you need to activate the floppy disk controller. -1.) change to the proper directory, in my case it is -/driver/bus/pnp/devices/00:0f -# cd /driver/bus/pnp/devices/00:0f -# cat name -PC standard floppy disk controller - -2.) check if the device is already active -# cat resources -DISABLED - -- Notice the string "DISABLED". This means the device is not active. - -3.) check the device's possible configurations (optional) -# cat options -Dependent: 01 - Priority acceptable - port 0x3f0-0x3f0, align 0x7, size 0x6, 16-bit address decoding - port 0x3f7-0x3f7, align 0x0, size 0x1, 16-bit address decoding - irq 6 - dma 2 8-bit compatible -Dependent: 02 - Priority acceptable - port 0x370-0x370, align 0x7, size 0x6, 16-bit address decoding - port 0x377-0x377, align 0x0, size 0x1, 16-bit address decoding - irq 6 - dma 2 8-bit compatible - -4.) now activate the device -# echo "auto" > resources - -5.) finally check if the device is active -# cat resources -io 0x3f0-0x3f5 -io 0x3f7-0x3f7 -irq 6 -dma 2 - -also there are a series of kernel parameters: -pnp_reserve_irq=irq1[,irq2] .... -pnp_reserve_dma=dma1[,dma2] .... -pnp_reserve_io=io1,size1[,io2,size2] .... -pnp_reserve_mem=mem1,size1[,mem2,size2] .... + +1. change to the proper directory, in my case it is + /driver/bus/pnp/devices/00:0f:: + + # cd /driver/bus/pnp/devices/00:0f + # cat name + PC standard floppy disk controller + +2. check if the device is already active:: + + # cat resources + DISABLED + + - Notice the string "DISABLED". This means the device is not active. + +3. check the device's possible configurations (optional):: + + # cat options + Dependent: 01 - Priority acceptable + port 0x3f0-0x3f0, align 0x7, size 0x6, 16-bit address decoding + port 0x3f7-0x3f7, align 0x0, size 0x1, 16-bit address decoding + irq 6 + dma 2 8-bit compatible + Dependent: 02 - Priority acceptable + port 0x370-0x370, align 0x7, size 0x6, 16-bit address decoding + port 0x377-0x377, align 0x0, size 0x1, 16-bit address decoding + irq 6 + dma 2 8-bit compatible + +4. now activate the device:: + + # echo "auto" > resources + +5. finally check if the device is active:: + + # cat resources + io 0x3f0-0x3f5 + io 0x3f7-0x3f7 + irq 6 + dma 2 + +also there are a series of kernel parameters:: + + pnp_reserve_irq=irq1[,irq2] .... + pnp_reserve_dma=dma1[,dma2] .... + pnp_reserve_io=io1,size1[,io2,size2] .... + pnp_reserve_mem=mem1,size1[,mem2,size2] .... The Unified Plug and Play Layer ------------------------------- - All Plug and Play drivers, protocols, and services meet at a central location + +All Plug and Play drivers, protocols, and services meet at a central location called the Plug and Play Layer. This layer is responsible for the exchange of information between PnP drivers and PnP protocols. Thus it automatically forwards commands to the proper protocol. This makes writing PnP drivers @@ -101,64 +121,73 @@ significantly easier. The following functions are available from the Plug and Play Layer: pnp_get_protocol -- increments the number of uses by one + increments the number of uses by one pnp_put_protocol -- deincrements the number of uses by one + deincrements the number of uses by one pnp_register_protocol -- use this to register a new PnP protocol + use this to register a new PnP protocol pnp_unregister_protocol -- use this function to remove a PnP protocol from the Plug and Play Layer + use this function to remove a PnP protocol from the Plug and Play Layer pnp_register_driver -- adds a PnP driver to the Plug and Play Layer -- this includes driver model integration -- returns zero for success or a negative error number for failure; count + adds a PnP driver to the Plug and Play Layer + + this includes driver model integration + returns zero for success or a negative error number for failure; count calls to the .add() method if you need to know how many devices bind to the driver pnp_unregister_driver -- removes a PnP driver from the Plug and Play Layer + removes a PnP driver from the Plug and Play Layer Plug and Play Protocols ----------------------- - This section contains information for PnP protocol developers. + +This section contains information for PnP protocol developers. The following Protocols are currently available in the computing world: -- PNPBIOS: used for system devices such as serial and parallel ports. -- ISAPNP: provides PnP support for the ISA bus -- ACPI: among its many uses, ACPI provides information about system level -devices. + +- PNPBIOS: + used for system devices such as serial and parallel ports. +- ISAPNP: + provides PnP support for the ISA bus +- ACPI: + among its many uses, ACPI provides information about system level + devices. + It is meant to replace the PNPBIOS. It is not currently supported by Linux Plug and Play but it is planned to be in the near future. Requirements for a Linux PnP protocol: -1.) the protocol must use EISA IDs -2.) the protocol must inform the PnP Layer of a device's current configuration +1. the protocol must use EISA IDs +2. the protocol must inform the PnP Layer of a device's current configuration + - the ability to set resources is optional but preferred. The following are PnP protocol related functions: pnp_add_device -- use this function to add a PnP device to the PnP layer -- only call this function when all wanted values are set in the pnp_dev -structure + use this function to add a PnP device to the PnP layer + + only call this function when all wanted values are set in the pnp_dev + structure pnp_init_device -- call this to initialize the PnP structure + call this to initialize the PnP structure pnp_remove_device -- call this to remove a device from the Plug and Play Layer. -- it will fail if the device is still in use. -- automatically will free mem used by the device and related structures + call this to remove a device from the Plug and Play Layer. + it will fail if the device is still in use. + automatically will free mem used by the device and related structures pnp_add_id -- adds an EISA ID to the list of supported IDs for the specified device + adds an EISA ID to the list of supported IDs for the specified device For more information consult the source of a protocol such as /drivers/pnp/pnpbios/core.c. @@ -167,85 +196,97 @@ For more information consult the source of a protocol such as Linux Plug and Play Drivers --------------------------- - This section contains information for Linux PnP driver developers. + +This section contains information for Linux PnP driver developers. The New Way -........... -1.) first make a list of supported EISA IDS -ex: -static const struct pnp_id pnp_dev_table[] = { - /* Standard LPT Printer Port */ - {.id = "PNP0400", .driver_data = 0}, - /* ECP Printer Port */ - {.id = "PNP0401", .driver_data = 0}, - {.id = ""} -}; - -Please note that the character 'X' can be used as a wild card in the function -portion (last four characters). -ex: +^^^^^^^^^^^ + +1. first make a list of supported EISA IDS + + ex:: + + static const struct pnp_id pnp_dev_table[] = { + /* Standard LPT Printer Port */ + {.id = "PNP0400", .driver_data = 0}, + /* ECP Printer Port */ + {.id = "PNP0401", .driver_data = 0}, + {.id = ""} + }; + + Please note that the character 'X' can be used as a wild card in the function + portion (last four characters). + + ex:: + /* Unknown PnP modems */ { "PNPCXXX", UNKNOWN_DEV }, -Supported PnP card IDs can optionally be defined. -ex: -static const struct pnp_id pnp_card_table[] = { - { "ANYDEVS", 0 }, - { "", 0 } -}; - -2.) Optionally define probe and remove functions. It may make sense not to -define these functions if the driver already has a reliable method of detecting -the resources, such as the parport_pc driver. -ex: -static int -serial_pnp_probe(struct pnp_dev * dev, const struct pnp_id *card_id, const - struct pnp_id *dev_id) -{ -. . . - -ex: -static void serial_pnp_remove(struct pnp_dev * dev) -{ -. . . - -consult /drivers/serial/8250_pnp.c for more information. - -3.) create a driver structure -ex: - -static struct pnp_driver serial_pnp_driver = { - .name = "serial", - .card_id_table = pnp_card_table, - .id_table = pnp_dev_table, - .probe = serial_pnp_probe, - .remove = serial_pnp_remove, -}; - -* name and id_table cannot be NULL. - -4.) register the driver -ex: - -static int __init serial8250_pnp_init(void) -{ - return pnp_register_driver(&serial_pnp_driver); -} + Supported PnP card IDs can optionally be defined. + ex:: + + static const struct pnp_id pnp_card_table[] = { + { "ANYDEVS", 0 }, + { "", 0 } + }; + +2. Optionally define probe and remove functions. It may make sense not to + define these functions if the driver already has a reliable method of detecting + the resources, such as the parport_pc driver. + + ex:: + + static int + serial_pnp_probe(struct pnp_dev * dev, const struct pnp_id *card_id, const + struct pnp_id *dev_id) + { + . . . + + ex:: + + static void serial_pnp_remove(struct pnp_dev * dev) + { + . . . + + consult /drivers/serial/8250_pnp.c for more information. + +3. create a driver structure + + ex:: + + static struct pnp_driver serial_pnp_driver = { + .name = "serial", + .card_id_table = pnp_card_table, + .id_table = pnp_dev_table, + .probe = serial_pnp_probe, + .remove = serial_pnp_remove, + }; + + * name and id_table cannot be NULL. + +4. register the driver + + ex:: + + static int __init serial8250_pnp_init(void) + { + return pnp_register_driver(&serial_pnp_driver); + } The Old Way -........... +^^^^^^^^^^^ A series of compatibility functions have been created to make it easy to convert ISAPNP drivers. They should serve as a temporary solution only. -They are as follows: +They are as follows:: -struct pnp_card *pnp_find_card(unsigned short vendor, - unsigned short device, - struct pnp_card *from) + struct pnp_card *pnp_find_card(unsigned short vendor, + unsigned short device, + struct pnp_card *from) -struct pnp_dev *pnp_find_dev(struct pnp_card *card, - unsigned short vendor, - unsigned short function, - struct pnp_dev *from) + struct pnp_dev *pnp_find_dev(struct pnp_card *card, + unsigned short vendor, + unsigned short function, + struct pnp_dev *from) diff --git a/Documentation/power/power_supply_class.txt b/Documentation/power/power_supply_class.txt index 0c72588bd967..300d37896e51 100644 --- a/Documentation/power/power_supply_class.txt +++ b/Documentation/power/power_supply_class.txt @@ -115,28 +115,33 @@ of charge when battery became full/empty". It also could mean "value of charge when battery considered full/empty at given conditions (temperature, age)". I.e. these attributes represents real thresholds, not design values. +ENERGY_FULL, ENERGY_EMPTY - same as above but for energy. + CHARGE_COUNTER - the current charge counter (in ยตAh). This could easily be negative; there is no empty or full value. It is only useful for relative, time-based measurements. +PRECHARGE_CURRENT - the maximum charge current during precharge phase +of charge cycle (typically 20% of battery capacity). +CHARGE_TERM_CURRENT - Charge termination current. The charge cycle +terminates when battery voltage is above recharge threshold, and charge +current is below this setting (typically 10% of battery capacity). + CONSTANT_CHARGE_CURRENT - constant charge current programmed by charger. CONSTANT_CHARGE_CURRENT_MAX - maximum charge current supported by the power supply object. -INPUT_CURRENT_LIMIT - input current limit programmed by charger. Indicates -the current drawn from a charging source. -CHARGE_TERM_CURRENT - Charge termination current used to detect the end of charge -condition. - -CALIBRATE - battery or coulomb counter calibration status CONSTANT_CHARGE_VOLTAGE - constant charge voltage programmed by charger. CONSTANT_CHARGE_VOLTAGE_MAX - maximum charge voltage supported by the power supply object. +INPUT_CURRENT_LIMIT - input current limit programmed by charger. Indicates +the current drawn from a charging source. + CHARGE_CONTROL_LIMIT - current charge control limit setting CHARGE_CONTROL_LIMIT_MAX - maximum charge control limit setting -ENERGY_FULL, ENERGY_EMPTY - same as above but for energy. +CALIBRATE - battery or coulomb counter calibration status CAPACITY - capacity in percents. CAPACITY_ALERT_MIN - minimum capacity alert value in percents. @@ -174,6 +179,18 @@ issued by external power supply will notify supplicants via external_power_changed callback. +Devicetree battery characteristics +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Drivers should call power_supply_get_battery_info() to obtain battery +characteristics from a devicetree battery node, defined in +Documentation/devicetree/bindings/power/supply/battery.txt. This is +implemented in drivers/power/supply/bq27xxx_battery.c. + +Properties in struct power_supply_battery_info and their counterparts in the +battery node have names corresponding to elements in enum power_supply_property, +for naming consistency between sysfs attributes and battery node properties. + + QA ~~ Q: Where is POWER_SUPPLY_PROP_XYZ attribute? diff --git a/Documentation/power/runtime_pm.txt b/Documentation/power/runtime_pm.txt index ee69d7532172..0fde3dcf077a 100644 --- a/Documentation/power/runtime_pm.txt +++ b/Documentation/power/runtime_pm.txt @@ -105,9 +105,9 @@ knows what to do to handle the device). In particular, if the driver requires remote wakeup capability (i.e. hardware mechanism allowing the device to request a change of its power state, such as -PCI PME) for proper functioning and device_run_wake() returns 'false' for the +PCI PME) for proper functioning and device_can_wakeup() returns 'false' for the device, then ->runtime_suspend() should return -EBUSY. On the other hand, if -device_run_wake() returns 'true' for the device and the device is put into a +device_can_wakeup() returns 'true' for the device and the device is put into a low-power state during the execution of the suspend callback, it is expected that remote wakeup will be enabled for the device. Generally, remote wakeup should be enabled for all input devices put into low-power states at run time. @@ -253,9 +253,6 @@ defined in include/linux/pm.h: being executed for that device and it is not practical to wait for the suspend to complete; means "start a resume as soon as you've suspended" - unsigned int run_wake; - - set if the device is capable of generating runtime wake-up events - enum rpm_status runtime_status; - the runtime PM status of the device; this field's initial value is RPM_SUSPENDED, which means that each device is initially regarded by the diff --git a/Documentation/powerpc/cxlflash.txt b/Documentation/powerpc/cxlflash.txt index 66b4496d6619..a64bdaa0a1cf 100644 --- a/Documentation/powerpc/cxlflash.txt +++ b/Documentation/powerpc/cxlflash.txt @@ -124,8 +124,8 @@ Block library API http://github.com/open-power/capiflash -CXL Flash Driver IOCTLs -======================= +CXL Flash Driver LUN IOCTLs +=========================== Users, such as the block library, that wish to interface with a flash device (LUN) via user space access need to use the services provided @@ -257,6 +257,12 @@ DK_CXLFLASH_VLUN_RESIZE operating in the virtual mode and used to program a LUN translation table that the AFU references when provided with a resource handle. + This ioctl can return -EAGAIN if an AFU sync operation takes too long. + In addition to returning a failure to user, cxlflash will also schedule + an asynchronous AFU reset. Should the user choose to retry the operation, + it is expected to succeed. If this ioctl fails with -EAGAIN, the user + can either retry the operation or treat it as a failure. + DK_CXLFLASH_RELEASE ------------------- This ioctl is responsible for releasing a previously obtained @@ -309,6 +315,12 @@ DK_CXLFLASH_VLUN_CLONE clone. This is to avoid a stale entry in the file descriptor table of the child process. + This ioctl can return -EAGAIN if an AFU sync operation takes too long. + In addition to returning a failure to user, cxlflash will also schedule + an asynchronous AFU reset. Should the user choose to retry the operation, + it is expected to succeed. If this ioctl fails with -EAGAIN, the user + can either retry the operation or treat it as a failure. + DK_CXLFLASH_VERIFY ------------------ This ioctl is used to detect various changes such as the capacity of @@ -355,3 +367,63 @@ DK_CXLFLASH_MANAGE_LUN exclusive user space access (superpipe). In case a LUN is visible across multiple ports and adapters, this ioctl is used to uniquely identify each LUN by its World Wide Node Name (WWNN). + + +CXL Flash Driver Host IOCTLs +============================ + + Each host adapter instance that is supported by the cxlflash driver + has a special character device associated with it to enable a set of + host management function. These character devices are hosted in a + class dedicated for cxlflash and can be accessed via /dev/cxlflash/*. + + Applications can be written to perform various functions using the + host ioctl APIs below. + + The structure definitions for these IOCTLs are available in: + uapi/scsi/cxlflash_ioctl.h + +HT_CXLFLASH_LUN_PROVISION +------------------------- + This ioctl is used to create and delete persistent LUNs on cxlflash + devices that lack an external LUN management interface. It is only + valid when used with AFUs that support the LUN provision capability. + + When sufficient space is available, LUNs can be created by specifying + the target port to host the LUN and a desired size in 4K blocks. Upon + success, the LUN ID and WWID of the created LUN will be returned and + the SCSI bus can be scanned to detect the change in LUN topology. Note + that partial allocations are not supported. Should a creation fail due + to a space issue, the target port can be queried for its current LUN + geometry. + + To remove a LUN, the device must first be disassociated from the Linux + SCSI subsystem. The LUN deletion can then be initiated by specifying a + target port and LUN ID. Upon success, the LUN geometry associated with + the port will be updated to reflect new number of provisioned LUNs and + available capacity. + + To query the LUN geometry of a port, the target port is specified and + upon success, the following information is presented: + + - Maximum number of provisioned LUNs allowed for the port + - Current number of provisioned LUNs for the port + - Maximum total capacity of provisioned LUNs for the port (4K blocks) + - Current total capacity of provisioned LUNs for the port (4K blocks) + + With this information, the number of available LUNs and capacity can be + can be calculated. + +HT_CXLFLASH_AFU_DEBUG +--------------------- + This ioctl is used to debug AFUs by supporting a command pass-through + interface. It is only valid when used with AFUs that support the AFU + debug capability. + + With exception of buffer management, AFU debug commands are opaque to + cxlflash and treated as pass-through. For debug commands that do require + data transfer, the user supplies an adequately sized data buffer and must + specify the data transfer direction with respect to the host. There is a + maximum transfer size of 256K imposed. Note that partial read completions + are not supported - when errors are experienced with a host read data + transfer, the data buffer is not copied back to the user. diff --git a/Documentation/powerpc/firmware-assisted-dump.txt b/Documentation/powerpc/firmware-assisted-dump.txt index 9cabaf8a207e..bdd344aa18d9 100644 --- a/Documentation/powerpc/firmware-assisted-dump.txt +++ b/Documentation/powerpc/firmware-assisted-dump.txt @@ -61,8 +61,8 @@ as follows: boot successfully. For syntax of crashkernel= parameter, refer to Documentation/kdump/kdump.txt. If any offset is provided in crashkernel= parameter, it will be ignored - as fadump reserves memory at end of RAM for boot memory - dump preservation in case of a crash. + as fadump uses a predefined offset to reserve memory + for boot memory dump preservation in case of a crash. -- After the low memory (boot memory) area has been saved, the firmware will reset PCI and other hardware state. It will diff --git a/Documentation/preempt-locking.txt b/Documentation/preempt-locking.txt index e89ce6624af2..c945062be66c 100644 --- a/Documentation/preempt-locking.txt +++ b/Documentation/preempt-locking.txt @@ -1,10 +1,13 @@ - Proper Locking Under a Preemptible Kernel: - Keeping Kernel Code Preempt-Safe - Robert Love <rml@tech9.net> - Last Updated: 28 Aug 2002 +=========================================================================== +Proper Locking Under a Preemptible Kernel: Keeping Kernel Code Preempt-Safe +=========================================================================== +:Author: Robert Love <rml@tech9.net> +:Last Updated: 28 Aug 2002 -INTRODUCTION + +Introduction +============ A preemptible kernel creates new locking issues. The issues are the same as @@ -17,9 +20,10 @@ requires protecting these situations. RULE #1: Per-CPU data structures need explicit protection +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Two similar problems arise. An example code snippet: +Two similar problems arise. An example code snippet:: struct this_needs_locking tux[NR_CPUS]; tux[smp_processor_id()] = some_value; @@ -35,6 +39,7 @@ You can also use put_cpu() and get_cpu(), which will disable preemption. RULE #2: CPU state must be protected. +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Under preemption, the state of the CPU must be protected. This is arch- @@ -52,6 +57,7 @@ However, fpu__restore() must be called with preemption disabled. RULE #3: Lock acquire and release must be performed by same task +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ A lock acquired in one task must be released by the same task. This @@ -61,17 +67,20 @@ like this, acquire and release the task in the same code path and have the caller wait on an event by the other task. -SOLUTION +Solution +======== Data protection under preemption is achieved by disabling preemption for the duration of the critical region. -preempt_enable() decrement the preempt counter -preempt_disable() increment the preempt counter -preempt_enable_no_resched() decrement, but do not immediately preempt -preempt_check_resched() if needed, reschedule -preempt_count() return the preempt counter +:: + + preempt_enable() decrement the preempt counter + preempt_disable() increment the preempt counter + preempt_enable_no_resched() decrement, but do not immediately preempt + preempt_check_resched() if needed, reschedule + preempt_count() return the preempt counter The functions are nestable. In other words, you can call preempt_disable n-times in a code path, and preemption will not be reenabled until the n-th @@ -89,7 +98,7 @@ So use this implicit preemption-disabling property only if you know that the affected codepath does not do any of this. Best policy is to use this only for small, atomic code that you wrote and which calls no complex functions. -Example: +Example:: cpucache_t *cc; /* this is per-CPU */ preempt_disable(); @@ -102,7 +111,7 @@ Example: return 0; Notice how the preemption statements must encompass every reference of the -critical variables. Another example: +critical variables. Another example:: int buf[NR_CPUS]; set_cpu_val(buf); @@ -114,7 +123,8 @@ This code is not preempt-safe, but see how easily we can fix it by simply moving the spin_lock up two lines. -PREVENTING PREEMPTION USING INTERRUPT DISABLING +Preventing preemption using interrupt disabling +=============================================== It is possible to prevent a preemption event using local_irq_disable and diff --git a/Documentation/printk-formats.txt b/Documentation/printk-formats.txt index 5962949944fd..65ea5915178b 100644 --- a/Documentation/printk-formats.txt +++ b/Documentation/printk-formats.txt @@ -1,5 +1,18 @@ -If variable is of Type, use printk format specifier: ---------------------------------------------------------- +========================================= +How to get printk format specifiers right +========================================= + +:Author: Randy Dunlap <rdunlap@infradead.org> +:Author: Andrew Murray <amurray@mpc-data.co.uk> + + +Integer types +============= + +:: + + If variable is of Type, use printk format specifier: + ------------------------------------------------------------ int %d or %x unsigned int %u or %x long %ld or %lx @@ -13,25 +26,29 @@ If variable is of Type, use printk format specifier: s64 %lld or %llx u64 %llu or %llx -If <type> is dependent on a config option for its size (e.g., sector_t, -blkcnt_t) or is architecture-dependent for its size (e.g., tcflag_t), use a -format specifier of its largest possible type and explicitly cast to it. -Example: +If <type> is dependent on a config option for its size (e.g., ``sector_t``, +``blkcnt_t``) or is architecture-dependent for its size (e.g., ``tcflag_t``), +use a format specifier of its largest possible type and explicitly cast to it. + +Example:: printk("test: sector number/total blocks: %llu/%llu\n", (unsigned long long)sector, (unsigned long long)blockcount); -Reminder: sizeof() result is of type size_t. +Reminder: ``sizeof()`` result is of type ``size_t``. -The kernel's printf does not support %n. For obvious reasons, floating -point formats (%e, %f, %g, %a) are also not recognized. Use of any +The kernel's printf does not support ``%n``. For obvious reasons, floating +point formats (``%e, %f, %g, %a``) are also not recognized. Use of any unsupported specifier or length qualifier results in a WARN and early return from vsnprintf. Raw pointer value SHOULD be printed with %p. The kernel supports the following extended format specifiers for pointer types: -Symbols/Function Pointers: +Symbols/Function Pointers +========================= + +:: %pF versatile_init+0x0/0x110 %pf versatile_init @@ -41,99 +58,122 @@ Symbols/Function Pointers: %ps versatile_init %pB prev_fn_of_versatile_init+0x88/0x88 - For printing symbols and function pointers. The 'S' and 's' specifiers - result in the symbol name with ('S') or without ('s') offsets. Where - this is used on a kernel without KALLSYMS - the symbol address is - printed instead. +For printing symbols and function pointers. The ``S`` and ``s`` specifiers +result in the symbol name with (``S``) or without (``s``) offsets. Where +this is used on a kernel without KALLSYMS - the symbol address is +printed instead. - The 'B' specifier results in the symbol name with offsets and should be - used when printing stack backtraces. The specifier takes into - consideration the effect of compiler optimisations which may occur - when tail-call's are used and marked with the noreturn GCC attribute. +The ``B`` specifier results in the symbol name with offsets and should be +used when printing stack backtraces. The specifier takes into +consideration the effect of compiler optimisations which may occur +when tail-call``s are used and marked with the noreturn GCC attribute. - On ia64, ppc64 and parisc64 architectures function pointers are - actually function descriptors which must first be resolved. The 'F' and - 'f' specifiers perform this resolution and then provide the same - functionality as the 'S' and 's' specifiers. +On ia64, ppc64 and parisc64 architectures function pointers are +actually function descriptors which must first be resolved. The ``F`` and +``f`` specifiers perform this resolution and then provide the same +functionality as the ``S`` and ``s`` specifiers. -Kernel Pointers: +Kernel Pointers +=============== + +:: %pK 0x01234567 or 0x0123456789abcdef - For printing kernel pointers which should be hidden from unprivileged - users. The behaviour of %pK depends on the kptr_restrict sysctl - see - Documentation/sysctl/kernel.txt for more details. +For printing kernel pointers which should be hidden from unprivileged +users. The behaviour of ``%pK`` depends on the ``kptr_restrict sysctl`` - see +Documentation/sysctl/kernel.txt for more details. + +Struct Resources +================ -Struct Resources: +:: %pr [mem 0x60000000-0x6fffffff flags 0x2200] or [mem 0x0000000060000000-0x000000006fffffff flags 0x2200] %pR [mem 0x60000000-0x6fffffff pref] or [mem 0x0000000060000000-0x000000006fffffff pref] - For printing struct resources. The 'R' and 'r' specifiers result in a - printed resource with ('R') or without ('r') a decoded flags member. - Passed by reference. +For printing struct resources. The ``R`` and ``r`` specifiers result in a +printed resource with (``R``) or without (``r``) a decoded flags member. +Passed by reference. -Physical addresses types phys_addr_t: +Physical addresses types ``phys_addr_t`` +======================================== + +:: %pa[p] 0x01234567 or 0x0123456789abcdef - For printing a phys_addr_t type (and its derivatives, such as - resource_size_t) which can vary based on build options, regardless of - the width of the CPU data path. Passed by reference. +For printing a ``phys_addr_t`` type (and its derivatives, such as +``resource_size_t``) which can vary based on build options, regardless of +the width of the CPU data path. Passed by reference. + +DMA addresses types ``dma_addr_t`` +================================== -DMA addresses types dma_addr_t: +:: %pad 0x01234567 or 0x0123456789abcdef - For printing a dma_addr_t type which can vary based on build options, - regardless of the width of the CPU data path. Passed by reference. +For printing a ``dma_addr_t`` type which can vary based on build options, +regardless of the width of the CPU data path. Passed by reference. + +Raw buffer as an escaped string +=============================== -Raw buffer as an escaped string: +:: %*pE[achnops] - For printing raw buffer as an escaped string. For the following buffer +For printing raw buffer as an escaped string. For the following buffer:: 1b 62 20 5c 43 07 22 90 0d 5d - few examples show how the conversion would be done (the result string - without surrounding quotes): +few examples show how the conversion would be done (the result string +without surrounding quotes):: %*pE "\eb \C\a"\220\r]" %*pEhp "\x1bb \C\x07"\x90\x0d]" %*pEa "\e\142\040\\\103\a\042\220\r\135" - The conversion rules are applied according to an optional combination - of flags (see string_escape_mem() kernel documentation for the - details): - a - ESCAPE_ANY - c - ESCAPE_SPECIAL - h - ESCAPE_HEX - n - ESCAPE_NULL - o - ESCAPE_OCTAL - p - ESCAPE_NP - s - ESCAPE_SPACE - By default ESCAPE_ANY_NP is used. +The conversion rules are applied according to an optional combination +of flags (see :c:func:`string_escape_mem` kernel documentation for the +details): - ESCAPE_ANY_NP is the sane choice for many cases, in particularly for - printing SSIDs. + - ``a`` - ESCAPE_ANY + - ``c`` - ESCAPE_SPECIAL + - ``h`` - ESCAPE_HEX + - ``n`` - ESCAPE_NULL + - ``o`` - ESCAPE_OCTAL + - ``p`` - ESCAPE_NP + - ``s`` - ESCAPE_SPACE - If field width is omitted the 1 byte only will be escaped. +By default ESCAPE_ANY_NP is used. -Raw buffer as a hex string: +ESCAPE_ANY_NP is the sane choice for many cases, in particularly for +printing SSIDs. + +If field width is omitted the 1 byte only will be escaped. + +Raw buffer as a hex string +========================== + +:: %*ph 00 01 02 ... 3f %*phC 00:01:02: ... :3f %*phD 00-01-02- ... -3f %*phN 000102 ... 3f - For printing a small buffers (up to 64 bytes long) as a hex string with - certain separator. For the larger buffers consider to use - print_hex_dump(). +For printing a small buffers (up to 64 bytes long) as a hex string with +certain separator. For the larger buffers consider to use +:c:func:`print_hex_dump`. + +MAC/FDDI addresses +================== -MAC/FDDI addresses: +:: %pM 00:01:02:03:04:05 %pMR 05:04:03:02:01:00 @@ -141,53 +181,62 @@ MAC/FDDI addresses: %pm 000102030405 %pmR 050403020100 - For printing 6-byte MAC/FDDI addresses in hex notation. The 'M' and 'm' - specifiers result in a printed address with ('M') or without ('m') byte - separators. The default byte separator is the colon (':'). +For printing 6-byte MAC/FDDI addresses in hex notation. The ``M`` and ``m`` +specifiers result in a printed address with (``M``) or without (``m``) byte +separators. The default byte separator is the colon (``:``). - Where FDDI addresses are concerned the 'F' specifier can be used after - the 'M' specifier to use dash ('-') separators instead of the default - separator. +Where FDDI addresses are concerned the ``F`` specifier can be used after +the ``M`` specifier to use dash (``-``) separators instead of the default +separator. - For Bluetooth addresses the 'R' specifier shall be used after the 'M' - specifier to use reversed byte order suitable for visual interpretation - of Bluetooth addresses which are in the little endian order. +For Bluetooth addresses the ``R`` specifier shall be used after the ``M`` +specifier to use reversed byte order suitable for visual interpretation +of Bluetooth addresses which are in the little endian order. - Passed by reference. +Passed by reference. + +IPv4 addresses +============== -IPv4 addresses: +:: %pI4 1.2.3.4 %pi4 001.002.003.004 %p[Ii]4[hnbl] - For printing IPv4 dot-separated decimal addresses. The 'I4' and 'i4' - specifiers result in a printed address with ('i4') or without ('I4') - leading zeros. +For printing IPv4 dot-separated decimal addresses. The ``I4`` and ``i4`` +specifiers result in a printed address with (``i4``) or without (``I4``) +leading zeros. - The additional 'h', 'n', 'b', and 'l' specifiers are used to specify - host, network, big or little endian order addresses respectively. Where - no specifier is provided the default network/big endian order is used. +The additional ``h``, ``n``, ``b``, and ``l`` specifiers are used to specify +host, network, big or little endian order addresses respectively. Where +no specifier is provided the default network/big endian order is used. - Passed by reference. +Passed by reference. -IPv6 addresses: +IPv6 addresses +============== + +:: %pI6 0001:0002:0003:0004:0005:0006:0007:0008 %pi6 00010002000300040005000600070008 %pI6c 1:2:3:4:5:6:7:8 - For printing IPv6 network-order 16-bit hex addresses. The 'I6' and 'i6' - specifiers result in a printed address with ('I6') or without ('i6') - colon-separators. Leading zeros are always used. +For printing IPv6 network-order 16-bit hex addresses. The ``I6`` and ``i6`` +specifiers result in a printed address with (``I6``) or without (``i6``) +colon-separators. Leading zeros are always used. - The additional 'c' specifier can be used with the 'I' specifier to - print a compressed IPv6 address as described by - http://tools.ietf.org/html/rfc5952 +The additional ``c`` specifier can be used with the ``I`` specifier to +print a compressed IPv6 address as described by +http://tools.ietf.org/html/rfc5952 - Passed by reference. +Passed by reference. + +IPv4/IPv6 addresses (generic, with port, flowinfo, scope) +========================================================= -IPv4/IPv6 addresses (generic, with port, flowinfo, scope): +:: %pIS 1.2.3.4 or 0001:0002:0003:0004:0005:0006:0007:0008 %piS 001.002.003.004 or 00010002000300040005000600070008 @@ -195,141 +244,202 @@ IPv4/IPv6 addresses (generic, with port, flowinfo, scope): %pISpc 1.2.3.4:12345 or [1:2:3:4:5:6:7:8]:12345 %p[Ii]S[pfschnbl] - For printing an IP address without the need to distinguish whether it's - of type AF_INET or AF_INET6, a pointer to a valid 'struct sockaddr', - specified through 'IS' or 'iS', can be passed to this format specifier. +For printing an IP address without the need to distinguish whether it``s +of type AF_INET or AF_INET6, a pointer to a valid ``struct sockaddr``, +specified through ``IS`` or ``iS``, can be passed to this format specifier. - The additional 'p', 'f', and 's' specifiers are used to specify port - (IPv4, IPv6), flowinfo (IPv6) and scope (IPv6). Ports have a ':' prefix, - flowinfo a '/' and scope a '%', each followed by the actual value. +The additional ``p``, ``f``, and ``s`` specifiers are used to specify port +(IPv4, IPv6), flowinfo (IPv6) and scope (IPv6). Ports have a ``:`` prefix, +flowinfo a ``/`` and scope a ``%``, each followed by the actual value. - In case of an IPv6 address the compressed IPv6 address as described by - http://tools.ietf.org/html/rfc5952 is being used if the additional - specifier 'c' is given. The IPv6 address is surrounded by '[', ']' in - case of additional specifiers 'p', 'f' or 's' as suggested by - https://tools.ietf.org/html/draft-ietf-6man-text-addr-representation-07 +In case of an IPv6 address the compressed IPv6 address as described by +http://tools.ietf.org/html/rfc5952 is being used if the additional +specifier ``c`` is given. The IPv6 address is surrounded by ``[``, ``]`` in +case of additional specifiers ``p``, ``f`` or ``s`` as suggested by +https://tools.ietf.org/html/draft-ietf-6man-text-addr-representation-07 - In case of IPv4 addresses, the additional 'h', 'n', 'b', and 'l' - specifiers can be used as well and are ignored in case of an IPv6 - address. +In case of IPv4 addresses, the additional ``h``, ``n``, ``b``, and ``l`` +specifiers can be used as well and are ignored in case of an IPv6 +address. - Passed by reference. +Passed by reference. - Further examples: +Further examples:: %pISfc 1.2.3.4 or [1:2:3:4:5:6:7:8]/123456789 %pISsc 1.2.3.4 or [1:2:3:4:5:6:7:8]%1234567890 %pISpfc 1.2.3.4:12345 or [1:2:3:4:5:6:7:8]:12345/123456789 -UUID/GUID addresses: +UUID/GUID addresses +=================== + +:: %pUb 00010203-0405-0607-0809-0a0b0c0d0e0f %pUB 00010203-0405-0607-0809-0A0B0C0D0E0F %pUl 03020100-0504-0706-0809-0a0b0c0e0e0f %pUL 03020100-0504-0706-0809-0A0B0C0E0E0F - For printing 16-byte UUID/GUIDs addresses. The additional 'l', 'L', - 'b' and 'B' specifiers are used to specify a little endian order in - lower ('l') or upper case ('L') hex characters - and big endian order - in lower ('b') or upper case ('B') hex characters. +For printing 16-byte UUID/GUIDs addresses. The additional 'l', 'L', +'b' and 'B' specifiers are used to specify a little endian order in +lower ('l') or upper case ('L') hex characters - and big endian order +in lower ('b') or upper case ('B') hex characters. - Where no additional specifiers are used the default big endian - order with lower case hex characters will be printed. +Where no additional specifiers are used the default big endian +order with lower case hex characters will be printed. - Passed by reference. +Passed by reference. -dentry names: +dentry names +============ + +:: %pd{,2,3,4} %pD{,2,3,4} - For printing dentry name; if we race with d_move(), the name might be - a mix of old and new ones, but it won't oops. %pd dentry is a safer - equivalent of %s dentry->d_name.name we used to use, %pd<n> prints - n last components. %pD does the same thing for struct file. +For printing dentry name; if we race with :c:func:`d_move`, the name might be +a mix of old and new ones, but it won't oops. ``%pd`` dentry is a safer +equivalent of ``%s`` ``dentry->d_name.name`` we used to use, ``%pd<n>`` prints +``n`` last components. ``%pD`` does the same thing for struct file. - Passed by reference. +Passed by reference. -block_device names: +block_device names +================== + +:: %pg sda, sda1 or loop0p1 - For printing name of block_device pointers. +For printing name of block_device pointers. + +struct va_format +================ -struct va_format: +:: %pV - For printing struct va_format structures. These contain a format string - and va_list as follows: +For printing struct va_format structures. These contain a format string +and va_list as follows:: struct va_format { const char *fmt; va_list *va; }; - Implements a "recursive vsnprintf". +Implements a "recursive vsnprintf". + +Do not use this feature without some mechanism to verify the +correctness of the format string and va_list arguments. + +Passed by reference. + +kobjects +======== + +:: + + %pO + + Base specifier for kobject based structs. Must be followed with + character for specific type of kobject as listed below: + + Device tree nodes: + + %pOF[fnpPcCF] + + For printing device tree nodes. The optional arguments are: + f device node full_name + n device node name + p device node phandle + P device node path spec (name + @unit) + F device node flags + c major compatible string + C full compatible string + Without any arguments prints full_name (same as %pOFf) + The separator when using multiple arguments is ':' - Do not use this feature without some mechanism to verify the - correctness of the format string and va_list arguments. + Examples: + + %pOF /foo/bar@0 - Node full name + %pOFf /foo/bar@0 - Same as above + %pOFfp /foo/bar@0:10 - Node full name + phandle + %pOFfcF /foo/bar@0:foo,device:--P- - Node full name + + major compatible string + + node flags + D - dynamic + d - detached + P - Populated + B - Populated bus Passed by reference. -struct clk: + +struct clk +========== + +:: %pC pll1 %pCn pll1 %pCr 1560000000 - For printing struct clk structures. '%pC' and '%pCn' print the name - (Common Clock Framework) or address (legacy clock framework) of the - structure; '%pCr' prints the current clock rate. +For printing struct clk structures. ``%pC`` and ``%pCn`` print the name +(Common Clock Framework) or address (legacy clock framework) of the +structure; ``%pCr`` prints the current clock rate. - Passed by reference. +Passed by reference. + +bitmap and its derivatives such as cpumask and nodemask +======================================================= -bitmap and its derivatives such as cpumask and nodemask: +:: %*pb 0779 %*pbl 0,3-6,8-10 - For printing bitmap and its derivatives such as cpumask and nodemask, - %*pb output the bitmap with field width as the number of bits and %*pbl - output the bitmap as range list with field width as the number of bits. +For printing bitmap and its derivatives such as cpumask and nodemask, +``%*pb`` output the bitmap with field width as the number of bits and ``%*pbl`` +output the bitmap as range list with field width as the number of bits. - Passed by reference. +Passed by reference. -Flags bitfields such as page flags, gfp_flags: +Flags bitfields such as page flags, gfp_flags +============================================= + +:: %pGp referenced|uptodate|lru|active|private %pGg GFP_USER|GFP_DMA32|GFP_NOWARN %pGv read|exec|mayread|maywrite|mayexec|denywrite - For printing flags bitfields as a collection of symbolic constants that - would construct the value. The type of flags is given by the third - character. Currently supported are [p]age flags, [v]ma_flags (both - expect unsigned long *) and [g]fp_flags (expects gfp_t *). The flag - names and print order depends on the particular type. +For printing flags bitfields as a collection of symbolic constants that +would construct the value. The type of flags is given by the third +character. Currently supported are [p]age flags, [v]ma_flags (both +expect ``unsigned long *``) and [g]fp_flags (expects ``gfp_t *``). The flag +names and print order depends on the particular type. - Note that this format should not be used directly in TP_printk() part - of a tracepoint. Instead, use the show_*_flags() functions from - <trace/events/mmflags.h>. +Note that this format should not be used directly in :c:func:`TP_printk()` part +of a tracepoint. Instead, use the ``show_*_flags()`` functions from +<trace/events/mmflags.h>. - Passed by reference. +Passed by reference. + +Network device features +======================= -Network device features: +:: %pNF 0x000000000000c000 - For printing netdev_features_t. +For printing netdev_features_t. - Passed by reference. +Passed by reference. -If you add other %p extensions, please extend lib/test_printf.c with +If you add other ``%p`` extensions, please extend lib/test_printf.c with one or more test cases, if at all feasible. Thank you for your cooperation and attention. - - -By Randy Dunlap <rdunlap@infradead.org> and -Andrew Murray <amurray@mpc-data.co.uk> diff --git a/Documentation/process/changes.rst b/Documentation/process/changes.rst index e25d63f8c0da..adbb50ae5246 100644 --- a/Documentation/process/changes.rst +++ b/Documentation/process/changes.rst @@ -31,7 +31,7 @@ you probably needn't concern yourself with isdn4k-utils. ====================== =============== ======================================== GNU C 3.2 gcc --version GNU make 3.81 make --version -binutils 2.12 ld -v +binutils 2.20 ld -v util-linux 2.10o fdformat --version module-init-tools 0.9.10 depmod -V e2fsprogs 1.41.4 e2fsck -V @@ -75,10 +75,9 @@ You will need GNU make 3.81 or later to build the kernel. Binutils -------- -Linux on IA-32 has recently switched from using ``as86`` to using ``gas`` for -assembling the 16-bit boot code, removing the need for ``as86`` to compile -your kernel. This change does, however, mean that you need a recent -release of binutils. +The build system has, as of 4.13, switched to using thin archives (`ar T`) +rather than incremental linking (`ld -r`) for built-in.o intermediate steps. +This requires binutils 2.20 or newer. Perl ---- @@ -116,12 +115,11 @@ DevFS has been obsoleted in favour of udev Linux documentation for functions is transitioning to inline documentation via specially-formatted comments near their -definitions in the source. These comments can be combined with the -SGML templates in the Documentation/DocBook directory to make DocBook -files, which can then be converted by DocBook stylesheets to PostScript, -HTML, PDF files, and several other formats. In order to convert from -DocBook format to a format of your choice, you'll need to install Jade as -well as the desired DocBook stylesheets. +definitions in the source. These comments can be combined with ReST +files the Documentation/ directory to make enriched documentation, which can +then be converted to PostScript, HTML, LaTex, ePUB and PDF files. +In order to convert from ReST format to a format of your choice, you'll need +Sphinx. Util-linux ---------- @@ -323,12 +321,6 @@ PDF outputs, it is recommended to use version 1.4.6. functionalities required for ``XeLaTex`` to work. For PDF output you'll also need ``convert(1)`` from ImageMagick (https://www.imagemagick.org). -Other tools ------------ - -In order to produce documentation from DocBook, you'll also need ``xmlto``. -Please notice, however, that we're currently migrating all documents to use -``Sphinx``. Getting updated software ======================== @@ -409,15 +401,6 @@ Quota-tools - <http://sourceforge.net/projects/linuxquota/> -DocBook Stylesheets -------------------- - -- <http://sourceforge.net/projects/docbook/files/docbook-dsssl/> - -XMLTO XSLT Frontend -------------------- - -- <http://cyberelk.net/tim/xmlto/> Intel P6 microcode ------------------ diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst index d20d52a4d812..a20b44a40ec4 100644 --- a/Documentation/process/coding-style.rst +++ b/Documentation/process/coding-style.rst @@ -980,8 +980,8 @@ do so, though, and doing so unnecessarily can limit optimization. When writing a single inline assembly statement containing multiple instructions, put each instruction on a separate line in a separate quoted -string, and end each string except the last with \n\t to properly indent the -next instruction in the assembly output: +string, and end each string except the last with ``\n\t`` to properly indent +the next instruction in the assembly output: .. code-block:: c diff --git a/Documentation/process/email-clients.rst b/Documentation/process/email-clients.rst index ac892b30815e..07faa5457bcb 100644 --- a/Documentation/process/email-clients.rst +++ b/Documentation/process/email-clients.rst @@ -167,6 +167,11 @@ Lotus Notes (GUI) Run away from it. +IBM Verse (Web GUI) +******************* + +See Lotus Notes. + Mutt (TUI) ********** diff --git a/Documentation/process/howto.rst b/Documentation/process/howto.rst index 1260f60d4cb9..c6875b1db56f 100644 --- a/Documentation/process/howto.rst +++ b/Documentation/process/howto.rst @@ -180,14 +180,6 @@ They can also be generated on LaTeX and ePub formats with:: make latexdocs make epubdocs -Currently, there are some documents written on DocBook that are in -the process of conversion to ReST. Such documents will be created in the -Documentation/DocBook/ directory and can be generated also as -Postscript or man pages by running:: - - make psdocs - make mandocs - Becoming A Kernel Developer --------------------------- diff --git a/Documentation/process/kernel-docs.rst b/Documentation/process/kernel-docs.rst index 05a7857a4a83..b8cac85a4001 100644 --- a/Documentation/process/kernel-docs.rst +++ b/Documentation/process/kernel-docs.rst @@ -40,50 +40,18 @@ Enjoy! Docs at the Linux Kernel tree ----------------------------- -The DocBook books should be built with ``make {htmldocs | psdocs | pdfdocs}``. The Sphinx books should be built with ``make {htmldocs | pdfdocs | epubdocs}``. * Name: **linux/Documentation** :Author: Many. :Location: Documentation/ - :Keywords: text files, Sphinx, DocBook. + :Keywords: text files, Sphinx. :Description: Documentation that comes with the kernel sources, inside the Documentation directory. Some pages from this document (including this document itself) have been moved there, and might be more up to date than the web version. - * Title: **The Kernel Hacking HOWTO** - - :Author: Various Talented People, and Rusty. - :Location: Documentation/DocBook/kernel-hacking.tmpl - :Keywords: HOWTO, kernel contexts, deadlock, locking, modules, - symbols, return conventions. - :Description: From the Introduction: "Please understand that I - never wanted to write this document, being grossly underqualified, - but I always wanted to read it, and this was the only way. I - simply explain some best practices, and give reading entry-points - into the kernel sources. I avoid implementation details: that's - what the code is for, and I ignore whole tracts of useful - routines. This document assumes familiarity with C, and an - understanding of what the kernel is, and how it is used. It was - originally written for the 2.3 kernels, but nearly all of it - applies to 2.2 too; 2.0 is slightly different". - - * Title: **Linux Kernel Locking HOWTO** - - :Author: Various Talented People, and Rusty. - :Location: Documentation/DocBook/kernel-locking.tmpl - :Keywords: locks, locking, spinlock, semaphore, atomic, race - condition, bottom halves, tasklets, softirqs. - :Description: The title says it all: document describing the - locking system in the Linux Kernel either in uniprocessor or SMP - systems. - :Notes: "It was originally written for the later (>2.3.47) 2.3 - kernels, but most of it applies to 2.2 too; 2.0 is slightly - different". Freely redistributable under the conditions of the GNU - General Public License. - On-line docs ------------ diff --git a/Documentation/pwm.txt b/Documentation/pwm.txt index 789b27c6ec99..8fbf0aa3ba2d 100644 --- a/Documentation/pwm.txt +++ b/Documentation/pwm.txt @@ -1,4 +1,6 @@ +====================================== Pulse Width Modulation (PWM) interface +====================================== This provides an overview about the Linux PWM interface @@ -16,7 +18,7 @@ Users of the legacy PWM API use unique IDs to refer to PWM devices. Instead of referring to a PWM device via its unique ID, board setup code should instead register a static mapping that can be used to match PWM -consumers to providers, as given in the following example: +consumers to providers, as given in the following example:: static struct pwm_lookup board_pwm_lookup[] = { PWM_LOOKUP("tegra-pwm", 0, "pwm-backlight", NULL, @@ -40,9 +42,9 @@ New users should use the pwm_get() function and pass to it the consumer device or a consumer name. pwm_put() is used to free the PWM device. Managed variants of these functions, devm_pwm_get() and devm_pwm_put(), also exist. -After being requested, a PWM has to be configured using: +After being requested, a PWM has to be configured using:: -int pwm_apply_state(struct pwm_device *pwm, struct pwm_state *state); + int pwm_apply_state(struct pwm_device *pwm, struct pwm_state *state); This API controls both the PWM period/duty_cycle config and the enable/disable state. @@ -72,11 +74,14 @@ interface is provided to use the PWMs from userspace. It is exposed at pwmchipN, where N is the base of the PWM chip. Inside the directory you will find: -npwm - The number of PWM channels this chip supports (read-only). + npwm + The number of PWM channels this chip supports (read-only). -export - Exports a PWM channel for use with sysfs (write-only). + export + Exports a PWM channel for use with sysfs (write-only). -unexport - Unexports a PWM channel from sysfs (write-only). + unexport + Unexports a PWM channel from sysfs (write-only). The PWM channels are numbered using a per-chip index from 0 to npwm-1. @@ -84,21 +89,26 @@ When a PWM channel is exported a pwmX directory will be created in the pwmchipN directory it is associated with, where X is the number of the channel that was exported. The following properties will then be available: -period - The total period of the PWM signal (read/write). - Value is in nanoseconds and is the sum of the active and inactive - time of the PWM. + period + The total period of the PWM signal (read/write). + Value is in nanoseconds and is the sum of the active and inactive + time of the PWM. -duty_cycle - The active time of the PWM signal (read/write). - Value is in nanoseconds and must be less than the period. + duty_cycle + The active time of the PWM signal (read/write). + Value is in nanoseconds and must be less than the period. -polarity - Changes the polarity of the PWM signal (read/write). - Writes to this property only work if the PWM chip supports changing - the polarity. The polarity can only be changed if the PWM is not - enabled. Value is the string "normal" or "inversed". + polarity + Changes the polarity of the PWM signal (read/write). + Writes to this property only work if the PWM chip supports changing + the polarity. The polarity can only be changed if the PWM is not + enabled. Value is the string "normal" or "inversed". -enable - Enable/disable the PWM signal (read/write). - 0 - disabled - 1 - enabled + enable + Enable/disable the PWM signal (read/write). + + - 0 - disabled + - 1 - enabled Implementing a PWM driver ------------------------- diff --git a/Documentation/rbtree.txt b/Documentation/rbtree.txt index b9d9cc57be18..b8a8c70b0188 100644 --- a/Documentation/rbtree.txt +++ b/Documentation/rbtree.txt @@ -1,7 +1,10 @@ +================================= Red-black Trees (rbtree) in Linux -January 18, 2007 -Rob Landley <rob@landley.net> -============================= +================================= + + +:Date: January 18, 2007 +:Author: Rob Landley <rob@landley.net> What are red-black trees, and what are they for? ------------------------------------------------ @@ -56,7 +59,7 @@ user of the rbtree code. Creating a new rbtree --------------------- -Data nodes in an rbtree tree are structures containing a struct rb_node member: +Data nodes in an rbtree tree are structures containing a struct rb_node member:: struct mytype { struct rb_node node; @@ -78,7 +81,7 @@ Searching for a value in an rbtree Writing a search function for your tree is fairly straightforward: start at the root, compare each value, and follow the left or right branch as necessary. -Example: +Example:: struct mytype *my_search(struct rb_root *root, char *string) { @@ -110,7 +113,7 @@ The search for insertion differs from the previous search by finding the location of the pointer on which to graft the new node. The new node also needs a link to its parent node for rebalancing purposes. -Example: +Example:: int my_insert(struct rb_root *root, struct mytype *data) { @@ -140,11 +143,11 @@ Example: Removing or replacing existing data in an rbtree ------------------------------------------------ -To remove an existing node from a tree, call: +To remove an existing node from a tree, call:: void rb_erase(struct rb_node *victim, struct rb_root *tree); -Example: +Example:: struct mytype *data = mysearch(&mytree, "walrus"); @@ -153,7 +156,7 @@ Example: myfree(data); } -To replace an existing node in a tree with a new one with the same key, call: +To replace an existing node in a tree with a new one with the same key, call:: void rb_replace_node(struct rb_node *old, struct rb_node *new, struct rb_root *tree); @@ -166,7 +169,7 @@ Iterating through the elements stored in an rbtree (in sort order) Four functions are provided for iterating through an rbtree's contents in sorted order. These work on arbitrary trees, and should not need to be -modified or wrapped (except for locking purposes): +modified or wrapped (except for locking purposes):: struct rb_node *rb_first(struct rb_root *tree); struct rb_node *rb_last(struct rb_root *tree); @@ -184,7 +187,7 @@ which the containing data structure may be accessed with the container_of() macro, and individual members may be accessed directly via rb_entry(node, type, member). -Example: +Example:: struct rb_node *node; for (node = rb_first(&mytree); node; node = rb_next(node)) @@ -241,7 +244,8 @@ user should have a single rb_erase_augmented() call site in order to limit compiled code size. -Sample usage: +Sample usage +^^^^^^^^^^^^ Interval tree is an example of augmented rb tree. Reference - "Introduction to Algorithms" by Cormen, Leiserson, Rivest and Stein. @@ -259,12 +263,12 @@ This "extra information" stored in each node is the maximum hi information can be maintained at each node just be looking at the node and its immediate children. And this will be used in O(log n) lookup for lowest match (lowest start address among all possible matches) -with something like: +with something like:: -struct interval_tree_node * -interval_tree_first_match(struct rb_root *root, - unsigned long start, unsigned long last) -{ + struct interval_tree_node * + interval_tree_first_match(struct rb_root *root, + unsigned long start, unsigned long last) + { struct interval_tree_node *node; if (!root->rb_node) @@ -301,13 +305,13 @@ interval_tree_first_match(struct rb_root *root, } return NULL; /* No match */ } -} + } -Insertion/removal are defined using the following augmented callbacks: +Insertion/removal are defined using the following augmented callbacks:: -static inline unsigned long -compute_subtree_last(struct interval_tree_node *node) -{ + static inline unsigned long + compute_subtree_last(struct interval_tree_node *node) + { unsigned long max = node->last, subtree_last; if (node->rb.rb_left) { subtree_last = rb_entry(node->rb.rb_left, @@ -322,10 +326,10 @@ compute_subtree_last(struct interval_tree_node *node) max = subtree_last; } return max; -} + } -static void augment_propagate(struct rb_node *rb, struct rb_node *stop) -{ + static void augment_propagate(struct rb_node *rb, struct rb_node *stop) + { while (rb != stop) { struct interval_tree_node *node = rb_entry(rb, struct interval_tree_node, rb); @@ -335,20 +339,20 @@ static void augment_propagate(struct rb_node *rb, struct rb_node *stop) node->__subtree_last = subtree_last; rb = rb_parent(&node->rb); } -} + } -static void augment_copy(struct rb_node *rb_old, struct rb_node *rb_new) -{ + static void augment_copy(struct rb_node *rb_old, struct rb_node *rb_new) + { struct interval_tree_node *old = rb_entry(rb_old, struct interval_tree_node, rb); struct interval_tree_node *new = rb_entry(rb_new, struct interval_tree_node, rb); new->__subtree_last = old->__subtree_last; -} + } -static void augment_rotate(struct rb_node *rb_old, struct rb_node *rb_new) -{ + static void augment_rotate(struct rb_node *rb_old, struct rb_node *rb_new) + { struct interval_tree_node *old = rb_entry(rb_old, struct interval_tree_node, rb); struct interval_tree_node *new = @@ -356,15 +360,15 @@ static void augment_rotate(struct rb_node *rb_old, struct rb_node *rb_new) new->__subtree_last = old->__subtree_last; old->__subtree_last = compute_subtree_last(old); -} + } -static const struct rb_augment_callbacks augment_callbacks = { + static const struct rb_augment_callbacks augment_callbacks = { augment_propagate, augment_copy, augment_rotate -}; + }; -void interval_tree_insert(struct interval_tree_node *node, - struct rb_root *root) -{ + void interval_tree_insert(struct interval_tree_node *node, + struct rb_root *root) + { struct rb_node **link = &root->rb_node, *rb_parent = NULL; unsigned long start = node->start, last = node->last; struct interval_tree_node *parent; @@ -383,10 +387,10 @@ void interval_tree_insert(struct interval_tree_node *node, node->__subtree_last = last; rb_link_node(&node->rb, rb_parent, link); rb_insert_augmented(&node->rb, root, &augment_callbacks); -} + } -void interval_tree_remove(struct interval_tree_node *node, - struct rb_root *root) -{ + void interval_tree_remove(struct interval_tree_node *node, + struct rb_root *root) + { rb_erase_augmented(&node->rb, root, &augment_callbacks); -} + } diff --git a/Documentation/remoteproc.txt b/Documentation/remoteproc.txt index f07597482351..77fb03acdbb4 100644 --- a/Documentation/remoteproc.txt +++ b/Documentation/remoteproc.txt @@ -1,6 +1,9 @@ +========================== Remote Processor Framework +========================== -1. Introduction +Introduction +============ Modern SoCs typically have heterogeneous remote processor devices in asymmetric multiprocessing (AMP) configurations, which may be running different instances @@ -26,44 +29,62 @@ remoteproc will add those devices. This makes it possible to reuse the existing virtio drivers with remote processor backends at a minimal development cost. -2. User API +User API +======== + +:: int rproc_boot(struct rproc *rproc) - - Boot a remote processor (i.e. load its firmware, power it on, ...). - If the remote processor is already powered on, this function immediately - returns (successfully). - Returns 0 on success, and an appropriate error value otherwise. - Note: to use this function you should already have a valid rproc - handle. There are several ways to achieve that cleanly (devres, pdata, - the way remoteproc_rpmsg.c does this, or, if this becomes prevalent, we - might also consider using dev_archdata for this). + +Boot a remote processor (i.e. load its firmware, power it on, ...). + +If the remote processor is already powered on, this function immediately +returns (successfully). + +Returns 0 on success, and an appropriate error value otherwise. +Note: to use this function you should already have a valid rproc +handle. There are several ways to achieve that cleanly (devres, pdata, +the way remoteproc_rpmsg.c does this, or, if this becomes prevalent, we +might also consider using dev_archdata for this). + +:: void rproc_shutdown(struct rproc *rproc) - - Power off a remote processor (previously booted with rproc_boot()). - In case @rproc is still being used by an additional user(s), then - this function will just decrement the power refcount and exit, - without really powering off the device. - Every call to rproc_boot() must (eventually) be accompanied by a call - to rproc_shutdown(). Calling rproc_shutdown() redundantly is a bug. - Notes: - - we're not decrementing the rproc's refcount, only the power refcount. - which means that the @rproc handle stays valid even after - rproc_shutdown() returns, and users can still use it with a subsequent - rproc_boot(), if needed. + +Power off a remote processor (previously booted with rproc_boot()). +In case @rproc is still being used by an additional user(s), then +this function will just decrement the power refcount and exit, +without really powering off the device. + +Every call to rproc_boot() must (eventually) be accompanied by a call +to rproc_shutdown(). Calling rproc_shutdown() redundantly is a bug. + +.. note:: + + we're not decrementing the rproc's refcount, only the power refcount. + which means that the @rproc handle stays valid even after + rproc_shutdown() returns, and users can still use it with a subsequent + rproc_boot(), if needed. + +:: struct rproc *rproc_get_by_phandle(phandle phandle) - - Find an rproc handle using a device tree phandle. Returns the rproc - handle on success, and NULL on failure. This function increments - the remote processor's refcount, so always use rproc_put() to - decrement it back once rproc isn't needed anymore. -3. Typical usage +Find an rproc handle using a device tree phandle. Returns the rproc +handle on success, and NULL on failure. This function increments +the remote processor's refcount, so always use rproc_put() to +decrement it back once rproc isn't needed anymore. + +Typical usage +============= -#include <linux/remoteproc.h> +:: -/* in case we were given a valid 'rproc' handle */ -int dummy_rproc_example(struct rproc *my_rproc) -{ + #include <linux/remoteproc.h> + + /* in case we were given a valid 'rproc' handle */ + int dummy_rproc_example(struct rproc *my_rproc) + { int ret; /* let's power on and boot our remote processor */ @@ -80,84 +101,111 @@ int dummy_rproc_example(struct rproc *my_rproc) /* let's shut it down now */ rproc_shutdown(my_rproc); -} + } + +API for implementors +==================== -4. API for implementors +:: struct rproc *rproc_alloc(struct device *dev, const char *name, const struct rproc_ops *ops, const char *firmware, int len) - - Allocate a new remote processor handle, but don't register - it yet. Required parameters are the underlying device, the - name of this remote processor, platform-specific ops handlers, - the name of the firmware to boot this rproc with, and the - length of private data needed by the allocating rproc driver (in bytes). - - This function should be used by rproc implementations during - initialization of the remote processor. - After creating an rproc handle using this function, and when ready, - implementations should then call rproc_add() to complete - the registration of the remote processor. - On success, the new rproc is returned, and on failure, NULL. - - Note: _never_ directly deallocate @rproc, even if it was not registered - yet. Instead, when you need to unroll rproc_alloc(), use rproc_free(). + +Allocate a new remote processor handle, but don't register +it yet. Required parameters are the underlying device, the +name of this remote processor, platform-specific ops handlers, +the name of the firmware to boot this rproc with, and the +length of private data needed by the allocating rproc driver (in bytes). + +This function should be used by rproc implementations during +initialization of the remote processor. + +After creating an rproc handle using this function, and when ready, +implementations should then call rproc_add() to complete +the registration of the remote processor. + +On success, the new rproc is returned, and on failure, NULL. + +.. note:: + + **never** directly deallocate @rproc, even if it was not registered + yet. Instead, when you need to unroll rproc_alloc(), use rproc_free(). + +:: void rproc_free(struct rproc *rproc) - - Free an rproc handle that was allocated by rproc_alloc. - This function essentially unrolls rproc_alloc(), by decrementing the - rproc's refcount. It doesn't directly free rproc; that would happen - only if there are no other references to rproc and its refcount now - dropped to zero. + +Free an rproc handle that was allocated by rproc_alloc. + +This function essentially unrolls rproc_alloc(), by decrementing the +rproc's refcount. It doesn't directly free rproc; that would happen +only if there are no other references to rproc and its refcount now +dropped to zero. + +:: int rproc_add(struct rproc *rproc) - - Register @rproc with the remoteproc framework, after it has been - allocated with rproc_alloc(). - This is called by the platform-specific rproc implementation, whenever - a new remote processor device is probed. - Returns 0 on success and an appropriate error code otherwise. - Note: this function initiates an asynchronous firmware loading - context, which will look for virtio devices supported by the rproc's - firmware. - If found, those virtio devices will be created and added, so as a result - of registering this remote processor, additional virtio drivers might get - probed. + +Register @rproc with the remoteproc framework, after it has been +allocated with rproc_alloc(). + +This is called by the platform-specific rproc implementation, whenever +a new remote processor device is probed. + +Returns 0 on success and an appropriate error code otherwise. +Note: this function initiates an asynchronous firmware loading +context, which will look for virtio devices supported by the rproc's +firmware. + +If found, those virtio devices will be created and added, so as a result +of registering this remote processor, additional virtio drivers might get +probed. + +:: int rproc_del(struct rproc *rproc) - - Unroll rproc_add(). - This function should be called when the platform specific rproc - implementation decides to remove the rproc device. it should - _only_ be called if a previous invocation of rproc_add() - has completed successfully. - After rproc_del() returns, @rproc is still valid, and its - last refcount should be decremented by calling rproc_free(). +Unroll rproc_add(). + +This function should be called when the platform specific rproc +implementation decides to remove the rproc device. it should +_only_ be called if a previous invocation of rproc_add() +has completed successfully. - Returns 0 on success and -EINVAL if @rproc isn't valid. +After rproc_del() returns, @rproc is still valid, and its +last refcount should be decremented by calling rproc_free(). + +Returns 0 on success and -EINVAL if @rproc isn't valid. + +:: void rproc_report_crash(struct rproc *rproc, enum rproc_crash_type type) - - Report a crash in a remoteproc - This function must be called every time a crash is detected by the - platform specific rproc implementation. This should not be called from a - non-remoteproc driver. This function can be called from atomic/interrupt - context. -5. Implementation callbacks +Report a crash in a remoteproc + +This function must be called every time a crash is detected by the +platform specific rproc implementation. This should not be called from a +non-remoteproc driver. This function can be called from atomic/interrupt +context. + +Implementation callbacks +======================== These callbacks should be provided by platform-specific remoteproc -drivers: - -/** - * struct rproc_ops - platform-specific device handlers - * @start: power on the device and boot it - * @stop: power off the device - * @kick: kick a virtqueue (virtqueue id given as a parameter) - */ -struct rproc_ops { +drivers:: + + /** + * struct rproc_ops - platform-specific device handlers + * @start: power on the device and boot it + * @stop: power off the device + * @kick: kick a virtqueue (virtqueue id given as a parameter) + */ + struct rproc_ops { int (*start)(struct rproc *rproc); int (*stop)(struct rproc *rproc); void (*kick)(struct rproc *rproc, int vqid); -}; + }; Every remoteproc implementation should at least provide the ->start and ->stop handlers. If rpmsg/virtio functionality is also desired, then the ->kick handler @@ -179,7 +227,8 @@ the exact virtqueue index to look in is optional: it is easy (and not too expensive) to go through the existing virtqueues and look for new buffers in the used rings. -6. Binary Firmware Structure +Binary Firmware Structure +========================= At this point remoteproc only supports ELF32 firmware binaries. However, it is quite expected that other platforms/devices which we'd want to @@ -207,43 +256,43 @@ resource entries that publish the existence of supported features or configurations by the remote processor, such as trace buffers and supported virtio devices (and their configurations). -The resource table begins with this header: - -/** - * struct resource_table - firmware resource table header - * @ver: version number - * @num: number of resource entries - * @reserved: reserved (must be zero) - * @offset: array of offsets pointing at the various resource entries - * - * The header of the resource table, as expressed by this structure, - * contains a version number (should we need to change this format in the - * future), the number of available resource entries, and their offsets - * in the table. - */ -struct resource_table { +The resource table begins with this header:: + + /** + * struct resource_table - firmware resource table header + * @ver: version number + * @num: number of resource entries + * @reserved: reserved (must be zero) + * @offset: array of offsets pointing at the various resource entries + * + * The header of the resource table, as expressed by this structure, + * contains a version number (should we need to change this format in the + * future), the number of available resource entries, and their offsets + * in the table. + */ + struct resource_table { u32 ver; u32 num; u32 reserved[2]; u32 offset[0]; -} __packed; + } __packed; Immediately following this header are the resource entries themselves, -each of which begins with the following resource entry header: - -/** - * struct fw_rsc_hdr - firmware resource entry header - * @type: resource type - * @data: resource data - * - * Every resource entry begins with a 'struct fw_rsc_hdr' header providing - * its @type. The content of the entry itself will immediately follow - * this header, and it should be parsed according to the resource type. - */ -struct fw_rsc_hdr { +each of which begins with the following resource entry header:: + + /** + * struct fw_rsc_hdr - firmware resource entry header + * @type: resource type + * @data: resource data + * + * Every resource entry begins with a 'struct fw_rsc_hdr' header providing + * its @type. The content of the entry itself will immediately follow + * this header, and it should be parsed according to the resource type. + */ + struct fw_rsc_hdr { u32 type; u8 data[0]; -} __packed; + } __packed; Some resources entries are mere announcements, where the host is informed of specific remoteproc configuration. Other entries require the host to @@ -252,32 +301,32 @@ is expected, where the firmware requests a resource, and once allocated, the host should provide back its details (e.g. address of an allocated memory region). -Here are the various resource types that are currently supported: - -/** - * enum fw_resource_type - types of resource entries - * - * @RSC_CARVEOUT: request for allocation of a physically contiguous - * memory region. - * @RSC_DEVMEM: request to iommu_map a memory-based peripheral. - * @RSC_TRACE: announces the availability of a trace buffer into which - * the remote processor will be writing logs. - * @RSC_VDEV: declare support for a virtio device, and serve as its - * virtio header. - * @RSC_LAST: just keep this one at the end - * - * Please note that these values are used as indices to the rproc_handle_rsc - * lookup table, so please keep them sane. Moreover, @RSC_LAST is used to - * check the validity of an index before the lookup table is accessed, so - * please update it as needed. - */ -enum fw_resource_type { +Here are the various resource types that are currently supported:: + + /** + * enum fw_resource_type - types of resource entries + * + * @RSC_CARVEOUT: request for allocation of a physically contiguous + * memory region. + * @RSC_DEVMEM: request to iommu_map a memory-based peripheral. + * @RSC_TRACE: announces the availability of a trace buffer into which + * the remote processor will be writing logs. + * @RSC_VDEV: declare support for a virtio device, and serve as its + * virtio header. + * @RSC_LAST: just keep this one at the end + * + * Please note that these values are used as indices to the rproc_handle_rsc + * lookup table, so please keep them sane. Moreover, @RSC_LAST is used to + * check the validity of an index before the lookup table is accessed, so + * please update it as needed. + */ + enum fw_resource_type { RSC_CARVEOUT = 0, RSC_DEVMEM = 1, RSC_TRACE = 2, RSC_VDEV = 3, RSC_LAST = 4, -}; + }; For more details regarding a specific resource type, please see its dedicated structure in include/linux/remoteproc.h. @@ -286,7 +335,8 @@ We also expect that platform-specific resource entries will show up at some point. When that happens, we could easily add a new RSC_PLATFORM type, and hand those resources to the platform-specific rproc driver to handle. -7. Virtio and remoteproc +Virtio and remoteproc +===================== The firmware should provide remoteproc information about virtio devices that it supports, and their configurations: a RSC_VDEV resource entry diff --git a/Documentation/rfkill.txt b/Documentation/rfkill.txt index 8c174063b3f0..a289285d2412 100644 --- a/Documentation/rfkill.txt +++ b/Documentation/rfkill.txt @@ -1,13 +1,13 @@ +=============================== rfkill - RF kill switch support =============================== -1. Introduction -2. Implementation details -3. Kernel API -4. Userspace support +.. contents:: + :depth: 2 -1. Introduction +Introduction +============ The rfkill subsystem provides a generic interface to disabling any radio transmitter in the system. When a transmitter is blocked, it shall not @@ -21,17 +21,24 @@ aircraft. The rfkill subsystem has a concept of "hard" and "soft" block, which differ little in their meaning (block == transmitters off) but rather in whether they can be changed or not: - - hard block: read-only radio block that cannot be overridden by software - - soft block: writable radio block (need not be readable) that is set by - the system software. + + - hard block + read-only radio block that cannot be overridden by software + + - soft block + writable radio block (need not be readable) that is set by + the system software. The rfkill subsystem has two parameters, rfkill.default_state and -rfkill.master_switch_mode, which are documented in admin-guide/kernel-parameters.rst. +rfkill.master_switch_mode, which are documented in +admin-guide/kernel-parameters.rst. -2. Implementation details +Implementation details +====================== The rfkill subsystem is composed of three main components: + * the rfkill core, * the deprecated rfkill-input module (an input layer handler, being replaced by userspace policy code) and @@ -55,7 +62,8 @@ use the return value of rfkill_set_hw_state() unless the hardware actually keeps track of soft and hard block separately. -3. Kernel API +Kernel API +========== Drivers for radio transmitters normally implement an rfkill driver. @@ -69,7 +77,7 @@ For some platforms, it is possible that the hardware state changes during suspend/hibernation, in which case it will be necessary to update the rfkill core with the current state is at resume time. -To create an rfkill driver, driver's Kconfig needs to have +To create an rfkill driver, driver's Kconfig needs to have:: depends on RFKILL || !RFKILL @@ -87,7 +95,8 @@ RFKill provides per-switch LED triggers, which can be used to drive LEDs according to the switch state (LED_FULL when blocked, LED_OFF otherwise). -5. Userspace support +Userspace support +================= The recommended userspace interface to use is /dev/rfkill, which is a misc character device that allows userspace to obtain and set the state of rfkill @@ -112,11 +121,11 @@ rfkill core framework. Additionally, each rfkill device is registered in sysfs and emits uevents. rfkill devices issue uevents (with an action of "change"), with the following -environment variables set: +environment variables set:: -RFKILL_NAME -RFKILL_STATE -RFKILL_TYPE + RFKILL_NAME + RFKILL_STATE + RFKILL_TYPE The contents of these variables corresponds to the "name", "state" and "type" sysfs files explained above. diff --git a/Documentation/robust-futex-ABI.txt b/Documentation/robust-futex-ABI.txt index 16eb314f56cc..8a5d34abf726 100644 --- a/Documentation/robust-futex-ABI.txt +++ b/Documentation/robust-futex-ABI.txt @@ -1,7 +1,9 @@ -Started by Paul Jackson <pj@sgi.com> - +==================== The robust futex ABI --------------------- +==================== + +:Author: Started by Paul Jackson <pj@sgi.com> + Robust_futexes provide a mechanism that is used in addition to normal futexes, for kernel assist of cleanup of held locks on task exit. @@ -32,7 +34,7 @@ probably causing deadlock or other such failure of the other threads waiting on the same locks. A thread that anticipates possibly using robust_futexes should first -issue the system call: +issue the system call:: asmlinkage long sys_set_robust_list(struct robust_list_head __user *head, size_t len); @@ -91,7 +93,7 @@ that lock using the futex mechanism. When a thread has invoked the above system call to indicate it anticipates using robust_futexes, the kernel stores the passed in 'head' pointer for that task. The task may retrieve that value later on by -using the system call: +using the system call:: asmlinkage long sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr, @@ -135,6 +137,7 @@ manipulating this list), the user code must observe the following protocol on 'lock entry' insertion and removal: On insertion: + 1) set the 'list_op_pending' word to the address of the 'lock entry' to be inserted, 2) acquire the futex lock, @@ -143,6 +146,7 @@ On insertion: 4) clear the 'list_op_pending' word. On removal: + 1) set the 'list_op_pending' word to the address of the 'lock entry' to be removed, 2) remove the lock entry for this lock from the 'head' list, diff --git a/Documentation/robust-futexes.txt b/Documentation/robust-futexes.txt index 61c22d608759..6c42c75103eb 100644 --- a/Documentation/robust-futexes.txt +++ b/Documentation/robust-futexes.txt @@ -1,4 +1,8 @@ -Started by: Ingo Molnar <mingo@redhat.com> +======================================== +A description of what robust futexes are +======================================== + +:Started by: Ingo Molnar <mingo@redhat.com> Background ---------- @@ -163,7 +167,7 @@ Implementation details ---------------------- The patch adds two new syscalls: one to register the userspace list, and -one to query the registered list pointer: +one to query the registered list pointer:: asmlinkage long sys_set_robust_list(struct robust_list_head __user *head, @@ -185,7 +189,7 @@ straightforward. The kernel doesn't have any internal distinction between robust and normal futexes. If a futex is found to be held at exit time, the kernel sets the -following bit of the futex word: +following bit of the futex word:: #define FUTEX_OWNER_DIED 0x40000000 @@ -193,7 +197,7 @@ and wakes up the next futex waiter (if any). User-space does the rest of the cleanup. Otherwise, robust futexes are acquired by glibc by putting the TID into -the futex field atomically. Waiters set the FUTEX_WAITERS bit: +the futex field atomically. Waiters set the FUTEX_WAITERS bit:: #define FUTEX_WAITERS 0x80000000 diff --git a/Documentation/rpmsg.txt b/Documentation/rpmsg.txt index a95e36a43288..24b7a9e1a5f9 100644 --- a/Documentation/rpmsg.txt +++ b/Documentation/rpmsg.txt @@ -1,10 +1,15 @@ +============================================ Remote Processor Messaging (rpmsg) Framework +============================================ -Note: this document describes the rpmsg bus and how to write rpmsg drivers. -To learn how to add rpmsg support for new platforms, check out remoteproc.txt -(also a resident of Documentation/). +.. note:: -1. Introduction + This document describes the rpmsg bus and how to write rpmsg drivers. + To learn how to add rpmsg support for new platforms, check out remoteproc.txt + (also a resident of Documentation/). + +Introduction +============ Modern SoCs typically employ heterogeneous remote processor devices in asymmetric multiprocessing (AMP) configurations, which may be running @@ -58,170 +63,222 @@ to their destination address (this is done by invoking the driver's rx handler with the payload of the inbound message). -2. User API +User API +======== + +:: int rpmsg_send(struct rpmsg_channel *rpdev, void *data, int len); - - sends a message across to the remote processor on a given channel. - The caller should specify the channel, the data it wants to send, - and its length (in bytes). The message will be sent on the specified - channel, i.e. its source and destination address fields will be - set to the channel's src and dst addresses. - - In case there are no TX buffers available, the function will block until - one becomes available (i.e. until the remote processor consumes - a tx buffer and puts it back on virtio's used descriptor ring), - or a timeout of 15 seconds elapses. When the latter happens, - -ERESTARTSYS is returned. - The function can only be called from a process context (for now). - Returns 0 on success and an appropriate error value on failure. + +sends a message across to the remote processor on a given channel. +The caller should specify the channel, the data it wants to send, +and its length (in bytes). The message will be sent on the specified +channel, i.e. its source and destination address fields will be +set to the channel's src and dst addresses. + +In case there are no TX buffers available, the function will block until +one becomes available (i.e. until the remote processor consumes +a tx buffer and puts it back on virtio's used descriptor ring), +or a timeout of 15 seconds elapses. When the latter happens, +-ERESTARTSYS is returned. + +The function can only be called from a process context (for now). +Returns 0 on success and an appropriate error value on failure. + +:: int rpmsg_sendto(struct rpmsg_channel *rpdev, void *data, int len, u32 dst); - - sends a message across to the remote processor on a given channel, - to a destination address provided by the caller. - The caller should specify the channel, the data it wants to send, - its length (in bytes), and an explicit destination address. - The message will then be sent to the remote processor to which the - channel belongs, using the channel's src address, and the user-provided - dst address (thus the channel's dst address will be ignored). - - In case there are no TX buffers available, the function will block until - one becomes available (i.e. until the remote processor consumes - a tx buffer and puts it back on virtio's used descriptor ring), - or a timeout of 15 seconds elapses. When the latter happens, - -ERESTARTSYS is returned. - The function can only be called from a process context (for now). - Returns 0 on success and an appropriate error value on failure. + +sends a message across to the remote processor on a given channel, +to a destination address provided by the caller. + +The caller should specify the channel, the data it wants to send, +its length (in bytes), and an explicit destination address. + +The message will then be sent to the remote processor to which the +channel belongs, using the channel's src address, and the user-provided +dst address (thus the channel's dst address will be ignored). + +In case there are no TX buffers available, the function will block until +one becomes available (i.e. until the remote processor consumes +a tx buffer and puts it back on virtio's used descriptor ring), +or a timeout of 15 seconds elapses. When the latter happens, +-ERESTARTSYS is returned. + +The function can only be called from a process context (for now). +Returns 0 on success and an appropriate error value on failure. + +:: int rpmsg_send_offchannel(struct rpmsg_channel *rpdev, u32 src, u32 dst, void *data, int len); - - sends a message across to the remote processor, using the src and dst - addresses provided by the user. - The caller should specify the channel, the data it wants to send, - its length (in bytes), and explicit source and destination addresses. - The message will then be sent to the remote processor to which the - channel belongs, but the channel's src and dst addresses will be - ignored (and the user-provided addresses will be used instead). - - In case there are no TX buffers available, the function will block until - one becomes available (i.e. until the remote processor consumes - a tx buffer and puts it back on virtio's used descriptor ring), - or a timeout of 15 seconds elapses. When the latter happens, - -ERESTARTSYS is returned. - The function can only be called from a process context (for now). - Returns 0 on success and an appropriate error value on failure. + + +sends a message across to the remote processor, using the src and dst +addresses provided by the user. + +The caller should specify the channel, the data it wants to send, +its length (in bytes), and explicit source and destination addresses. +The message will then be sent to the remote processor to which the +channel belongs, but the channel's src and dst addresses will be +ignored (and the user-provided addresses will be used instead). + +In case there are no TX buffers available, the function will block until +one becomes available (i.e. until the remote processor consumes +a tx buffer and puts it back on virtio's used descriptor ring), +or a timeout of 15 seconds elapses. When the latter happens, +-ERESTARTSYS is returned. + +The function can only be called from a process context (for now). +Returns 0 on success and an appropriate error value on failure. + +:: int rpmsg_trysend(struct rpmsg_channel *rpdev, void *data, int len); - - sends a message across to the remote processor on a given channel. - The caller should specify the channel, the data it wants to send, - and its length (in bytes). The message will be sent on the specified - channel, i.e. its source and destination address fields will be - set to the channel's src and dst addresses. - In case there are no TX buffers available, the function will immediately - return -ENOMEM without waiting until one becomes available. - The function can only be called from a process context (for now). - Returns 0 on success and an appropriate error value on failure. +sends a message across to the remote processor on a given channel. +The caller should specify the channel, the data it wants to send, +and its length (in bytes). The message will be sent on the specified +channel, i.e. its source and destination address fields will be +set to the channel's src and dst addresses. + +In case there are no TX buffers available, the function will immediately +return -ENOMEM without waiting until one becomes available. + +The function can only be called from a process context (for now). +Returns 0 on success and an appropriate error value on failure. + +:: int rpmsg_trysendto(struct rpmsg_channel *rpdev, void *data, int len, u32 dst) - - sends a message across to the remote processor on a given channel, - to a destination address provided by the user. - The user should specify the channel, the data it wants to send, - its length (in bytes), and an explicit destination address. - The message will then be sent to the remote processor to which the - channel belongs, using the channel's src address, and the user-provided - dst address (thus the channel's dst address will be ignored). - - In case there are no TX buffers available, the function will immediately - return -ENOMEM without waiting until one becomes available. - The function can only be called from a process context (for now). - Returns 0 on success and an appropriate error value on failure. + + +sends a message across to the remote processor on a given channel, +to a destination address provided by the user. + +The user should specify the channel, the data it wants to send, +its length (in bytes), and an explicit destination address. + +The message will then be sent to the remote processor to which the +channel belongs, using the channel's src address, and the user-provided +dst address (thus the channel's dst address will be ignored). + +In case there are no TX buffers available, the function will immediately +return -ENOMEM without waiting until one becomes available. + +The function can only be called from a process context (for now). +Returns 0 on success and an appropriate error value on failure. + +:: int rpmsg_trysend_offchannel(struct rpmsg_channel *rpdev, u32 src, u32 dst, void *data, int len); - - sends a message across to the remote processor, using source and - destination addresses provided by the user. - The user should specify the channel, the data it wants to send, - its length (in bytes), and explicit source and destination addresses. - The message will then be sent to the remote processor to which the - channel belongs, but the channel's src and dst addresses will be - ignored (and the user-provided addresses will be used instead). - - In case there are no TX buffers available, the function will immediately - return -ENOMEM without waiting until one becomes available. - The function can only be called from a process context (for now). - Returns 0 on success and an appropriate error value on failure. + + +sends a message across to the remote processor, using source and +destination addresses provided by the user. + +The user should specify the channel, the data it wants to send, +its length (in bytes), and explicit source and destination addresses. +The message will then be sent to the remote processor to which the +channel belongs, but the channel's src and dst addresses will be +ignored (and the user-provided addresses will be used instead). + +In case there are no TX buffers available, the function will immediately +return -ENOMEM without waiting until one becomes available. + +The function can only be called from a process context (for now). +Returns 0 on success and an appropriate error value on failure. + +:: struct rpmsg_endpoint *rpmsg_create_ept(struct rpmsg_channel *rpdev, void (*cb)(struct rpmsg_channel *, void *, int, void *, u32), void *priv, u32 addr); - - every rpmsg address in the system is bound to an rx callback (so when - inbound messages arrive, they are dispatched by the rpmsg bus using the - appropriate callback handler) by means of an rpmsg_endpoint struct. - - This function allows drivers to create such an endpoint, and by that, - bind a callback, and possibly some private data too, to an rpmsg address - (either one that is known in advance, or one that will be dynamically - assigned for them). - - Simple rpmsg drivers need not call rpmsg_create_ept, because an endpoint - is already created for them when they are probed by the rpmsg bus - (using the rx callback they provide when they registered to the rpmsg bus). - - So things should just work for simple drivers: they already have an - endpoint, their rx callback is bound to their rpmsg address, and when - relevant inbound messages arrive (i.e. messages which their dst address - equals to the src address of their rpmsg channel), the driver's handler - is invoked to process it. - - That said, more complicated drivers might do need to allocate - additional rpmsg addresses, and bind them to different rx callbacks. - To accomplish that, those drivers need to call this function. - Drivers should provide their channel (so the new endpoint would bind - to the same remote processor their channel belongs to), an rx callback - function, an optional private data (which is provided back when the - rx callback is invoked), and an address they want to bind with the - callback. If addr is RPMSG_ADDR_ANY, then rpmsg_create_ept will - dynamically assign them an available rpmsg address (drivers should have - a very good reason why not to always use RPMSG_ADDR_ANY here). - - Returns a pointer to the endpoint on success, or NULL on error. + +every rpmsg address in the system is bound to an rx callback (so when +inbound messages arrive, they are dispatched by the rpmsg bus using the +appropriate callback handler) by means of an rpmsg_endpoint struct. + +This function allows drivers to create such an endpoint, and by that, +bind a callback, and possibly some private data too, to an rpmsg address +(either one that is known in advance, or one that will be dynamically +assigned for them). + +Simple rpmsg drivers need not call rpmsg_create_ept, because an endpoint +is already created for them when they are probed by the rpmsg bus +(using the rx callback they provide when they registered to the rpmsg bus). + +So things should just work for simple drivers: they already have an +endpoint, their rx callback is bound to their rpmsg address, and when +relevant inbound messages arrive (i.e. messages which their dst address +equals to the src address of their rpmsg channel), the driver's handler +is invoked to process it. + +That said, more complicated drivers might do need to allocate +additional rpmsg addresses, and bind them to different rx callbacks. +To accomplish that, those drivers need to call this function. +Drivers should provide their channel (so the new endpoint would bind +to the same remote processor their channel belongs to), an rx callback +function, an optional private data (which is provided back when the +rx callback is invoked), and an address they want to bind with the +callback. If addr is RPMSG_ADDR_ANY, then rpmsg_create_ept will +dynamically assign them an available rpmsg address (drivers should have +a very good reason why not to always use RPMSG_ADDR_ANY here). + +Returns a pointer to the endpoint on success, or NULL on error. + +:: void rpmsg_destroy_ept(struct rpmsg_endpoint *ept); - - destroys an existing rpmsg endpoint. user should provide a pointer - to an rpmsg endpoint that was previously created with rpmsg_create_ept(). + + +destroys an existing rpmsg endpoint. user should provide a pointer +to an rpmsg endpoint that was previously created with rpmsg_create_ept(). + +:: int register_rpmsg_driver(struct rpmsg_driver *rpdrv); - - registers an rpmsg driver with the rpmsg bus. user should provide - a pointer to an rpmsg_driver struct, which contains the driver's - ->probe() and ->remove() functions, an rx callback, and an id_table - specifying the names of the channels this driver is interested to - be probed with. + + +registers an rpmsg driver with the rpmsg bus. user should provide +a pointer to an rpmsg_driver struct, which contains the driver's +->probe() and ->remove() functions, an rx callback, and an id_table +specifying the names of the channels this driver is interested to +be probed with. + +:: void unregister_rpmsg_driver(struct rpmsg_driver *rpdrv); - - unregisters an rpmsg driver from the rpmsg bus. user should provide - a pointer to a previously-registered rpmsg_driver struct. - Returns 0 on success, and an appropriate error value on failure. -3. Typical usage +unregisters an rpmsg driver from the rpmsg bus. user should provide +a pointer to a previously-registered rpmsg_driver struct. +Returns 0 on success, and an appropriate error value on failure. + + +Typical usage +============= The following is a simple rpmsg driver, that sends an "hello!" message on probe(), and whenever it receives an incoming message, it dumps its content to the console. -#include <linux/kernel.h> -#include <linux/module.h> -#include <linux/rpmsg.h> +:: + + #include <linux/kernel.h> + #include <linux/module.h> + #include <linux/rpmsg.h> -static void rpmsg_sample_cb(struct rpmsg_channel *rpdev, void *data, int len, + static void rpmsg_sample_cb(struct rpmsg_channel *rpdev, void *data, int len, void *priv, u32 src) -{ + { print_hex_dump(KERN_INFO, "incoming message:", DUMP_PREFIX_NONE, 16, 1, data, len, true); -} + } -static int rpmsg_sample_probe(struct rpmsg_channel *rpdev) -{ + static int rpmsg_sample_probe(struct rpmsg_channel *rpdev) + { int err; dev_info(&rpdev->dev, "chnl: 0x%x -> 0x%x\n", rpdev->src, rpdev->dst); @@ -234,32 +291,35 @@ static int rpmsg_sample_probe(struct rpmsg_channel *rpdev) } return 0; -} + } -static void rpmsg_sample_remove(struct rpmsg_channel *rpdev) -{ + static void rpmsg_sample_remove(struct rpmsg_channel *rpdev) + { dev_info(&rpdev->dev, "rpmsg sample client driver is removed\n"); -} + } -static struct rpmsg_device_id rpmsg_driver_sample_id_table[] = { + static struct rpmsg_device_id rpmsg_driver_sample_id_table[] = { { .name = "rpmsg-client-sample" }, { }, -}; -MODULE_DEVICE_TABLE(rpmsg, rpmsg_driver_sample_id_table); + }; + MODULE_DEVICE_TABLE(rpmsg, rpmsg_driver_sample_id_table); -static struct rpmsg_driver rpmsg_sample_client = { + static struct rpmsg_driver rpmsg_sample_client = { .drv.name = KBUILD_MODNAME, .id_table = rpmsg_driver_sample_id_table, .probe = rpmsg_sample_probe, .callback = rpmsg_sample_cb, .remove = rpmsg_sample_remove, -}; -module_rpmsg_driver(rpmsg_sample_client); + }; + module_rpmsg_driver(rpmsg_sample_client); + +.. note:: -Note: a similar sample which can be built and loaded can be found -in samples/rpmsg/. + a similar sample which can be built and loaded can be found + in samples/rpmsg/. -4. Allocations of rpmsg channels: +Allocations of rpmsg channels +============================= At this point we only support dynamic allocations of rpmsg channels. diff --git a/Documentation/rtc.txt b/Documentation/rtc.txt index ddc366026e00..c0c977445fb9 100644 --- a/Documentation/rtc.txt +++ b/Documentation/rtc.txt @@ -1,6 +1,6 @@ - - Real Time Clock (RTC) Drivers for Linux - ======================================= +======================================= +Real Time Clock (RTC) Drivers for Linux +======================================= When Linux developers talk about a "Real Time Clock", they usually mean something that tracks wall clock time and is battery backed so that it @@ -32,8 +32,8 @@ only issue an alarm up to 24 hours in the future, other hardware may be able to schedule one any time in the upcoming century. - Old PC/AT-Compatible driver: /dev/rtc - -------------------------------------- +Old PC/AT-Compatible driver: /dev/rtc +-------------------------------------- All PCs (even Alpha machines) have a Real Time Clock built into them. Usually they are built into the chipset of the computer, but some may @@ -105,8 +105,8 @@ that will be using this driver. See the code at the end of this document. (The original /dev/rtc driver was written by Paul Gortmaker.) - New portable "RTC Class" drivers: /dev/rtcN - -------------------------------------------- +New portable "RTC Class" drivers: /dev/rtcN +-------------------------------------------- Because Linux supports many non-ACPI and non-PC platforms, some of which have more than one RTC style clock, it needed a more portable solution @@ -136,35 +136,39 @@ a high functionality RTC is integrated into the SOC. That system might read the system clock from the discrete RTC, but use the integrated one for all other tasks, because of its greater functionality. -SYSFS INTERFACE +SYSFS interface --------------- The sysfs interface under /sys/class/rtc/rtcN provides access to various rtc attributes without requiring the use of ioctls. All dates and times are in the RTC's timezone, rather than in system time. -date: RTC-provided date -hctosys: 1 if the RTC provided the system time at boot via the +================ ============================================================== +date RTC-provided date +hctosys 1 if the RTC provided the system time at boot via the CONFIG_RTC_HCTOSYS kernel option, 0 otherwise -max_user_freq: The maximum interrupt rate an unprivileged user may request +max_user_freq The maximum interrupt rate an unprivileged user may request from this RTC. -name: The name of the RTC corresponding to this sysfs directory -since_epoch: The number of seconds since the epoch according to the RTC -time: RTC-provided time -wakealarm: The time at which the clock will generate a system wakeup +name The name of the RTC corresponding to this sysfs directory +since_epoch The number of seconds since the epoch according to the RTC +time RTC-provided time +wakealarm The time at which the clock will generate a system wakeup event. This is a one shot wakeup event, so must be reset - after wake if a daily wakeup is required. Format is seconds since - the epoch by default, or if there's a leading +, seconds in the - future, or if there is a leading +=, seconds ahead of the current - alarm. -offset: The amount which the rtc clock has been adjusted in firmware. + after wake if a daily wakeup is required. Format is seconds + since the epoch by default, or if there's a leading +, seconds + in the future, or if there is a leading +=, seconds ahead of + the current alarm. +offset The amount which the rtc clock has been adjusted in firmware. Visible only if the driver supports clock offset adjustment. The unit is parts per billion, i.e. The number of clock ticks which are added to or removed from the rtc's base clock per billion ticks. A positive value makes a day pass more slowly, longer, and a negative value makes a day pass more quickly. +*/nvmem The non volatile storage exported as a raw file, as described + in Documentation/nvmem/nvmem.txt +================ ============================================================== -IOCTL INTERFACE +IOCTL interface --------------- The ioctl() calls supported by /dev/rtc are also supported by the RTC class diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt index cbc1b46cbf70..e89e36ec15a5 100644 --- a/Documentation/scheduler/sched-deadline.txt +++ b/Documentation/scheduler/sched-deadline.txt @@ -7,6 +7,8 @@ CONTENTS 0. WARNING 1. Overview 2. Scheduling algorithm + 2.1 Main algorithm + 2.2 Bandwidth reclaiming 3. Scheduling Real-Time Tasks 3.1 Definitions 3.2 Schedulability Analysis for Uniprocessor Systems @@ -44,6 +46,9 @@ CONTENTS 2. Scheduling algorithm ================== +2.1 Main algorithm +------------------ + SCHED_DEADLINE uses three parameters, named "runtime", "period", and "deadline", to schedule tasks. A SCHED_DEADLINE task should receive "runtime" microseconds of execution time every "period" microseconds, and @@ -113,6 +118,160 @@ CONTENTS remaining runtime = remaining runtime + runtime +2.2 Bandwidth reclaiming +------------------------ + + Bandwidth reclaiming for deadline tasks is based on the GRUB (Greedy + Reclamation of Unused Bandwidth) algorithm [15, 16, 17] and it is enabled + when flag SCHED_FLAG_RECLAIM is set. + + The following diagram illustrates the state names for tasks handled by GRUB: + + ------------ + (d) | Active | + ------------->| | + | | Contending | + | ------------ + | A | + ---------- | | + | | | | + | Inactive | |(b) | (a) + | | | | + ---------- | | + A | V + | ------------ + | | Active | + --------------| Non | + (c) | Contending | + ------------ + + A task can be in one of the following states: + + - ActiveContending: if it is ready for execution (or executing); + + - ActiveNonContending: if it just blocked and has not yet surpassed the 0-lag + time; + + - Inactive: if it is blocked and has surpassed the 0-lag time. + + State transitions: + + (a) When a task blocks, it does not become immediately inactive since its + bandwidth cannot be immediately reclaimed without breaking the + real-time guarantees. It therefore enters a transitional state called + ActiveNonContending. The scheduler arms the "inactive timer" to fire at + the 0-lag time, when the task's bandwidth can be reclaimed without + breaking the real-time guarantees. + + The 0-lag time for a task entering the ActiveNonContending state is + computed as + + (runtime * dl_period) + deadline - --------------------- + dl_runtime + + where runtime is the remaining runtime, while dl_runtime and dl_period + are the reservation parameters. + + (b) If the task wakes up before the inactive timer fires, the task re-enters + the ActiveContending state and the "inactive timer" is canceled. + In addition, if the task wakes up on a different runqueue, then + the task's utilization must be removed from the previous runqueue's active + utilization and must be added to the new runqueue's active utilization. + In order to avoid races between a task waking up on a runqueue while the + "inactive timer" is running on a different CPU, the "dl_non_contending" + flag is used to indicate that a task is not on a runqueue but is active + (so, the flag is set when the task blocks and is cleared when the + "inactive timer" fires or when the task wakes up). + + (c) When the "inactive timer" fires, the task enters the Inactive state and + its utilization is removed from the runqueue's active utilization. + + (d) When an inactive task wakes up, it enters the ActiveContending state and + its utilization is added to the active utilization of the runqueue where + it has been enqueued. + + For each runqueue, the algorithm GRUB keeps track of two different bandwidths: + + - Active bandwidth (running_bw): this is the sum of the bandwidths of all + tasks in active state (i.e., ActiveContending or ActiveNonContending); + + - Total bandwidth (this_bw): this is the sum of all tasks "belonging" to the + runqueue, including the tasks in Inactive state. + + + The algorithm reclaims the bandwidth of the tasks in Inactive state. + It does so by decrementing the runtime of the executing task Ti at a pace equal + to + + dq = -max{ Ui, (1 - Uinact) } dt + + where Uinact is the inactive utilization, computed as (this_bq - running_bw), + and Ui is the bandwidth of task Ti. + + + Let's now see a trivial example of two deadline tasks with runtime equal + to 4 and period equal to 8 (i.e., bandwidth equal to 0.5): + + A Task T1 + | + | | + | | + |-------- |---- + | | V + |---|---|---|---|---|---|---|---|--------->t + 0 1 2 3 4 5 6 7 8 + + + A Task T2 + | + | | + | | + | ------------------------| + | | V + |---|---|---|---|---|---|---|---|--------->t + 0 1 2 3 4 5 6 7 8 + + + A running_bw + | + 1 ----------------- ------ + | | | + 0.5- ----------------- + | | + |---|---|---|---|---|---|---|---|--------->t + 0 1 2 3 4 5 6 7 8 + + + - Time t = 0: + + Both tasks are ready for execution and therefore in ActiveContending state. + Suppose Task T1 is the first task to start execution. + Since there are no inactive tasks, its runtime is decreased as dq = -1 dt. + + - Time t = 2: + + Suppose that task T1 blocks + Task T1 therefore enters the ActiveNonContending state. Since its remaining + runtime is equal to 2, its 0-lag time is equal to t = 4. + Task T2 start execution, with runtime still decreased as dq = -1 dt since + there are no inactive tasks. + + - Time t = 4: + + This is the 0-lag time for Task T1. Since it didn't woken up in the + meantime, it enters the Inactive state. Its bandwidth is removed from + running_bw. + Task T2 continues its execution. However, its runtime is now decreased as + dq = - 0.5 dt because Uinact = 0.5. + Task T2 therefore reclaims the bandwidth unused by Task T1. + + - Time t = 8: + + Task T1 wakes up. It enters the ActiveContending state again, and the + running_bw is incremented. + + 3. Scheduling Real-Time Tasks ============================= @@ -330,6 +489,15 @@ CONTENTS 14 - J. Erickson, U. Devi and S. Baruah. Improved tardiness bounds for Global EDF. Proceedings of the 22nd Euromicro Conference on Real-Time Systems, 2010. + 15 - G. Lipari, S. Baruah, Greedy reclamation of unused bandwidth in + constant-bandwidth servers, 12th IEEE Euromicro Conference on Real-Time + Systems, 2000. + 16 - L. Abeni, J. Lelli, C. Scordino, L. Palopoli, Greedy CPU reclaiming for + SCHED DEADLINE. In Proceedings of the Real-Time Linux Workshop (RTLWS), + Dusseldorf, Germany, 2014. + 17 - L. Abeni, G. Lipari, A. Parri, Y. Sun, Multicore CPU reclaiming: parallel + or sequential?. In Proceedings of the 31st Annual ACM Symposium on Applied + Computing, 2016. 4. Bandwidth management diff --git a/Documentation/security/00-INDEX b/Documentation/security/00-INDEX deleted file mode 100644 index 45c82fd3e9d3..000000000000 --- a/Documentation/security/00-INDEX +++ /dev/null @@ -1,26 +0,0 @@ -00-INDEX - - this file. -LSM.txt - - description of the Linux Security Module framework. -SELinux.txt - - how to get started with the SELinux security enhancement. -Smack.txt - - documentation on the Smack Linux Security Module. -Yama.txt - - documentation on the Yama Linux Security Module. -apparmor.txt - - documentation on the AppArmor security extension. -credentials.txt - - documentation about credentials in Linux. -keys-ecryptfs.txt - - description of the encryption keys for the ecryptfs filesystem. -keys-request-key.txt - - description of the kernel key request service. -keys-trusted-encrypted.txt - - info on the Trusted and Encrypted keys in the kernel key ring service. -keys.txt - - description of the kernel key retention service. -tomoyo.txt - - documentation on the TOMOYO Linux Security Module. -IMA-templates.txt - - documentation on the template management mechanism for IMA. diff --git a/Documentation/security/IMA-templates.txt b/Documentation/security/IMA-templates.rst index 839b5dad9226..2cd0e273cc9a 100644 --- a/Documentation/security/IMA-templates.txt +++ b/Documentation/security/IMA-templates.rst @@ -1,9 +1,12 @@ - IMA Template Management Mechanism +================================= +IMA Template Management Mechanism +================================= -==== INTRODUCTION ==== +Introduction +============ -The original 'ima' template is fixed length, containing the filedata hash +The original ``ima`` template is fixed length, containing the filedata hash and pathname. The filedata hash is limited to 20 bytes (md5/sha1). The pathname is a null terminated string, limited to 255 characters. To overcome these limitations and to add additional file metadata, it is @@ -28,61 +31,64 @@ a new data type, developers define the field identifier and implement two functions, init() and show(), respectively to generate and display measurement entries. Defining a new template descriptor requires specifying the template format (a string of field identifiers separated -by the '|' character) through the 'ima_template_fmt' kernel command line +by the ``|`` character) through the ``ima_template_fmt`` kernel command line parameter. At boot time, IMA initializes the chosen template descriptor by translating the format into an array of template fields structures taken from the set of the supported ones. -After the initialization step, IMA will call ima_alloc_init_template() +After the initialization step, IMA will call ``ima_alloc_init_template()`` (new function defined within the patches for the new template management mechanism) to generate a new measurement entry by using the template descriptor chosen through the kernel configuration or through the newly -introduced 'ima_template' and 'ima_template_fmt' kernel command line parameters. +introduced ``ima_template`` and ``ima_template_fmt`` kernel command line parameters. It is during this phase that the advantages of the new architecture are clearly shown: the latter function will not contain specific code to handle -a given template but, instead, it simply calls the init() method of the template +a given template but, instead, it simply calls the ``init()`` method of the template fields associated to the chosen template descriptor and store the result (pointer to allocated data and data length) in the measurement entry structure. The same mechanism is employed to display measurements entries. -The functions ima[_ascii]_measurements_show() retrieve, for each entry, +The functions ``ima[_ascii]_measurements_show()`` retrieve, for each entry, the template descriptor used to produce that entry and call the show() method for each item of the array of template fields structures. -==== SUPPORTED TEMPLATE FIELDS AND DESCRIPTORS ==== +Supported Template Fields and Descriptors +========================================= In the following, there is the list of supported template fields -('<identifier>': description), that can be used to define new template +``('<identifier>': description)``, that can be used to define new template descriptors by adding their identifier to the format string (support for more data types will be added later): - 'd': the digest of the event (i.e. the digest of a measured file), - calculated with the SHA1 or MD5 hash algorithm; + calculated with the SHA1 or MD5 hash algorithm; - 'n': the name of the event (i.e. the file name), with size up to 255 bytes; - 'd-ng': the digest of the event, calculated with an arbitrary hash - algorithm (field format: [<hash algo>:]digest, where the digest - prefix is shown only if the hash algorithm is not SHA1 or MD5); + algorithm (field format: [<hash algo>:]digest, where the digest + prefix is shown only if the hash algorithm is not SHA1 or MD5); - 'n-ng': the name of the event, without size limitations; - 'sig': the file signature. Below, there is the list of defined template descriptors: - - "ima": its format is 'd|n'; - - "ima-ng" (default): its format is 'd-ng|n-ng'; - - "ima-sig": its format is 'd-ng|n-ng|sig'. + - "ima": its format is ``d|n``; + - "ima-ng" (default): its format is ``d-ng|n-ng``; + - "ima-sig": its format is ``d-ng|n-ng|sig``. -==== USE ==== + +Use +=== To specify the template descriptor to be used to generate measurement entries, currently the following methods are supported: - select a template descriptor among those supported in the kernel - configuration ('ima-ng' is the default choice); + configuration (``ima-ng`` is the default choice); - specify a template descriptor name from the kernel command line through - the 'ima_template=' parameter; + the ``ima_template=`` parameter; - register a new template descriptor with custom format through the kernel - command line parameter 'ima_template_fmt='. + command line parameter ``ima_template_fmt=``. diff --git a/Documentation/security/LSM.rst b/Documentation/security/LSM.rst new file mode 100644 index 000000000000..d75778b0fa10 --- /dev/null +++ b/Documentation/security/LSM.rst @@ -0,0 +1,14 @@ +================================= +Linux Security Module Development +================================= + +Based on https://lkml.org/lkml/2007/10/26/215, +a new LSM is accepted into the kernel when its intent (a description of +what it tries to protect against and in what cases one would expect to +use it) has been appropriately documented in ``Documentation/security/LSM``. +This allows an LSM's code to be easily compared to its goals, and so +that end users and distros can make a more informed decision about which +LSMs suit their requirements. + +For extensive documentation on the available LSM hook interfaces, please +see ``include/linux/lsm_hooks.h``. diff --git a/Documentation/security/conf.py b/Documentation/security/conf.py deleted file mode 100644 index 472fc9a8eb67..000000000000 --- a/Documentation/security/conf.py +++ /dev/null @@ -1,8 +0,0 @@ -project = "The kernel security subsystem manual" - -tags.add("subproject") - -latex_documents = [ - ('index', 'security.tex', project, - 'The kernel development community', 'manual'), -] diff --git a/Documentation/security/credentials.txt b/Documentation/security/credentials.rst index 86257052e31a..038a7e19eff9 100644 --- a/Documentation/security/credentials.txt +++ b/Documentation/security/credentials.rst @@ -1,38 +1,18 @@ - ==================== - CREDENTIALS IN LINUX - ==================== +==================== +Credentials in Linux +==================== By: David Howells <dhowells@redhat.com> -Contents: - - (*) Overview. - - (*) Types of credentials. - - (*) File markings. - - (*) Task credentials. +.. contents:: :local: - - Immutable credentials. - - Accessing task credentials. - - Accessing another task's credentials. - - Altering credentials. - - Managing credentials. - - (*) Open file credentials. - - (*) Overriding the VFS's use of credentials. - - -======== -OVERVIEW +Overview ======== There are several parts to the security check performed by Linux when one object acts upon another: - (1) Objects. + 1. Objects. Objects are things in the system that may be acted upon directly by userspace programs. Linux has a variety of actionable objects, including: @@ -48,7 +28,7 @@ object acts upon another: As a part of the description of all these objects there is a set of credentials. What's in the set depends on the type of object. - (2) Object ownership. + 2. Object ownership. Amongst the credentials of most objects, there will be a subset that indicates the ownership of that object. This is used for resource @@ -57,7 +37,7 @@ object acts upon another: In a standard UNIX filesystem, for instance, this will be defined by the UID marked on the inode. - (3) The objective context. + 3. The objective context. Also amongst the credentials of those objects, there will be a subset that indicates the 'objective context' of that object. This may or may not be @@ -67,7 +47,7 @@ object acts upon another: The objective context is used as part of the security calculation that is carried out when an object is acted upon. - (4) Subjects. + 4. Subjects. A subject is an object that is acting upon another object. @@ -77,10 +57,10 @@ object acts upon another: Objects other than tasks may under some circumstances also be subjects. For instance an open file may send SIGIO to a task using the UID and EUID - given to it by a task that called fcntl(F_SETOWN) upon it. In this case, + given to it by a task that called ``fcntl(F_SETOWN)`` upon it. In this case, the file struct will have a subjective context too. - (5) The subjective context. + 5. The subjective context. A subject has an additional interpretation of its credentials. A subset of its credentials forms the 'subjective context'. The subjective context @@ -92,7 +72,7 @@ object acts upon another: from the real UID and GID that normally form the objective context of the task. - (6) Actions. + 6. Actions. Linux has a number of actions available that a subject may perform upon an object. The set of actions available depends on the nature of the subject @@ -101,7 +81,7 @@ object acts upon another: Actions include reading, writing, creating and deleting files; forking or signalling and tracing tasks. - (7) Rules, access control lists and security calculations. + 7. Rules, access control lists and security calculations. When a subject acts upon an object, a security calculation is made. This involves taking the subjective context, the objective context and the @@ -111,7 +91,7 @@ object acts upon another: There are two main sources of rules: - (a) Discretionary access control (DAC): + a. Discretionary access control (DAC): Sometimes the object will include sets of rules as part of its description. This is an 'Access Control List' or 'ACL'. A Linux @@ -127,7 +107,7 @@ object acts upon another: A Linux file might also sport a POSIX ACL. This is a list of rules that grants various permissions to arbitrary subjects. - (b) Mandatory access control (MAC): + b. Mandatory access control (MAC): The system as a whole may have one or more sets of rules that get applied to all subjects and objects, regardless of their source. @@ -139,65 +119,65 @@ object acts upon another: that says that this action is either granted or denied. -==================== -TYPES OF CREDENTIALS +Types of Credentials ==================== The Linux kernel supports the following types of credentials: - (1) Traditional UNIX credentials. + 1. Traditional UNIX credentials. - Real User ID - Real Group ID + - Real User ID + - Real Group ID The UID and GID are carried by most, if not all, Linux objects, even if in some cases it has to be invented (FAT or CIFS files for example, which are derived from Windows). These (mostly) define the objective context of that object, with tasks being slightly different in some cases. - Effective, Saved and FS User ID - Effective, Saved and FS Group ID - Supplementary groups + - Effective, Saved and FS User ID + - Effective, Saved and FS Group ID + - Supplementary groups These are additional credentials used by tasks only. Usually, an EUID/EGID/GROUPS will be used as the subjective context, and real UID/GID will be used as the objective. For tasks, it should be noted that this is not always true. - (2) Capabilities. + 2. Capabilities. - Set of permitted capabilities - Set of inheritable capabilities - Set of effective capabilities - Capability bounding set + - Set of permitted capabilities + - Set of inheritable capabilities + - Set of effective capabilities + - Capability bounding set These are only carried by tasks. They indicate superior capabilities granted piecemeal to a task that an ordinary task wouldn't otherwise have. These are manipulated implicitly by changes to the traditional UNIX - credentials, but can also be manipulated directly by the capset() system - call. + credentials, but can also be manipulated directly by the ``capset()`` + system call. The permitted capabilities are those caps that the process might grant - itself to its effective or permitted sets through capset(). This + itself to its effective or permitted sets through ``capset()``. This inheritable set might also be so constrained. The effective capabilities are the ones that a task is actually allowed to make use of itself. The inheritable capabilities are the ones that may get passed across - execve(). + ``execve()``. The bounding set limits the capabilities that may be inherited across - execve(), especially when a binary is executed that will execute as UID 0. + ``execve()``, especially when a binary is executed that will execute as + UID 0. - (3) Secure management flags (securebits). + 3. Secure management flags (securebits). These are only carried by tasks. These govern the way the above credentials are manipulated and inherited over certain operations such as execve(). They aren't used directly as objective or subjective credentials. - (4) Keys and keyrings. + 4. Keys and keyrings. These are only carried by tasks. They carry and cache security tokens that don't fit into the other standard UNIX credentials. They are for @@ -218,7 +198,7 @@ The Linux kernel supports the following types of credentials: For more information on using keys, see Documentation/security/keys.txt. - (5) LSM + 5. LSM The Linux Security Module allows extra controls to be placed over the operations that a task may do. Currently Linux supports several LSM @@ -228,7 +208,7 @@ The Linux kernel supports the following types of credentials: rules (policies) that say what operations a task with one label may do to an object with another label. - (6) AF_KEY + 6. AF_KEY This is a socket-based approach to credential management for networking stacks [RFC 2367]. It isn't discussed by this document as it doesn't @@ -244,25 +224,19 @@ network filesystem where the credentials of the opened file should be presented to the server, regardless of who is actually doing a read or a write upon it. -============= -FILE MARKINGS +File Markings ============= Files on disk or obtained over the network may have annotations that form the objective security context of that file. Depending on the type of filesystem, this may include one or more of the following: - (*) UNIX UID, GID, mode; - - (*) Windows user ID; - - (*) Access control list; - - (*) LSM security label; - - (*) UNIX exec privilege escalation bits (SUID/SGID); - - (*) File capabilities exec privilege escalation bits. + * UNIX UID, GID, mode; + * Windows user ID; + * Access control list; + * LSM security label; + * UNIX exec privilege escalation bits (SUID/SGID); + * File capabilities exec privilege escalation bits. These are compared to the task's subjective security context, and certain operations allowed or disallowed as a result. In the case of execve(), the @@ -270,8 +244,7 @@ privilege escalation bits come into play, and may allow the resulting process extra privileges, based on the annotations on the executable file. -================ -TASK CREDENTIALS +Task Credentials ================ In Linux, all of a task's credentials are held in (uid, gid) or through @@ -282,20 +255,20 @@ task_struct. Once a set of credentials has been prepared and committed, it may not be changed, barring the following exceptions: - (1) its reference count may be changed; + 1. its reference count may be changed; - (2) the reference count on the group_info struct it points to may be changed; + 2. the reference count on the group_info struct it points to may be changed; - (3) the reference count on the security data it points to may be changed; + 3. the reference count on the security data it points to may be changed; - (4) the reference count on any keyrings it points to may be changed; + 4. the reference count on any keyrings it points to may be changed; - (5) any keyrings it points to may be revoked, expired or have their security - attributes changed; and + 5. any keyrings it points to may be revoked, expired or have their security + attributes changed; and - (6) the contents of any keyrings to which it points may be changed (the whole - point of keyrings being a shared set of credentials, modifiable by anyone - with appropriate access). + 6. the contents of any keyrings to which it points may be changed (the whole + point of keyrings being a shared set of credentials, modifiable by anyone + with appropriate access). To alter anything in the cred struct, the copy-and-replace principle must be adhered to. First take a copy, then alter the copy and then use RCU to change @@ -303,37 +276,37 @@ the task pointer to make it point to the new copy. There are wrappers to aid with this (see below). A task may only alter its _own_ credentials; it is no longer permitted for a -task to alter another's credentials. This means the capset() system call is no -longer permitted to take any PID other than the one of the current process. -Also keyctl_instantiate() and keyctl_negate() functions no longer permit -attachment to process-specific keyrings in the requesting process as the -instantiating process may need to create them. +task to alter another's credentials. This means the ``capset()`` system call +is no longer permitted to take any PID other than the one of the current +process. Also ``keyctl_instantiate()`` and ``keyctl_negate()`` functions no +longer permit attachment to process-specific keyrings in the requesting +process as the instantiating process may need to create them. -IMMUTABLE CREDENTIALS +Immutable Credentials --------------------- -Once a set of credentials has been made public (by calling commit_creds() for -example), it must be considered immutable, barring two exceptions: +Once a set of credentials has been made public (by calling ``commit_creds()`` +for example), it must be considered immutable, barring two exceptions: - (1) The reference count may be altered. + 1. The reference count may be altered. - (2) Whilst the keyring subscriptions of a set of credentials may not be - changed, the keyrings subscribed to may have their contents altered. + 2. Whilst the keyring subscriptions of a set of credentials may not be + changed, the keyrings subscribed to may have their contents altered. To catch accidental credential alteration at compile time, struct task_struct has _const_ pointers to its credential sets, as does struct file. Furthermore, -certain functions such as get_cred() and put_cred() operate on const pointers, -thus rendering casts unnecessary, but require to temporarily ditch the const -qualification to be able to alter the reference count. +certain functions such as ``get_cred()`` and ``put_cred()`` operate on const +pointers, thus rendering casts unnecessary, but require to temporarily ditch +the const qualification to be able to alter the reference count. -ACCESSING TASK CREDENTIALS +Accessing Task Credentials -------------------------- A task being able to alter only its own credentials permits the current process to read or replace its own credentials without the need for any form of locking -- which simplifies things greatly. It can just call: +-- which simplifies things greatly. It can just call:: const struct cred *current_cred() @@ -341,7 +314,7 @@ to get a pointer to its credentials structure, and it doesn't have to release it afterwards. There are convenience wrappers for retrieving specific aspects of a task's -credentials (the value is simply returned in each case): +credentials (the value is simply returned in each case):: uid_t current_uid(void) Current's real UID gid_t current_gid(void) Current's real GID @@ -354,7 +327,7 @@ credentials (the value is simply returned in each case): struct user_struct *current_user(void) Current's user account There are also convenience wrappers for retrieving specific associated pairs of -a task's credentials: +a task's credentials:: void current_uid_gid(uid_t *, gid_t *); void current_euid_egid(uid_t *, gid_t *); @@ -365,12 +338,12 @@ them from the current task's credentials. In addition, there is a function for obtaining a reference on the current -process's current set of credentials: +process's current set of credentials:: const struct cred *get_current_cred(void); and functions for getting references to one of the credentials that don't -actually live in struct cred: +actually live in struct cred:: struct user_struct *get_current_user(void); struct group_info *get_current_groups(void); @@ -378,22 +351,22 @@ actually live in struct cred: which get references to the current process's user accounting structure and supplementary groups list respectively. -Once a reference has been obtained, it must be released with put_cred(), -free_uid() or put_group_info() as appropriate. +Once a reference has been obtained, it must be released with ``put_cred()``, +``free_uid()`` or ``put_group_info()`` as appropriate. -ACCESSING ANOTHER TASK'S CREDENTIALS +Accessing Another Task's Credentials ------------------------------------ Whilst a task may access its own credentials without the need for locking, the same is not true of a task wanting to access another task's credentials. It -must use the RCU read lock and rcu_dereference(). +must use the RCU read lock and ``rcu_dereference()``. -The rcu_dereference() is wrapped by: +The ``rcu_dereference()`` is wrapped by:: const struct cred *__task_cred(struct task_struct *task); -This should be used inside the RCU read lock, as in the following example: +This should be used inside the RCU read lock, as in the following example:: void foo(struct task_struct *t, struct foo_data *f) { @@ -410,39 +383,40 @@ This should be used inside the RCU read lock, as in the following example: Should it be necessary to hold another task's credentials for a long period of time, and possibly to sleep whilst doing so, then the caller should get a -reference on them using: +reference on them using:: const struct cred *get_task_cred(struct task_struct *task); This does all the RCU magic inside of it. The caller must call put_cred() on the credentials so obtained when they're finished with. - [*] Note: The result of __task_cred() should not be passed directly to - get_cred() as this may race with commit_cred(). +.. note:: + The result of ``__task_cred()`` should not be passed directly to + ``get_cred()`` as this may race with ``commit_cred()``. There are a couple of convenience functions to access bits of another task's -credentials, hiding the RCU magic from the caller: +credentials, hiding the RCU magic from the caller:: uid_t task_uid(task) Task's real UID uid_t task_euid(task) Task's effective UID -If the caller is holding the RCU read lock at the time anyway, then: +If the caller is holding the RCU read lock at the time anyway, then:: __task_cred(task)->uid __task_cred(task)->euid should be used instead. Similarly, if multiple aspects of a task's credentials -need to be accessed, RCU read lock should be used, __task_cred() called, the -result stored in a temporary pointer and then the credential aspects called +need to be accessed, RCU read lock should be used, ``__task_cred()`` called, +the result stored in a temporary pointer and then the credential aspects called from that before dropping the lock. This prevents the potentially expensive RCU magic from being invoked multiple times. Should some other single aspect of another task's credentials need to be -accessed, then this can be used: +accessed, then this can be used:: task_cred_xxx(task, member) -where 'member' is a non-pointer member of the cred struct. For instance: +where 'member' is a non-pointer member of the cred struct. For instance:: uid_t task_cred_xxx(task, suid); @@ -451,7 +425,7 @@ magic. This may not be used for pointer members as what they point to may disappear the moment the RCU read lock is dropped. -ALTERING CREDENTIALS +Altering Credentials -------------------- As previously mentioned, a task may only alter its own credentials, and may not @@ -459,7 +433,7 @@ alter those of another task. This means that it doesn't need to use any locking to alter its own credentials. To alter the current process's credentials, a function should first prepare a -new set of credentials by calling: +new set of credentials by calling:: struct cred *prepare_creds(void); @@ -467,9 +441,10 @@ this locks current->cred_replace_mutex and then allocates and constructs a duplicate of the current process's credentials, returning with the mutex still held if successful. It returns NULL if not successful (out of memory). -The mutex prevents ptrace() from altering the ptrace state of a process whilst -security checks on credentials construction and changing is taking place as -the ptrace state may alter the outcome, particularly in the case of execve(). +The mutex prevents ``ptrace()`` from altering the ptrace state of a process +whilst security checks on credentials construction and changing is taking place +as the ptrace state may alter the outcome, particularly in the case of +``execve()``. The new credentials set should be altered appropriately, and any security checks and hooks done. Both the current and the proposed sets of credentials @@ -478,36 +453,37 @@ still at this point. When the credential set is ready, it should be committed to the current process -by calling: +by calling:: int commit_creds(struct cred *new); This will alter various aspects of the credentials and the process, giving the -LSM a chance to do likewise, then it will use rcu_assign_pointer() to actually -commit the new credentials to current->cred, it will release -current->cred_replace_mutex to allow ptrace() to take place, and it will notify -the scheduler and others of the changes. +LSM a chance to do likewise, then it will use ``rcu_assign_pointer()`` to +actually commit the new credentials to ``current->cred``, it will release +``current->cred_replace_mutex`` to allow ``ptrace()`` to take place, and it +will notify the scheduler and others of the changes. This function is guaranteed to return 0, so that it can be tail-called at the -end of such functions as sys_setresuid(). +end of such functions as ``sys_setresuid()``. Note that this function consumes the caller's reference to the new credentials. -The caller should _not_ call put_cred() on the new credentials afterwards. +The caller should _not_ call ``put_cred()`` on the new credentials afterwards. Furthermore, once this function has been called on a new set of credentials, those credentials may _not_ be changed further. -Should the security checks fail or some other error occur after prepare_creds() -has been called, then the following function should be invoked: +Should the security checks fail or some other error occur after +``prepare_creds()`` has been called, then the following function should be +invoked:: void abort_creds(struct cred *new); -This releases the lock on current->cred_replace_mutex that prepare_creds() got -and then releases the new credentials. +This releases the lock on ``current->cred_replace_mutex`` that +``prepare_creds()`` got and then releases the new credentials. -A typical credentials alteration function would look something like this: +A typical credentials alteration function would look something like this:: int alter_suid(uid_t suid) { @@ -529,53 +505,50 @@ A typical credentials alteration function would look something like this: } -MANAGING CREDENTIALS +Managing Credentials -------------------- There are some functions to help manage credentials: - (*) void put_cred(const struct cred *cred); + - ``void put_cred(const struct cred *cred);`` This releases a reference to the given set of credentials. If the reference count reaches zero, the credentials will be scheduled for destruction by the RCU system. - (*) const struct cred *get_cred(const struct cred *cred); + - ``const struct cred *get_cred(const struct cred *cred);`` This gets a reference on a live set of credentials, returning a pointer to that set of credentials. - (*) struct cred *get_new_cred(struct cred *cred); + - ``struct cred *get_new_cred(struct cred *cred);`` This gets a reference on a set of credentials that is under construction and is thus still mutable, returning a pointer to that set of credentials. -===================== -OPEN FILE CREDENTIALS +Open File Credentials ===================== When a new file is opened, a reference is obtained on the opening task's -credentials and this is attached to the file struct as 'f_cred' in place of -'f_uid' and 'f_gid'. Code that used to access file->f_uid and file->f_gid -should now access file->f_cred->fsuid and file->f_cred->fsgid. +credentials and this is attached to the file struct as ``f_cred`` in place of +``f_uid`` and ``f_gid``. Code that used to access ``file->f_uid`` and +``file->f_gid`` should now access ``file->f_cred->fsuid`` and +``file->f_cred->fsgid``. -It is safe to access f_cred without the use of RCU or locking because the +It is safe to access ``f_cred`` without the use of RCU or locking because the pointer will not change over the lifetime of the file struct, and nor will the contents of the cred struct pointed to, barring the exceptions listed above (see the Task Credentials section). -======================================= -OVERRIDING THE VFS'S USE OF CREDENTIALS +Overriding the VFS's Use of Credentials ======================================= Under some circumstances it is desirable to override the credentials used by -the VFS, and that can be done by calling into such as vfs_mkdir() with a +the VFS, and that can be done by calling into such as ``vfs_mkdir()`` with a different set of credentials. This is done in the following places: - (*) sys_faccessat(). - - (*) do_coredump(). - - (*) nfs4recover.c. + * ``sys_faccessat()``. + * ``do_coredump()``. + * nfs4recover.c. diff --git a/Documentation/security/index.rst b/Documentation/security/index.rst index 9bae6bb20e7f..298a94a33f05 100644 --- a/Documentation/security/index.rst +++ b/Documentation/security/index.rst @@ -1,7 +1,13 @@ ====================== -Security documentation +Security Documentation ====================== .. toctree:: + :maxdepth: 1 + credentials + IMA-templates + keys/index + LSM + self-protection tpm/index diff --git a/Documentation/security/keys.txt b/Documentation/security/keys/core.rst index cd5019934d7f..1648fa80b3bf 100644 --- a/Documentation/security/keys.txt +++ b/Documentation/security/keys/core.rst @@ -1,6 +1,6 @@ - ============================ - KERNEL KEY RETENTION SERVICE - ============================ +============================ +Kernel Key Retention Service +============================ This service allows cryptographic keys, authentication tokens, cross-domain user mappings, and similar to be cached in the kernel for the use of @@ -29,8 +29,7 @@ This document has the following sections: - Garbage collection -============ -KEY OVERVIEW +Key Overview ============ In this context, keys represent units of cryptographic data, authentication @@ -47,14 +46,14 @@ Each key has a number of attributes: - State. - (*) Each key is issued a serial number of type key_serial_t that is unique for + * Each key is issued a serial number of type key_serial_t that is unique for the lifetime of that key. All serial numbers are positive non-zero 32-bit integers. Userspace programs can use a key's serial numbers as a way to gain access to it, subject to permission checking. - (*) Each key is of a defined "type". Types must be registered inside the + * Each key is of a defined "type". Types must be registered inside the kernel by a kernel service (such as a filesystem) before keys of that type can be added or used. Userspace programs cannot define new types directly. @@ -64,18 +63,18 @@ Each key has a number of attributes: Should a type be removed from the system, all the keys of that type will be invalidated. - (*) Each key has a description. This should be a printable string. The key + * Each key has a description. This should be a printable string. The key type provides an operation to perform a match between the description on a key and a criterion string. - (*) Each key has an owner user ID, a group ID and a permissions mask. These + * Each key has an owner user ID, a group ID and a permissions mask. These are used to control what a process may do to a key from userspace, and whether a kernel service will be able to find the key. - (*) Each key can be set to expire at a specific time by the key type's + * Each key can be set to expire at a specific time by the key type's instantiation function. Keys can also be immortal. - (*) Each key can have a payload. This is a quantity of data that represent the + * Each key can have a payload. This is a quantity of data that represent the actual "key". In the case of a keyring, this is a list of keys to which the keyring links; in the case of a user-defined key, it's an arbitrary blob of data. @@ -91,39 +90,38 @@ Each key has a number of attributes: permitted, another key type operation will be called to convert the key's attached payload back into a blob of data. - (*) Each key can be in one of a number of basic states: + * Each key can be in one of a number of basic states: - (*) Uninstantiated. The key exists, but does not have any data attached. + * Uninstantiated. The key exists, but does not have any data attached. Keys being requested from userspace will be in this state. - (*) Instantiated. This is the normal state. The key is fully formed, and + * Instantiated. This is the normal state. The key is fully formed, and has data attached. - (*) Negative. This is a relatively short-lived state. The key acts as a + * Negative. This is a relatively short-lived state. The key acts as a note saying that a previous call out to userspace failed, and acts as a throttle on key lookups. A negative key can be updated to a normal state. - (*) Expired. Keys can have lifetimes set. If their lifetime is exceeded, + * Expired. Keys can have lifetimes set. If their lifetime is exceeded, they traverse to this state. An expired key can be updated back to a normal state. - (*) Revoked. A key is put in this state by userspace action. It can't be + * Revoked. A key is put in this state by userspace action. It can't be found or operated upon (apart from by unlinking it). - (*) Dead. The key's type was unregistered, and so the key is now useless. + * Dead. The key's type was unregistered, and so the key is now useless. Keys in the last three states are subject to garbage collection. See the section on "Garbage collection". -==================== -KEY SERVICE OVERVIEW +Key Service Overview ==================== The key service provides a number of features besides keys: - (*) The key service defines three special key types: + * The key service defines three special key types: (+) "keyring" @@ -149,7 +147,7 @@ The key service provides a number of features besides keys: be created and updated from userspace, but the payload is only readable from kernel space. - (*) Each process subscribes to three keyrings: a thread-specific keyring, a + * Each process subscribes to three keyrings: a thread-specific keyring, a process-specific keyring, and a session-specific keyring. The thread-specific keyring is discarded from the child when any sort of @@ -170,7 +168,7 @@ The key service provides a number of features besides keys: The ownership of the thread keyring changes when the real UID and GID of the thread changes. - (*) Each user ID resident in the system holds two special keyrings: a user + * Each user ID resident in the system holds two special keyrings: a user specific keyring and a default user session keyring. The default session keyring is initialised with a link to the user-specific keyring. @@ -180,7 +178,7 @@ The key service provides a number of features besides keys: If a process attempts to access its session key when it doesn't have one, it will be subscribed to the default for its current UID. - (*) Each user has two quotas against which the keys they own are tracked. One + * Each user has two quotas against which the keys they own are tracked. One limits the total number of keys and keyrings, the other limits the total amount of description and payload space that can be consumed. @@ -194,54 +192,53 @@ The key service provides a number of features besides keys: If a system call that modifies a key or keyring in some way would put the user over quota, the operation is refused and error EDQUOT is returned. - (*) There's a system call interface by which userspace programs can create and + * There's a system call interface by which userspace programs can create and manipulate keys and keyrings. - (*) There's a kernel interface by which services can register types and search + * There's a kernel interface by which services can register types and search for keys. - (*) There's a way for the a search done from the kernel to call back to + * There's a way for the a search done from the kernel to call back to userspace to request a key that can't be found in a process's keyrings. - (*) An optional filesystem is available through which the key database can be + * An optional filesystem is available through which the key database can be viewed and manipulated. -====================== -KEY ACCESS PERMISSIONS +Key Access Permissions ====================== Keys have an owner user ID, a group access ID, and a permissions mask. The mask has up to eight bits each for possessor, user, group and other access. Only six of each set of eight bits are defined. These permissions granted are: - (*) View + * View This permits a key or keyring's attributes to be viewed - including key type and description. - (*) Read + * Read This permits a key's payload to be viewed or a keyring's list of linked keys. - (*) Write + * Write This permits a key's payload to be instantiated or updated, or it allows a link to be added to or removed from a keyring. - (*) Search + * Search This permits keyrings to be searched and keys to be found. Searches can only recurse into nested keyrings that have search permission set. - (*) Link + * Link This permits a key or keyring to be linked to. To create a link from a keyring to a key, a process must have Write permission on the keyring and Link permission on the key. - (*) Set Attribute + * Set Attribute This permits a key's UID, GID and permissions mask to be changed. @@ -249,8 +246,7 @@ For changing the ownership, group ID or permissions mask, being the owner of the key or having the sysadmin capability is sufficient. -=============== -SELINUX SUPPORT +SELinux Support =============== The security class "key" has been added to SELinux so that mandatory access @@ -282,14 +278,13 @@ their associated thread, and both session and process keyrings are handled similarly. -================ -NEW PROCFS FILES +New ProcFS Files ================ Two files have been added to procfs by which an administrator can find out about the status of the key service: - (*) /proc/keys + * /proc/keys This lists the keys that are currently viewable by the task reading the file, giving information about their type, description and permissions. @@ -301,7 +296,7 @@ about the status of the key service: security checks are still performed, and may further filter out keys that the current process is not authorised to view. - The contents of the file look like this: + The contents of the file look like this:: SERIAL FLAGS USAGE EXPY PERM UID GID TYPE DESCRIPTION: SUMMARY 00000001 I----- 39 perm 1f3f0000 0 0 keyring _uid_ses.0: 1/4 @@ -314,7 +309,7 @@ about the status of the key service: 00000893 I--Q-N 1 35s 1f3f0000 0 0 user metal:silver: 0 00000894 I--Q-- 1 10h 003f0000 0 0 user metal:gold: 0 - The flags are: + The flags are:: I Instantiated R Revoked @@ -324,10 +319,10 @@ about the status of the key service: N Negative key - (*) /proc/key-users + * /proc/key-users This file lists the tracking data for each user that has at least one key - on the system. Such data includes quota information and statistics: + on the system. Such data includes quota information and statistics:: [root@andromeda root]# cat /proc/key-users 0: 46 45/45 1/100 13/10000 @@ -335,7 +330,8 @@ about the status of the key service: 32: 2 2/2 2/100 40/10000 38: 2 2/2 2/100 40/10000 - The format of each line is + The format of each line is:: + <UID>: User ID to which this applies <usage> Structure refcount <inst>/<keys> Total number of keys and number instantiated @@ -346,14 +342,14 @@ about the status of the key service: Four new sysctl files have been added also for the purpose of controlling the quota limits on keys: - (*) /proc/sys/kernel/keys/root_maxkeys + * /proc/sys/kernel/keys/root_maxkeys /proc/sys/kernel/keys/root_maxbytes These files hold the maximum number of keys that root may have and the maximum total number of bytes of data that root may have stored in those keys. - (*) /proc/sys/kernel/keys/maxkeys + * /proc/sys/kernel/keys/maxkeys /proc/sys/kernel/keys/maxbytes These files hold the maximum number of keys that each non-root user may @@ -364,8 +360,7 @@ Root may alter these by writing each new limit as a decimal number string to the appropriate file. -=============================== -USERSPACE SYSTEM CALL INTERFACE +Userspace System Call Interface =============================== Userspace can manipulate keys directly through three new syscalls: add_key, @@ -375,7 +370,7 @@ manipulating keys. When referring to a key directly, userspace programs should use the key's serial number (a positive 32-bit integer). However, there are some special values available for referring to special keys and keyrings that relate to the -process making the call: +process making the call:: CONSTANT VALUE KEY REFERENCED ============================== ====== =========================== @@ -391,8 +386,8 @@ process making the call: The main syscalls are: - (*) Create a new key of given type, description and payload and add it to the - nominated keyring: + * Create a new key of given type, description and payload and add it to the + nominated keyring:: key_serial_t add_key(const char *type, const char *desc, const void *payload, size_t plen, @@ -432,8 +427,8 @@ The main syscalls are: The ID of the new or updated key is returned if successful. - (*) Search the process's keyrings for a key, potentially calling out to - userspace to create it. + * Search the process's keyrings for a key, potentially calling out to + userspace to create it:: key_serial_t request_key(const char *type, const char *description, const char *callout_info, @@ -453,7 +448,7 @@ The main syscalls are: The keyctl syscall functions are: - (*) Map a special key ID to a real key ID for this process: + * Map a special key ID to a real key ID for this process:: key_serial_t keyctl(KEYCTL_GET_KEYRING_ID, key_serial_t id, int create); @@ -466,7 +461,7 @@ The keyctl syscall functions are: non-zero; and the error ENOKEY will be returned if "create" is zero. - (*) Replace the session keyring this process subscribes to with a new one: + * Replace the session keyring this process subscribes to with a new one:: key_serial_t keyctl(KEYCTL_JOIN_SESSION_KEYRING, const char *name); @@ -484,7 +479,7 @@ The keyctl syscall functions are: The ID of the new session keyring is returned if successful. - (*) Update the specified key: + * Update the specified key:: long keyctl(KEYCTL_UPDATE, key_serial_t key, const void *payload, size_t plen); @@ -498,7 +493,7 @@ The keyctl syscall functions are: add_key(). - (*) Revoke a key: + * Revoke a key:: long keyctl(KEYCTL_REVOKE, key_serial_t key); @@ -507,7 +502,7 @@ The keyctl syscall functions are: be findable. - (*) Change the ownership of a key: + * Change the ownership of a key:: long keyctl(KEYCTL_CHOWN, key_serial_t key, uid_t uid, gid_t gid); @@ -520,7 +515,7 @@ The keyctl syscall functions are: its group list members. - (*) Change the permissions mask on a key: + * Change the permissions mask on a key:: long keyctl(KEYCTL_SETPERM, key_serial_t key, key_perm_t perm); @@ -531,7 +526,7 @@ The keyctl syscall functions are: error EINVAL will be returned. - (*) Describe a key: + * Describe a key:: long keyctl(KEYCTL_DESCRIBE, key_serial_t key, char *buffer, size_t buflen); @@ -547,7 +542,7 @@ The keyctl syscall functions are: A process must have view permission on the key for this function to be successful. - If successful, a string is placed in the buffer in the following format: + If successful, a string is placed in the buffer in the following format:: <type>;<uid>;<gid>;<perm>;<description> @@ -555,12 +550,12 @@ The keyctl syscall functions are: is hexadecimal. A NUL character is included at the end of the string if the buffer is sufficiently big. - This can be parsed with + This can be parsed with:: sscanf(buffer, "%[^;];%d;%d;%o;%s", type, &uid, &gid, &mode, desc); - (*) Clear out a keyring: + * Clear out a keyring:: long keyctl(KEYCTL_CLEAR, key_serial_t keyring); @@ -573,7 +568,7 @@ The keyctl syscall functions are: DNS resolver cache keyring is an example of this. - (*) Link a key into a keyring: + * Link a key into a keyring:: long keyctl(KEYCTL_LINK, key_serial_t keyring, key_serial_t key); @@ -592,7 +587,7 @@ The keyctl syscall functions are: added. - (*) Unlink a key or keyring from another keyring: + * Unlink a key or keyring from another keyring:: long keyctl(KEYCTL_UNLINK, key_serial_t keyring, key_serial_t key); @@ -604,7 +599,7 @@ The keyctl syscall functions are: is not present, error ENOENT will be the result. - (*) Search a keyring tree for a key: + * Search a keyring tree for a key:: key_serial_t keyctl(KEYCTL_SEARCH, key_serial_t keyring, const char *type, const char *description, @@ -628,7 +623,7 @@ The keyctl syscall functions are: fails. On success, the resulting key ID will be returned. - (*) Read the payload data from a key: + * Read the payload data from a key:: long keyctl(KEYCTL_READ, key_serial_t keyring, char *buffer, size_t buflen); @@ -650,7 +645,7 @@ The keyctl syscall functions are: available rather than the amount copied. - (*) Instantiate a partially constructed key. + * Instantiate a partially constructed key:: long keyctl(KEYCTL_INSTANTIATE, key_serial_t key, const void *payload, size_t plen, @@ -677,7 +672,7 @@ The keyctl syscall functions are: array instead of a single buffer. - (*) Negatively instantiate a partially constructed key. + * Negatively instantiate a partially constructed key:: long keyctl(KEYCTL_NEGATE, key_serial_t key, unsigned timeout, key_serial_t keyring); @@ -700,12 +695,12 @@ The keyctl syscall functions are: as rejecting the key with ENOKEY as the error code. - (*) Set the default request-key destination keyring. + * Set the default request-key destination keyring:: long keyctl(KEYCTL_SET_REQKEY_KEYRING, int reqkey_defl); This sets the default keyring to which implicitly requested keys will be - attached for this thread. reqkey_defl should be one of these constants: + attached for this thread. reqkey_defl should be one of these constants:: CONSTANT VALUE NEW DEFAULT KEYRING ====================================== ====== ======================= @@ -731,7 +726,7 @@ The keyctl syscall functions are: there is one, otherwise the user default session keyring. - (*) Set the timeout on a key. + * Set the timeout on a key:: long keyctl(KEYCTL_SET_TIMEOUT, key_serial_t key, unsigned timeout); @@ -744,7 +739,7 @@ The keyctl syscall functions are: or expired keys. - (*) Assume the authority granted to instantiate a key + * Assume the authority granted to instantiate a key:: long keyctl(KEYCTL_ASSUME_AUTHORITY, key_serial_t key); @@ -766,7 +761,7 @@ The keyctl syscall functions are: The assumed authoritative key is inherited across fork and exec. - (*) Get the LSM security context attached to a key. + * Get the LSM security context attached to a key:: long keyctl(KEYCTL_GET_SECURITY, key_serial_t key, char *buffer, size_t buflen) @@ -787,7 +782,7 @@ The keyctl syscall functions are: successful. - (*) Install the calling process's session keyring on its parent. + * Install the calling process's session keyring on its parent:: long keyctl(KEYCTL_SESSION_TO_PARENT); @@ -807,7 +802,7 @@ The keyctl syscall functions are: kernel and resumes executing userspace. - (*) Invalidate a key. + * Invalidate a key:: long keyctl(KEYCTL_INVALIDATE, key_serial_t key); @@ -823,20 +818,19 @@ The keyctl syscall functions are: A process must have search permission on the key for this function to be successful. - (*) Compute a Diffie-Hellman shared secret or public key + * Compute a Diffie-Hellman shared secret or public key:: - long keyctl(KEYCTL_DH_COMPUTE, struct keyctl_dh_params *params, - char *buffer, size_t buflen, - struct keyctl_kdf_params *kdf); + long keyctl(KEYCTL_DH_COMPUTE, struct keyctl_dh_params *params, + char *buffer, size_t buflen, struct keyctl_kdf_params *kdf); - The params struct contains serial numbers for three keys: + The params struct contains serial numbers for three keys:: - The prime, p, known to both parties - The local private key - The base integer, which is either a shared generator or the remote public key - The value computed is: + The value computed is:: result = base ^ private (mod prime) @@ -858,12 +852,12 @@ The keyctl syscall functions are: of the KDF is returned to the caller. The KDF is characterized with struct keyctl_kdf_params as follows: - - char *hashname specifies the NUL terminated string identifying + - ``char *hashname`` specifies the NUL terminated string identifying the hash used from the kernel crypto API and applied for the KDF operation. The KDF implemenation complies with SP800-56A as well as with SP800-108 (the counter KDF). - - char *otherinfo specifies the OtherInfo data as documented in + - ``char *otherinfo`` specifies the OtherInfo data as documented in SP800-56A section 5.8.1.2. The length of the buffer is given with otherinfolen. The format of OtherInfo is defined by the caller. The otherinfo pointer may be NULL if no OtherInfo shall be used. @@ -875,10 +869,10 @@ The keyctl syscall functions are: and either the buffer length or the OtherInfo length exceeds the allowed length. - (*) Restrict keyring linkage + * Restrict keyring linkage:: - long keyctl(KEYCTL_RESTRICT_KEYRING, key_serial_t keyring, - const char *type, const char *restriction); + long keyctl(KEYCTL_RESTRICT_KEYRING, key_serial_t keyring, + const char *type, const char *restriction); An existing keyring can restrict linkage of additional keys by evaluating the contents of the key according to a restriction scheme. @@ -900,8 +894,13 @@ The keyctl syscall functions are: To apply a keyring restriction the process must have Set Attribute permission and the keyring must not be previously restricted. -=============== -KERNEL SERVICES + One application of restricted keyrings is to verify X.509 certificate + chains or individual certificate signatures using the asymmetric key type. + See Documentation/crypto/asymmetric-keys.txt for specific restrictions + applicable to the asymmetric key type. + + +Kernel Services =============== The kernel services for key management are fairly simple to deal with. They can @@ -915,29 +914,29 @@ call, and the key released upon close. How to deal with conflicting keys due to two different users opening the same file is left to the filesystem author to solve. -To access the key manager, the following header must be #included: +To access the key manager, the following header must be #included:: <linux/key.h> Specific key types should have a header file under include/keys/ that should be -used to access that type. For keys of type "user", for example, that would be: +used to access that type. For keys of type "user", for example, that would be:: <keys/user-type.h> Note that there are two different types of pointers to keys that may be encountered: - (*) struct key * + * struct key * This simply points to the key structure itself. Key structures will be at least four-byte aligned. - (*) key_ref_t + * key_ref_t - This is equivalent to a struct key *, but the least significant bit is set + This is equivalent to a ``struct key *``, but the least significant bit is set if the caller "possesses" the key. By "possession" it is meant that the calling processes has a searchable link to the key from one of its - keyrings. There are three functions for dealing with these: + keyrings. There are three functions for dealing with these:: key_ref_t make_key_ref(const struct key *key, bool possession); @@ -955,7 +954,7 @@ When accessing a key's payload contents, certain precautions must be taken to prevent access vs modification races. See the section "Notes on accessing payload contents" for more information. -(*) To search for a key, call: + * To search for a key, call:: struct key *request_key(const struct key_type *type, const char *description, @@ -977,7 +976,7 @@ payload contents" for more information. See also Documentation/security/keys-request-key.txt. -(*) To search for a key, passing auxiliary data to the upcaller, call: + * To search for a key, passing auxiliary data to the upcaller, call:: struct key *request_key_with_auxdata(const struct key_type *type, const char *description, @@ -990,14 +989,14 @@ payload contents" for more information. is a blob of length callout_len, if given (the length may be 0). -(*) A key can be requested asynchronously by calling one of: + * A key can be requested asynchronously by calling one of:: struct key *request_key_async(const struct key_type *type, const char *description, const void *callout_info, size_t callout_len); - or: + or:: struct key *request_key_async_with_auxdata(const struct key_type *type, const char *description, @@ -1010,7 +1009,7 @@ payload contents" for more information. These two functions return with the key potentially still under construction. To wait for construction completion, the following should be - called: + called:: int wait_for_key_construction(struct key *key, bool intr); @@ -1022,11 +1021,11 @@ payload contents" for more information. case error ERESTARTSYS will be returned. -(*) When it is no longer required, the key should be released using: + * When it is no longer required, the key should be released using:: void key_put(struct key *key); - Or: + Or:: void key_ref_put(key_ref_t key_ref); @@ -1034,8 +1033,8 @@ payload contents" for more information. the argument will not be parsed. -(*) Extra references can be made to a key by calling one of the following - functions: + * Extra references can be made to a key by calling one of the following + functions:: struct key *__key_get(struct key *key); struct key *key_get(struct key *key); @@ -1047,7 +1046,7 @@ payload contents" for more information. then the key will not be dereferenced and no increment will take place. -(*) A key's serial number can be obtained by calling: + * A key's serial number can be obtained by calling:: key_serial_t key_serial(struct key *key); @@ -1055,7 +1054,7 @@ payload contents" for more information. latter case without parsing the argument). -(*) If a keyring was found in the search, this can be further searched by: + * If a keyring was found in the search, this can be further searched by:: key_ref_t keyring_search(key_ref_t keyring_ref, const struct key_type *type, @@ -1070,7 +1069,7 @@ payload contents" for more information. reference pointer if successful. -(*) A keyring can be created by: + * A keyring can be created by:: struct key *keyring_alloc(const char *description, uid_t uid, gid_t gid, const struct cred *cred, @@ -1109,7 +1108,7 @@ payload contents" for more information. -EPERM to in this case. -(*) To check the validity of a key, this function can be called: + * To check the validity of a key, this function can be called:: int validate_key(struct key *key); @@ -1119,7 +1118,7 @@ payload contents" for more information. returned (in the latter case without parsing the argument). -(*) To register a key type, the following function should be called: + * To register a key type, the following function should be called:: int register_key_type(struct key_type *type); @@ -1127,13 +1126,13 @@ payload contents" for more information. present. -(*) To unregister a key type, call: + * To unregister a key type, call:: void unregister_key_type(struct key_type *type); Under some circumstances, it may be desirable to deal with a bundle of keys. -The facility provides access to the keyring type for managing such a bundle: +The facility provides access to the keyring type for managing such a bundle:: struct key_type key_type_keyring; @@ -1143,8 +1142,7 @@ with keyring_search(). Note that it is not possible to use request_key() to search a specific keyring, so using keyrings in this way is of limited utility. -=================================== -NOTES ON ACCESSING PAYLOAD CONTENTS +Notes On Accessing Payload Contents =================================== The simplest payload is just data stored in key->payload directly. In this @@ -1154,31 +1152,31 @@ More complex payload contents must be allocated and pointers to them set in the key->payload.data[] array. One of the following ways must be selected to access the data: - (1) Unmodifiable key type. + 1) Unmodifiable key type. If the key type does not have a modify method, then the key's payload can be accessed without any form of locking, provided that it's known to be instantiated (uninstantiated keys cannot be "found"). - (2) The key's semaphore. + 2) The key's semaphore. The semaphore could be used to govern access to the payload and to control the payload pointer. It must be write-locked for modifications and would have to be read-locked for general access. The disadvantage of doing this is that the accessor may be required to sleep. - (3) RCU. + 3) RCU. RCU must be used when the semaphore isn't already held; if the semaphore is held then the contents can't change under you unexpectedly as the semaphore must still be used to serialise modifications to the key. The key management code takes care of this for the key type. - However, this means using: + However, this means using:: rcu_read_lock() ... rcu_dereference() ... rcu_read_unlock() - to read the pointer, and: + to read the pointer, and:: rcu_dereference() ... rcu_assign_pointer() ... call_rcu() @@ -1194,11 +1192,11 @@ access the data: usage. This is called key->payload.rcu_data0. The following accessors wrap the RCU calls to this element: - (a) Set or change the first payload pointer: + a) Set or change the first payload pointer:: rcu_assign_keypointer(struct key *key, void *data); - (b) Read the first payload pointer with the key semaphore held: + b) Read the first payload pointer with the key semaphore held:: [const] void *dereference_key_locked([const] struct key *key); @@ -1206,39 +1204,38 @@ access the data: parameter. Static analysis will give an error if it things the lock isn't held. - (c) Read the first payload pointer with the RCU read lock held: + c) Read the first payload pointer with the RCU read lock held:: const void *dereference_key_rcu(const struct key *key); -=================== -DEFINING A KEY TYPE +Defining a Key Type =================== A kernel service may want to define its own key type. For instance, an AFS filesystem might want to define a Kerberos 5 ticket key type. To do this, it author fills in a key_type struct and registers it with the system. -Source files that implement key types should include the following header file: +Source files that implement key types should include the following header file:: <linux/key-type.h> The structure has a number of fields, some of which are mandatory: - (*) const char *name + * ``const char *name`` The name of the key type. This is used to translate a key type name supplied by userspace into a pointer to the structure. - (*) size_t def_datalen + * ``size_t def_datalen`` This is optional - it supplies the default payload data length as contributed to the quota. If the key type's payload is always or almost always the same size, then this is a more efficient way to do things. The data length (and quota) on a particular key can always be changed - during instantiation or update by calling: + during instantiation or update by calling:: int key_payload_reserve(struct key *key, size_t datalen); @@ -1246,18 +1243,18 @@ The structure has a number of fields, some of which are mandatory: viable. - (*) int (*vet_description)(const char *description); + * ``int (*vet_description)(const char *description);`` This optional method is called to vet a key description. If the key type doesn't approve of the key description, it may return an error, otherwise it should return 0. - (*) int (*preparse)(struct key_preparsed_payload *prep); + * ``int (*preparse)(struct key_preparsed_payload *prep);`` This optional method permits the key type to attempt to parse payload before a key is created (add key) or the key semaphore is taken (update or - instantiate key). The structure pointed to by prep looks like: + instantiate key). The structure pointed to by prep looks like:: struct key_preparsed_payload { char *description; @@ -1285,7 +1282,7 @@ The structure has a number of fields, some of which are mandatory: otherwise. - (*) void (*free_preparse)(struct key_preparsed_payload *prep); + * ``void (*free_preparse)(struct key_preparsed_payload *prep);`` This method is only required if the preparse() method is provided, otherwise it is unused. It cleans up anything attached to the description @@ -1294,7 +1291,7 @@ The structure has a number of fields, some of which are mandatory: successfully, even if instantiate() or update() succeed. - (*) int (*instantiate)(struct key *key, struct key_preparsed_payload *prep); + * ``int (*instantiate)(struct key *key, struct key_preparsed_payload *prep);`` This method is called to attach a payload to a key during construction. The payload attached need not bear any relation to the data passed to this @@ -1318,7 +1315,7 @@ The structure has a number of fields, some of which are mandatory: free_preparse method doesn't release the data. - (*) int (*update)(struct key *key, const void *data, size_t datalen); + * ``int (*update)(struct key *key, const void *data, size_t datalen);`` If this type of key can be updated, then this method should be provided. It is called to update a key's payload from the blob of data provided. @@ -1343,10 +1340,10 @@ The structure has a number of fields, some of which are mandatory: It is safe to sleep in this method. - (*) int (*match_preparse)(struct key_match_data *match_data); + * ``int (*match_preparse)(struct key_match_data *match_data);`` This method is optional. It is called when a key search is about to be - performed. It is given the following structure: + performed. It is given the following structure:: struct key_match_data { bool (*cmp)(const struct key *key, @@ -1357,23 +1354,23 @@ The structure has a number of fields, some of which are mandatory: }; On entry, raw_data will be pointing to the criteria to be used in matching - a key by the caller and should not be modified. (*cmp)() will be pointing + a key by the caller and should not be modified. ``(*cmp)()`` will be pointing to the default matcher function (which does an exact description match against raw_data) and lookup_type will be set to indicate a direct lookup. The following lookup_type values are available: - [*] KEYRING_SEARCH_LOOKUP_DIRECT - A direct lookup hashes the type and + * KEYRING_SEARCH_LOOKUP_DIRECT - A direct lookup hashes the type and description to narrow down the search to a small number of keys. - [*] KEYRING_SEARCH_LOOKUP_ITERATE - An iterative lookup walks all the + * KEYRING_SEARCH_LOOKUP_ITERATE - An iterative lookup walks all the keys in the keyring until one is matched. This must be used for any search that's not doing a simple direct match on the key description. The method may set cmp to point to a function of its choice that does some other form of match, may set lookup_type to KEYRING_SEARCH_LOOKUP_ITERATE - and may attach something to the preparsed pointer for use by (*cmp)(). - (*cmp)() should return true if a key matches and false otherwise. + and may attach something to the preparsed pointer for use by ``(*cmp)()``. + ``(*cmp)()`` should return true if a key matches and false otherwise. If preparsed is set, it may be necessary to use the match_free() method to clean it up. @@ -1381,20 +1378,20 @@ The structure has a number of fields, some of which are mandatory: The method should return 0 if successful or a negative error code otherwise. - It is permitted to sleep in this method, but (*cmp)() may not sleep as + It is permitted to sleep in this method, but ``(*cmp)()`` may not sleep as locks will be held over it. If match_preparse() is not provided, keys of this type will be matched exactly by their description. - (*) void (*match_free)(struct key_match_data *match_data); + * ``void (*match_free)(struct key_match_data *match_data);`` This method is optional. If given, it called to clean up match_data->preparsed after a successful call to match_preparse(). - (*) void (*revoke)(struct key *key); + * ``void (*revoke)(struct key *key);`` This method is optional. It is called to discard part of the payload data upon a key being revoked. The caller will have the key semaphore @@ -1404,7 +1401,7 @@ The structure has a number of fields, some of which are mandatory: a deadlock against the key semaphore. - (*) void (*destroy)(struct key *key); + * ``void (*destroy)(struct key *key);`` This method is optional. It is called to discard the payload data on a key when it is being destroyed. @@ -1416,7 +1413,7 @@ The structure has a number of fields, some of which are mandatory: It is not safe to sleep in this method; the caller may hold spinlocks. - (*) void (*describe)(const struct key *key, struct seq_file *p); + * ``void (*describe)(const struct key *key, struct seq_file *p);`` This method is optional. It is called during /proc/keys reading to summarise a key's description and payload in text form. @@ -1432,7 +1429,7 @@ The structure has a number of fields, some of which are mandatory: caller. - (*) long (*read)(const struct key *key, char __user *buffer, size_t buflen); + * ``long (*read)(const struct key *key, char __user *buffer, size_t buflen);`` This method is optional. It is called by KEYCTL_READ to translate the key's payload into something a blob of data for userspace to deal with. @@ -1448,8 +1445,7 @@ The structure has a number of fields, some of which are mandatory: as might happen when the userspace buffer is accessed. - (*) int (*request_key)(struct key_construction *cons, const char *op, - void *aux); + * ``int (*request_key)(struct key_construction *cons, const char *op, void *aux);`` This method is optional. If provided, request_key() and friends will invoke this function rather than upcalling to /sbin/request-key to operate @@ -1463,7 +1459,7 @@ The structure has a number of fields, some of which are mandatory: This method is permitted to return before the upcall is complete, but the following function must be called under all circumstances to complete the instantiation process, whether or not it succeeds, whether or not there's - an error: + an error:: void complete_request_key(struct key_construction *cons, int error); @@ -1479,16 +1475,16 @@ The structure has a number of fields, some of which are mandatory: The key under construction and the authorisation key can be found in the key_construction struct pointed to by cons: - (*) struct key *key; + * ``struct key *key;`` The key under construction. - (*) struct key *authkey; + * ``struct key *authkey;`` The authorisation key. - (*) struct key_restriction *(*lookup_restriction)(const char *params); + * ``struct key_restriction *(*lookup_restriction)(const char *params);`` This optional method is used to enable userspace configuration of keyring restrictions. The restriction parameter string (not including the key type @@ -1497,12 +1493,11 @@ The structure has a number of fields, some of which are mandatory: attempted key link operation. If there is no match, -EINVAL is returned. -============================ -REQUEST-KEY CALLBACK SERVICE +Request-Key Callback Service ============================ To create a new key, the kernel will attempt to execute the following command -line: +line:: /sbin/request-key create <key> <uid> <gid> \ <threadring> <processring> <sessionring> <callout_info> @@ -1511,10 +1506,10 @@ line: keyrings from the process that caused the search to be issued. These are included for two reasons: - (1) There may be an authentication token in one of the keyrings that is + 1 There may be an authentication token in one of the keyrings that is required to obtain the key, eg: a Kerberos Ticket-Granting Ticket. - (2) The new key should probably be cached in one of these rings. + 2 The new key should probably be cached in one of these rings. This program should set it UID and GID to those specified before attempting to access any more keys. It may then look around for a user specific process to @@ -1539,7 +1534,7 @@ instead. Similarly, the kernel may attempt to update an expired or a soon to expire key -by executing: +by executing:: /sbin/request-key update <key> <uid> <gid> \ <threadring> <processring> <sessionring> @@ -1548,8 +1543,7 @@ In this case, the program isn't required to actually attach the key to a ring; the rings are provided for reference. -================== -GARBAGE COLLECTION +Garbage Collection ================== Dead keys (for which the type has been removed) will be automatically unlinked @@ -1557,6 +1551,6 @@ from those keyrings that point to them and deleted as soon as possible by a background garbage collector. Similarly, revoked and expired keys will be garbage collected, but only after a -certain amount of time has passed. This time is set as a number of seconds in: +certain amount of time has passed. This time is set as a number of seconds in:: /proc/sys/kernel/keys/gc_delay diff --git a/Documentation/security/keys-ecryptfs.txt b/Documentation/security/keys/ecryptfs.rst index c3bbeba63562..4920f3a8ea75 100644 --- a/Documentation/security/keys-ecryptfs.txt +++ b/Documentation/security/keys/ecryptfs.rst @@ -1,4 +1,6 @@ - Encrypted keys for the eCryptfs filesystem +========================================== +Encrypted keys for the eCryptfs filesystem +========================================== ECryptfs is a stacked filesystem which transparently encrypts and decrypts each file using a randomly generated File Encryption Key (FEK). @@ -35,20 +37,23 @@ controlled environment. Another advantage is that the key is not exposed to threats of malicious software, because it is available in clear form only at kernel level. -Usage: +Usage:: + keyctl add encrypted name "new ecryptfs key-type:master-key-name keylen" ring keyctl add encrypted name "load hex_blob" ring keyctl update keyid "update key-type:master-key-name" -name:= '<16 hexadecimal characters>' -key-type:= 'trusted' | 'user' -keylen:= 64 +Where:: + + name:= '<16 hexadecimal characters>' + key-type:= 'trusted' | 'user' + keylen:= 64 Example of encrypted key usage with the eCryptfs filesystem: Create an encrypted key "1000100010001000" of length 64 bytes with format -'ecryptfs' and save it using a previously loaded user key "test": +'ecryptfs' and save it using a previously loaded user key "test":: $ keyctl add encrypted 1000100010001000 "new ecryptfs user:test 64" @u 19184530 @@ -62,7 +67,7 @@ Create an encrypted key "1000100010001000" of length 64 bytes with format $ keyctl pipe 19184530 > ecryptfs.blob Mount an eCryptfs filesystem using the created encrypted key "1000100010001000" -into the '/secret' directory: +into the '/secret' directory:: $ mount -i -t ecryptfs -oecryptfs_sig=1000100010001000,\ ecryptfs_cipher=aes,ecryptfs_key_bytes=32 /secret /secret diff --git a/Documentation/security/keys/index.rst b/Documentation/security/keys/index.rst new file mode 100644 index 000000000000..647d58f2588e --- /dev/null +++ b/Documentation/security/keys/index.rst @@ -0,0 +1,11 @@ +=========== +Kernel Keys +=========== + +.. toctree:: + :maxdepth: 1 + + core + ecryptfs + request-key + trusted-encrypted diff --git a/Documentation/security/keys-request-key.txt b/Documentation/security/keys/request-key.rst index 51987bfecfed..aba32784174c 100644 --- a/Documentation/security/keys-request-key.txt +++ b/Documentation/security/keys/request-key.rst @@ -1,19 +1,19 @@ - =================== - KEY REQUEST SERVICE - =================== +=================== +Key Request Service +=================== The key request service is part of the key retention service (refer to Documentation/security/keys.txt). This document explains more fully how the requesting algorithm works. The process starts by either the kernel requesting a service by calling -request_key*(): +``request_key*()``:: struct key *request_key(const struct key_type *type, const char *description, const char *callout_info); -or: +or:: struct key *request_key_with_auxdata(const struct key_type *type, const char *description, @@ -21,14 +21,14 @@ or: size_t callout_len, void *aux); -or: +or:: struct key *request_key_async(const struct key_type *type, const char *description, const char *callout_info, size_t callout_len); -or: +or:: struct key *request_key_async_with_auxdata(const struct key_type *type, const char *description, @@ -36,7 +36,7 @@ or: size_t callout_len, void *aux); -Or by userspace invoking the request_key system call: +Or by userspace invoking the request_key system call:: key_serial_t request_key(const char *type, const char *description, @@ -67,38 +67,37 @@ own upcall mechanisms. If they do, then those should be substituted for the forking and execution of /sbin/request-key. -=========== -THE PROCESS +The Process =========== A request proceeds in the following manner: - (1) Process A calls request_key() [the userspace syscall calls the kernel + 1) Process A calls request_key() [the userspace syscall calls the kernel interface]. - (2) request_key() searches the process's subscribed keyrings to see if there's + 2) request_key() searches the process's subscribed keyrings to see if there's a suitable key there. If there is, it returns the key. If there isn't, and callout_info is not set, an error is returned. Otherwise the process proceeds to the next step. - (3) request_key() sees that A doesn't have the desired key yet, so it creates + 3) request_key() sees that A doesn't have the desired key yet, so it creates two things: - (a) An uninstantiated key U of requested type and description. + a) An uninstantiated key U of requested type and description. - (b) An authorisation key V that refers to key U and notes that process A + b) An authorisation key V that refers to key U and notes that process A is the context in which key U should be instantiated and secured, and from which associated key requests may be satisfied. - (4) request_key() then forks and executes /sbin/request-key with a new session + 4) request_key() then forks and executes /sbin/request-key with a new session keyring that contains a link to auth key V. - (5) /sbin/request-key assumes the authority associated with key U. + 5) /sbin/request-key assumes the authority associated with key U. - (6) /sbin/request-key execs an appropriate program to perform the actual + 6) /sbin/request-key execs an appropriate program to perform the actual instantiation. - (7) The program may want to access another key from A's context (say a + 7) The program may want to access another key from A's context (say a Kerberos TGT key). It just requests the appropriate key, and the keyring search notes that the session keyring has auth key V in its bottom level. @@ -106,15 +105,15 @@ A request proceeds in the following manner: UID, GID, groups and security info of process A as if it was process A, and come up with key W. - (8) The program then does what it must to get the data with which to + 8) The program then does what it must to get the data with which to instantiate key U, using key W as a reference (perhaps it contacts a Kerberos server using the TGT) and then instantiates key U. - (9) Upon instantiating key U, auth key V is automatically revoked so that it + 9) Upon instantiating key U, auth key V is automatically revoked so that it may not be used again. -(10) The program then exits 0 and request_key() deletes key V and returns key - U to the caller. + 10) The program then exits 0 and request_key() deletes key V and returns key + U to the caller. This also extends further. If key W (step 7 above) didn't exist, key W would be created uninstantiated, another auth key (X) would be created (as per step @@ -127,8 +126,7 @@ This is because process A's keyrings can't simply be attached to of them, and (b) it requires the same UID/GID/Groups all the way through. -==================================== -NEGATIVE INSTANTIATION AND REJECTION +Negative Instantiation And Rejection ==================================== Rather than instantiating a key, it is possible for the possessor of an @@ -145,23 +143,22 @@ signal, the key under construction will be automatically negatively instantiated for a short amount of time. -==================== -THE SEARCH ALGORITHM +The Search Algorithm ==================== A search of any particular keyring proceeds in the following fashion: - (1) When the key management code searches for a key (keyring_search_aux) it + 1) When the key management code searches for a key (keyring_search_aux) it firstly calls key_permission(SEARCH) on the keyring it's starting with, if this denies permission, it doesn't search further. - (2) It considers all the non-keyring keys within that keyring and, if any key + 2) It considers all the non-keyring keys within that keyring and, if any key matches the criteria specified, calls key_permission(SEARCH) on it to see if the key is allowed to be found. If it is, that key is returned; if not, the search continues, and the error code is retained if of higher priority than the one currently set. - (3) It then considers all the keyring-type keys in the keyring it's currently + 3) It then considers all the keyring-type keys in the keyring it's currently searching. It calls key_permission(SEARCH) on each keyring, and if this grants permission, it recurses, executing steps (2) and (3) on that keyring. @@ -173,20 +170,20 @@ returned. When search_process_keyrings() is invoked, it performs the following searches until one succeeds: - (1) If extant, the process's thread keyring is searched. + 1) If extant, the process's thread keyring is searched. - (2) If extant, the process's process keyring is searched. + 2) If extant, the process's process keyring is searched. - (3) The process's session keyring is searched. + 3) The process's session keyring is searched. - (4) If the process has assumed the authority associated with a request_key() + 4) If the process has assumed the authority associated with a request_key() authorisation key then: - (a) If extant, the calling process's thread keyring is searched. + a) If extant, the calling process's thread keyring is searched. - (b) If extant, the calling process's process keyring is searched. + b) If extant, the calling process's process keyring is searched. - (c) The calling process's session keyring is searched. + c) The calling process's session keyring is searched. The moment one succeeds, all pending errors are discarded and the found key is returned. @@ -194,7 +191,7 @@ returned. Only if all these fail does the whole thing fail with the highest priority error. Note that several errors may have come from LSM. -The error priority is: +The error priority is:: EKEYREVOKED > EKEYEXPIRED > ENOKEY diff --git a/Documentation/security/keys-trusted-encrypted.txt b/Documentation/security/keys/trusted-encrypted.rst index b20a993a32af..7b503831bdea 100644 --- a/Documentation/security/keys-trusted-encrypted.txt +++ b/Documentation/security/keys/trusted-encrypted.rst @@ -1,4 +1,6 @@ - Trusted and Encrypted Keys +========================== +Trusted and Encrypted Keys +========================== Trusted and Encrypted Keys are two new key types added to the existing kernel key ring service. Both of these new types are variable length symmetric keys, @@ -20,7 +22,8 @@ By default, trusted keys are sealed under the SRK, which has the default authorization value (20 zeros). This can be set at takeownership time with the trouser's utility: "tpm_takeownership -u -z". -Usage: +Usage:: + keyctl add trusted name "new keylen [options]" ring keyctl add trusted name "load hex_blob [pcrlock=pcrnum]" ring keyctl update key "update [options]" @@ -64,19 +67,22 @@ The decrypted portion of encrypted keys can contain either a simple symmetric key or a more complex structure. The format of the more complex structure is application specific, which is identified by 'format'. -Usage: +Usage:: + keyctl add encrypted name "new [format] key-type:master-key-name keylen" ring keyctl add encrypted name "load hex_blob" ring keyctl update keyid "update key-type:master-key-name" -format:= 'default | ecryptfs' -key-type:= 'trusted' | 'user' +Where:: + + format:= 'default | ecryptfs' + key-type:= 'trusted' | 'user' Examples of trusted and encrypted key usage: -Create and save a trusted key named "kmk" of length 32 bytes: +Create and save a trusted key named "kmk" of length 32 bytes:: $ keyctl add trusted kmk "new 32" @u 440502848 @@ -99,7 +105,7 @@ Create and save a trusted key named "kmk" of length 32 bytes: $ keyctl pipe 440502848 > kmk.blob -Load a trusted key from the saved blob: +Load a trusted key from the saved blob:: $ keyctl add trusted kmk "load `cat kmk.blob`" @u 268728824 @@ -114,7 +120,7 @@ Load a trusted key from the saved blob: f1f8fff03ad0acb083725535636addb08d73dedb9832da198081e5deae84bfaf0409c22b e4a8aea2b607ec96931e6f4d4fe563ba -Reseal a trusted key under new pcr values: +Reseal a trusted key under new pcr values:: $ keyctl update 268728824 "update pcrinfo=`cat pcr.blob`" $ keyctl print 268728824 @@ -135,11 +141,13 @@ compromised by a user level problem, and when sealed to specific boot PCR values, protects against boot and offline attacks. Create and save an encrypted key "evm" using the above trusted key "kmk": -option 1: omitting 'format' +option 1: omitting 'format':: + $ keyctl add encrypted evm "new trusted:kmk 32" @u 159771175 -option 2: explicitly defining 'format' as 'default' +option 2: explicitly defining 'format' as 'default':: + $ keyctl add encrypted evm "new default trusted:kmk 32" @u 159771175 @@ -150,7 +158,7 @@ option 2: explicitly defining 'format' as 'default' $ keyctl pipe 159771175 > evm.blob -Load an encrypted key "evm" from saved blob: +Load an encrypted key "evm" from saved blob:: $ keyctl add encrypted evm "load `cat evm.blob`" @u 831684262 @@ -164,4 +172,4 @@ Other uses for trusted and encrypted keys, such as for disk and file encryption are anticipated. In particular the new format 'ecryptfs' has been defined in in order to use encrypted keys to mount an eCryptfs filesystem. More details about the usage can be found in the file -'Documentation/security/keys-ecryptfs.txt'. +``Documentation/security/keys-ecryptfs.txt``. diff --git a/Documentation/security/self-protection.txt b/Documentation/security/self-protection.rst index 141acfebe6ef..60c8bd8b77bf 100644 --- a/Documentation/security/self-protection.txt +++ b/Documentation/security/self-protection.rst @@ -1,4 +1,6 @@ -# Kernel Self-Protection +====================== +Kernel Self-Protection +====================== Kernel self-protection is the design and implementation of systems and structures within the Linux kernel to protect against security flaws in @@ -26,7 +28,8 @@ mentioning them, since these aspects need to be explored, dealt with, and/or accepted. -## Attack Surface Reduction +Attack Surface Reduction +======================== The most fundamental defense against security exploits is to reduce the areas of the kernel that can be used to redirect execution. This ranges @@ -34,13 +37,15 @@ from limiting the exposed APIs available to userspace, making in-kernel APIs hard to use incorrectly, minimizing the areas of writable kernel memory, etc. -### Strict kernel memory permissions +Strict kernel memory permissions +-------------------------------- When all of kernel memory is writable, it becomes trivial for attacks to redirect execution flow. To reduce the availability of these targets the kernel needs to protect its memory with a tight set of permissions. -#### Executable code and read-only data must not be writable +Executable code and read-only data must not be writable +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Any areas of the kernel with executable memory must not be writable. While this obviously includes the kernel text itself, we must consider @@ -51,18 +56,19 @@ kernel, they are implemented in a way where the memory is temporarily made writable during the update, and then returned to the original permissions.) -In support of this are CONFIG_STRICT_KERNEL_RWX and -CONFIG_STRICT_MODULE_RWX, which seek to make sure that code is not +In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and +``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not writable, data is not executable, and read-only data is neither writable nor executable. Most architectures have these options on by default and not user selectable. For some architectures like arm that wish to have these be selectable, the architecture Kconfig can select ARCH_OPTIONAL_KERNEL_RWX to enable -a Kconfig prompt. CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT determines +a Kconfig prompt. ``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT`` determines the default setting when ARCH_OPTIONAL_KERNEL_RWX is enabled. -#### Function pointers and sensitive variables must not be writable +Function pointers and sensitive variables must not be writable +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Vast areas of kernel memory contain function pointers that are looked up by the kernel and used to continue execution (e.g. descriptor/vector @@ -74,8 +80,8 @@ so that they live in the .rodata section instead of the .data section of the kernel, gaining the protection of the kernel's strict memory permissions as described above. -For variables that are initialized once at __init time, these can -be marked with the (new and under development) __ro_after_init +For variables that are initialized once at ``__init`` time, these can +be marked with the (new and under development) ``__ro_after_init`` attribute. What remains are variables that are updated rarely (e.g. GDT). These @@ -85,7 +91,8 @@ of their lifetime read-only. (For example, when being updated, only the CPU thread performing the update would be given uninterruptible write access to the memory.) -#### Segregation of kernel memory from userspace memory +Segregation of kernel memory from userspace memory +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The kernel must never execute userspace memory. The kernel must also never access userspace memory without explicit expectation to do so. These @@ -95,10 +102,11 @@ By blocking userspace memory in this way, execution and data parsing cannot be passed to trivially-controlled userspace memory, forcing attacks to operate entirely in kernel memory. -### Reduced access to syscalls +Reduced access to syscalls +-------------------------- One trivial way to eliminate many syscalls for 64-bit systems is building -without CONFIG_COMPAT. However, this is rarely a feasible scenario. +without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario. The "seccomp" system provides an opt-in feature made available to userspace, which provides a way to reduce the number of kernel entry @@ -112,7 +120,8 @@ to trusted processes. This would keep the scope of kernel entry points restricted to the more regular set of normally available to unprivileged userspace. -### Restricting access to kernel modules +Restricting access to kernel modules +------------------------------------ The kernel should never allow an unprivileged user the ability to load specific kernel modules, since that would provide a facility to @@ -127,11 +136,12 @@ for debate in some scenarios.) To protect against even privileged users, systems may need to either disable module loading entirely (e.g. monolithic kernel builds or modules_disabled sysctl), or provide signed modules (e.g. -CONFIG_MODULE_SIG_FORCE, or dm-crypt with LoadPin), to keep from having +``CONFIG_MODULE_SIG_FORCE``, or dm-crypt with LoadPin), to keep from having root load arbitrary kernel code via the module loader interface. -## Memory integrity +Memory integrity +================ There are many memory structures in the kernel that are regularly abused to gain execution control during an attack, By far the most commonly @@ -139,16 +149,18 @@ understood is that of the stack buffer overflow in which the return address stored on the stack is overwritten. Many other examples of this kind of attack exist, and protections exist to defend against them. -### Stack buffer overflow +Stack buffer overflow +--------------------- The classic stack buffer overflow involves writing past the expected end of a variable stored on the stack, ultimately writing a controlled value to the stack frame's stored return address. The most widely used defense is the presence of a stack canary between the stack variables and the -return address (CONFIG_CC_STACKPROTECTOR), which is verified just before +return address (``CONFIG_CC_STACKPROTECTOR``), which is verified just before the function returns. Other defenses include things like shadow stacks. -### Stack depth overflow +Stack depth overflow +-------------------- A less well understood attack is using a bug that triggers the kernel to consume stack memory with deep function calls or large stack @@ -158,27 +170,31 @@ important changes need to be made for better protections: moving the sensitive thread_info structure elsewhere, and adding a faulting memory hole at the bottom of the stack to catch these overflows. -### Heap memory integrity +Heap memory integrity +--------------------- The structures used to track heap free lists can be sanity-checked during allocation and freeing to make sure they aren't being used to manipulate other memory areas. -### Counter integrity +Counter integrity +----------------- Many places in the kernel use atomic counters to track object references or perform similar lifetime management. When these counters can be made to wrap (over or under) this traditionally exposes a use-after-free flaw. By trapping atomic wrapping, this class of bug vanishes. -### Size calculation overflow detection +Size calculation overflow detection +----------------------------------- Similar to counter overflow, integer overflows (usually size calculations) need to be detected at runtime to kill this class of bug, which traditionally leads to being able to write past the end of kernel buffers. -## Statistical defenses +Probabilistic defenses +====================== While many protections can be considered deterministic (e.g. read-only memory cannot be written to), some protections provide only statistical @@ -186,7 +202,8 @@ defense, in that an attack must gather enough information about a running system to overcome the defense. While not perfect, these do provide meaningful defenses. -### Canaries, blinding, and other secrets +Canaries, blinding, and other secrets +------------------------------------- It should be noted that things like the stack canary discussed earlier are technically statistical defenses, since they rely on a secret value, @@ -201,7 +218,8 @@ It is critical that the secret values used must be separate (e.g. different canary per stack) and high entropy (e.g. is the RNG actually working?) in order to maximize their success. -### Kernel Address Space Layout Randomization (KASLR) +Kernel Address Space Layout Randomization (KASLR) +------------------------------------------------- Since the location of kernel memory is almost always instrumental in mounting a successful attack, making the location non-deterministic @@ -209,22 +227,25 @@ raises the difficulty of an exploit. (Note that this in turn makes the value of information exposures higher, since they may be used to discover desired memory locations.) -#### Text and module base +Text and module base +~~~~~~~~~~~~~~~~~~~~ By relocating the physical and virtual base address of the kernel at -boot-time (CONFIG_RANDOMIZE_BASE), attacks needing kernel code will be +boot-time (``CONFIG_RANDOMIZE_BASE``), attacks needing kernel code will be frustrated. Additionally, offsetting the module loading base address means that even systems that load the same set of modules in the same order every boot will not share a common base address with the rest of the kernel text. -#### Stack base +Stack base +~~~~~~~~~~ If the base address of the kernel stack is not the same between processes, or even not the same between syscalls, targets on or beyond the stack become more difficult to locate. -#### Dynamic memory base +Dynamic memory base +~~~~~~~~~~~~~~~~~~~ Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up being relatively deterministic in layout due to the order of early-boot @@ -232,7 +253,8 @@ initializations. If the base address of these areas is not the same between boots, targeting them is frustrated, requiring an information exposure specific to the region. -#### Structure layout +Structure layout +~~~~~~~~~~~~~~~~ By performing a per-build randomization of the layout of sensitive structures, attacks must either be tuned to known kernel builds or expose @@ -240,26 +262,30 @@ enough kernel memory to determine structure layouts before manipulating them. -## Preventing Information Exposures +Preventing Information Exposures +================================ Since the locations of sensitive structures are the primary target for attacks, it is important to defend against exposure of both kernel memory addresses and kernel memory contents (since they may contain kernel addresses or other sensitive things like canary values). -### Unique identifiers +Unique identifiers +------------------ Kernel memory addresses must never be used as identifiers exposed to userspace. Instead, use an atomic counter, an idr, or similar unique identifier. -### Memory initialization +Memory initialization +--------------------- Memory copied to userspace must always be fully initialized. If not explicitly memset(), this will require changes to the compiler to make sure structure holes are cleared. -### Memory poisoning +Memory poisoning +---------------- When releasing memory, it is best to poison the contents (clear stack on syscall return, wipe heap memory on a free), to avoid reuse attacks that @@ -267,9 +293,10 @@ rely on the old contents of memory. This frustrates many uninitialized variable attacks, stack content exposures, heap content exposures, and use-after-free attacks. -### Destination tracking +Destination tracking +-------------------- To help kill classes of bugs that result in kernel addresses being written to userspace, the destination of writes needs to be tracked. If -the buffer is destined for userspace (e.g. seq_file backed /proc files), +the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files), it should automatically censor sensitive values. diff --git a/Documentation/serial/n_gsm.txt b/Documentation/serial/n_gsm.txt index a5d91126a8f7..875361bb7cb4 100644 --- a/Documentation/serial/n_gsm.txt +++ b/Documentation/serial/n_gsm.txt @@ -77,6 +77,13 @@ for example, it's possible : 6- first close all virtual ports before closing the physical port. +Note that after closing the physical port the modem is still in multiplexing +mode. This may prevent a successful re-opening of the port later. To avoid +this situation either reset the modem if your hardware allows that or send +a disconnect command frame manually before initializing the multiplexing mode +for the second time. The byte sequence for the disconnect command frame is: +0xf9, 0x03, 0xef, 0x03, 0xc3, 0x16, 0xf9. + Additional Documentation ------------------------ More practical details on the protocol and how it's supported by industrial diff --git a/Documentation/sgi-ioc4.txt b/Documentation/sgi-ioc4.txt index 876c96ae38db..72709222d3c0 100644 --- a/Documentation/sgi-ioc4.txt +++ b/Documentation/sgi-ioc4.txt @@ -1,3 +1,7 @@ +==================================== +SGI IOC4 PCI (multi function) device +==================================== + The SGI IOC4 PCI device is a bit of a strange beast, so some notes on it are in order. diff --git a/Documentation/sh/conf.py b/Documentation/sh/conf.py new file mode 100644 index 000000000000..1eb684a13ac8 --- /dev/null +++ b/Documentation/sh/conf.py @@ -0,0 +1,10 @@ +# -*- coding: utf-8; mode: python -*- + +project = "SuperH architecture implementation manual" + +tags.add("subproject") + +latex_documents = [ + ('index', 'sh.tex', project, + 'The kernel development community', 'manual'), +] diff --git a/Documentation/sh/index.rst b/Documentation/sh/index.rst new file mode 100644 index 000000000000..bc8db7ba894a --- /dev/null +++ b/Documentation/sh/index.rst @@ -0,0 +1,59 @@ +======================= +SuperH Interfaces Guide +======================= + +:Author: Paul Mundt + +Memory Management +================= + +SH-4 +---- + +Store Queue API +~~~~~~~~~~~~~~~ + +.. kernel-doc:: arch/sh/kernel/cpu/sh4/sq.c + :export: + +SH-5 +---- + +TLB Interfaces +~~~~~~~~~~~~~~ + +.. kernel-doc:: arch/sh/mm/tlb-sh5.c + :internal: + +.. kernel-doc:: arch/sh/include/asm/tlb_64.h + :internal: + +Machine Specific Interfaces +=========================== + +mach-dreamcast +-------------- + +.. kernel-doc:: arch/sh/boards/mach-dreamcast/rtc.c + :internal: + +mach-x3proto +------------ + +.. kernel-doc:: arch/sh/boards/mach-x3proto/ilsel.c + :export: + +Busses +====== + +SuperHyway +---------- + +.. kernel-doc:: drivers/sh/superhyway/superhyway.c + :export: + +Maple +----- + +.. kernel-doc:: drivers/sh/maple/maple.c + :export: diff --git a/Documentation/siphash.txt b/Documentation/siphash.txt index 908d348ff777..9965821ab333 100644 --- a/Documentation/siphash.txt +++ b/Documentation/siphash.txt @@ -1,6 +1,8 @@ - SipHash - a short input PRF ------------------------------------------------ -Written by Jason A. Donenfeld <jason@zx2c4.com> +=========================== +SipHash - a short input PRF +=========================== + +:Author: Written by Jason A. Donenfeld <jason@zx2c4.com> SipHash is a cryptographically secure PRF -- a keyed hash function -- that performs very well for short inputs, hence the name. It was designed by @@ -13,58 +15,61 @@ an input buffer or several input integers. It spits out an integer that is indistinguishable from random. You may then use that integer as part of secure sequence numbers, secure cookies, or mask it off for use in a hash table. -1. Generating a key +Generating a key +================ Keys should always be generated from a cryptographically secure source of -random numbers, either using get_random_bytes or get_random_once: +random numbers, either using get_random_bytes or get_random_once:: -siphash_key_t key; -get_random_bytes(&key, sizeof(key)); + siphash_key_t key; + get_random_bytes(&key, sizeof(key)); If you're not deriving your key from here, you're doing it wrong. -2. Using the functions +Using the functions +=================== There are two variants of the function, one that takes a list of integers, and -one that takes a buffer: +one that takes a buffer:: -u64 siphash(const void *data, size_t len, const siphash_key_t *key); + u64 siphash(const void *data, size_t len, const siphash_key_t *key); -And: +And:: -u64 siphash_1u64(u64, const siphash_key_t *key); -u64 siphash_2u64(u64, u64, const siphash_key_t *key); -u64 siphash_3u64(u64, u64, u64, const siphash_key_t *key); -u64 siphash_4u64(u64, u64, u64, u64, const siphash_key_t *key); -u64 siphash_1u32(u32, const siphash_key_t *key); -u64 siphash_2u32(u32, u32, const siphash_key_t *key); -u64 siphash_3u32(u32, u32, u32, const siphash_key_t *key); -u64 siphash_4u32(u32, u32, u32, u32, const siphash_key_t *key); + u64 siphash_1u64(u64, const siphash_key_t *key); + u64 siphash_2u64(u64, u64, const siphash_key_t *key); + u64 siphash_3u64(u64, u64, u64, const siphash_key_t *key); + u64 siphash_4u64(u64, u64, u64, u64, const siphash_key_t *key); + u64 siphash_1u32(u32, const siphash_key_t *key); + u64 siphash_2u32(u32, u32, const siphash_key_t *key); + u64 siphash_3u32(u32, u32, u32, const siphash_key_t *key); + u64 siphash_4u32(u32, u32, u32, u32, const siphash_key_t *key); If you pass the generic siphash function something of a constant length, it will constant fold at compile-time and automatically choose one of the optimized functions. -3. Hashtable key function usage: +Hashtable key function usage:: -struct some_hashtable { - DECLARE_HASHTABLE(hashtable, 8); - siphash_key_t key; -}; + struct some_hashtable { + DECLARE_HASHTABLE(hashtable, 8); + siphash_key_t key; + }; -void init_hashtable(struct some_hashtable *table) -{ - get_random_bytes(&table->key, sizeof(table->key)); -} + void init_hashtable(struct some_hashtable *table) + { + get_random_bytes(&table->key, sizeof(table->key)); + } -static inline hlist_head *some_hashtable_bucket(struct some_hashtable *table, struct interesting_input *input) -{ - return &table->hashtable[siphash(input, sizeof(*input), &table->key) & (HASH_SIZE(table->hashtable) - 1)]; -} + static inline hlist_head *some_hashtable_bucket(struct some_hashtable *table, struct interesting_input *input) + { + return &table->hashtable[siphash(input, sizeof(*input), &table->key) & (HASH_SIZE(table->hashtable) - 1)]; + } You may then iterate like usual over the returned hash bucket. -4. Security +Security +======== SipHash has a very high security margin, with its 128-bit key. So long as the key is kept secret, it is impossible for an attacker to guess the outputs of @@ -73,7 +78,8 @@ is significant. Linux implements the "2-4" variant of SipHash. -5. Struct-passing Pitfalls +Struct-passing Pitfalls +======================= Often times the XuY functions will not be large enough, and instead you'll want to pass a pre-filled struct to siphash. When doing this, it's important @@ -81,30 +87,32 @@ to always ensure the struct has no padding holes. The easiest way to do this is to simply arrange the members of the struct in descending order of size, and to use offsetendof() instead of sizeof() for getting the size. For performance reasons, if possible, it's probably a good thing to align the -struct to the right boundary. Here's an example: - -const struct { - struct in6_addr saddr; - u32 counter; - u16 dport; -} __aligned(SIPHASH_ALIGNMENT) combined = { - .saddr = *(struct in6_addr *)saddr, - .counter = counter, - .dport = dport -}; -u64 h = siphash(&combined, offsetofend(typeof(combined), dport), &secret); - -6. Resources +struct to the right boundary. Here's an example:: + + const struct { + struct in6_addr saddr; + u32 counter; + u16 dport; + } __aligned(SIPHASH_ALIGNMENT) combined = { + .saddr = *(struct in6_addr *)saddr, + .counter = counter, + .dport = dport + }; + u64 h = siphash(&combined, offsetofend(typeof(combined), dport), &secret); + +Resources +========= Read the SipHash paper if you're interested in learning more: https://131002.net/siphash/siphash.pdf +------------------------------------------------------------------------------- -~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~ - +=============================================== HalfSipHash - SipHash's insecure younger cousin ------------------------------------------------ -Written by Jason A. Donenfeld <jason@zx2c4.com> +=============================================== + +:Author: Written by Jason A. Donenfeld <jason@zx2c4.com> On the off-chance that SipHash is not fast enough for your needs, you might be able to justify using HalfSipHash, a terrifying but potentially useful @@ -120,7 +128,8 @@ then when you can be absolutely certain that the outputs will never be transmitted out of the kernel. This is only remotely useful over `jhash` as a means of mitigating hashtable flooding denial of service attacks. -1. Generating a key +Generating a key +================ Keys should always be generated from a cryptographically secure source of random numbers, either using get_random_bytes or get_random_once: @@ -130,44 +139,49 @@ get_random_bytes(&key, sizeof(key)); If you're not deriving your key from here, you're doing it wrong. -2. Using the functions +Using the functions +=================== There are two variants of the function, one that takes a list of integers, and -one that takes a buffer: +one that takes a buffer:: -u32 hsiphash(const void *data, size_t len, const hsiphash_key_t *key); + u32 hsiphash(const void *data, size_t len, const hsiphash_key_t *key); -And: +And:: -u32 hsiphash_1u32(u32, const hsiphash_key_t *key); -u32 hsiphash_2u32(u32, u32, const hsiphash_key_t *key); -u32 hsiphash_3u32(u32, u32, u32, const hsiphash_key_t *key); -u32 hsiphash_4u32(u32, u32, u32, u32, const hsiphash_key_t *key); + u32 hsiphash_1u32(u32, const hsiphash_key_t *key); + u32 hsiphash_2u32(u32, u32, const hsiphash_key_t *key); + u32 hsiphash_3u32(u32, u32, u32, const hsiphash_key_t *key); + u32 hsiphash_4u32(u32, u32, u32, u32, const hsiphash_key_t *key); If you pass the generic hsiphash function something of a constant length, it will constant fold at compile-time and automatically choose one of the optimized functions. -3. Hashtable key function usage: +Hashtable key function usage +============================ + +:: -struct some_hashtable { - DECLARE_HASHTABLE(hashtable, 8); - hsiphash_key_t key; -}; + struct some_hashtable { + DECLARE_HASHTABLE(hashtable, 8); + hsiphash_key_t key; + }; -void init_hashtable(struct some_hashtable *table) -{ - get_random_bytes(&table->key, sizeof(table->key)); -} + void init_hashtable(struct some_hashtable *table) + { + get_random_bytes(&table->key, sizeof(table->key)); + } -static inline hlist_head *some_hashtable_bucket(struct some_hashtable *table, struct interesting_input *input) -{ - return &table->hashtable[hsiphash(input, sizeof(*input), &table->key) & (HASH_SIZE(table->hashtable) - 1)]; -} + static inline hlist_head *some_hashtable_bucket(struct some_hashtable *table, struct interesting_input *input) + { + return &table->hashtable[hsiphash(input, sizeof(*input), &table->key) & (HASH_SIZE(table->hashtable) - 1)]; + } You may then iterate like usual over the returned hash bucket. -4. Performance +Performance +=========== HalfSipHash is roughly 3 times slower than JenkinsHash. For many replacements, this will not be a problem, as the hashtable lookup isn't the bottleneck. And diff --git a/Documentation/smsc_ece1099.txt b/Documentation/smsc_ece1099.txt index 6b492e82b43d..079277421eaf 100644 --- a/Documentation/smsc_ece1099.txt +++ b/Documentation/smsc_ece1099.txt @@ -1,3 +1,7 @@ +================================================= +Msc Keyboard Scan Expansion/GPIO Expansion device +================================================= + What is smsc-ece1099? ---------------------- diff --git a/Documentation/sound/conf.py b/Documentation/sound/conf.py new file mode 100644 index 000000000000..3f1fc5e74e7b --- /dev/null +++ b/Documentation/sound/conf.py @@ -0,0 +1,10 @@ +# -*- coding: utf-8; mode: python -*- + +project = "Linux Sound Subsystem Documentation" + +tags.add("subproject") + +latex_documents = [ + ('index', 'sound.tex', project, + 'The kernel development community', 'manual'), +] diff --git a/Documentation/sound/designs/index.rst b/Documentation/sound/designs/index.rst index 04dcdae3e4f2..f0749943ccb2 100644 --- a/Documentation/sound/designs/index.rst +++ b/Documentation/sound/designs/index.rst @@ -9,6 +9,7 @@ Designs and Implementations compress-offload timestamping jack-controls + tracepoints procfile powersave oss-emulation diff --git a/Documentation/sound/designs/tracepoints.rst b/Documentation/sound/designs/tracepoints.rst new file mode 100644 index 000000000000..78bc5572f829 --- /dev/null +++ b/Documentation/sound/designs/tracepoints.rst @@ -0,0 +1,172 @@ +=================== +Tracepoints in ALSA +=================== + +2017/07/02 +Takasahi Sakamoto + +Tracepoints in ALSA PCM core +============================ + +ALSA PCM core registers ``snd_pcm`` subsystem to kernel tracepoint system. +This subsystem includes two categories of tracepoints; for state of PCM buffer +and for processing of PCM hardware parameters. These tracepoints are available +when corresponding kernel configurations are enabled. When ``CONFIG_SND_DEBUG`` +is enabled, the latter tracepoints are available. When additional +``SND_PCM_XRUN_DEBUG`` is enabled too, the former trace points are enabled. + +Tracepoints for state of PCM buffer +------------------------------------ + +This category includes four tracepoints; ``hwptr``, ``applptr``, ``xrun`` and +``hw_ptr_error``. + +Tracepoints for processing of PCM hardware parameters +----------------------------------------------------- + +This category includes two tracepoints; ``hw_mask_param`` and +``hw_interval_param``. + +In a design of ALSA PCM core, data transmission is abstracted as PCM substream. +Applications manage PCM substream to maintain data transmission for PCM frames. +Before starting the data transmission, applications need to configure PCM +substream. In this procedure, PCM hardware parameters are decided by +interaction between applications and ALSA PCM core. Once decided, runtime of +the PCM substream keeps the parameters. + +The parameters are described in :c:type:`struct snd_pcm_hw_params`. This +structure includes several types of parameters. Applications set preferable +value to these parameters, then execute ioctl(2) with SNDRV_PCM_IOCTL_HW_REFINE +or SNDRV_PCM_IOCTL_HW_PARAMS. The former is used just for refining available +set of parameters. The latter is used for an actual decision of the parameters. + +The :c:type:`struct snd_pcm_hw_params` structure has below members: + +``flags`` + Configurable. ALSA PCM core and some drivers handle this flag to select + convenient parameters or change their behaviour. +``masks`` + Configurable. This type of parameter is described in + :c:type:`struct snd_mask` and represent mask values. As of PCM protocol + v2.0.13, three types are defined. + + - SNDRV_PCM_HW_PARAM_ACCESS + - SNDRV_PCM_HW_PARAM_FORMAT + - SNDRV_PCM_HW_PARAM_SUBFORMAT +``intervals`` + Configurable. This type of parameter is described in + :c:type:`struct snd_interval` and represent values with a range. As of + PCM protocol v2.0.13, twelve types are defined. + + - SNDRV_PCM_HW_PARAM_SAMPLE_BITS + - SNDRV_PCM_HW_PARAM_FRAME_BITS + - SNDRV_PCM_HW_PARAM_CHANNELS + - SNDRV_PCM_HW_PARAM_RATE + - SNDRV_PCM_HW_PARAM_PERIOD_TIME + - SNDRV_PCM_HW_PARAM_PERIOD_SIZE + - SNDRV_PCM_HW_PARAM_PERIOD_BYTES + - SNDRV_PCM_HW_PARAM_PERIODS + - SNDRV_PCM_HW_PARAM_BUFFER_TIME + - SNDRV_PCM_HW_PARAM_BUFFER_SIZE + - SNDRV_PCM_HW_PARAM_BUFFER_BYTES + - SNDRV_PCM_HW_PARAM_TICK_TIME +``rmask`` + Configurable. This is evaluated at ioctl(2) with + SNDRV_PCM_IOCTL_HW_REFINE only. Applications can select which + mask/interval parameter can be changed by ALSA PCM core. For + SNDRV_PCM_IOCTL_HW_PARAMS, this mask is ignored and all of parameters + are going to be changed. +``cmask`` + Read-only. After returning from ioctl(2), buffer in user space for + :c:type:`struct snd_pcm_hw_params` includes result of each operation. + This mask represents which mask/interval parameter is actually changed. +``info`` + Read-only. This represents hardware/driver capabilities as bit flags + with SNDRV_PCM_INFO_XXX. Typically, applications execute ioctl(2) with + SNDRV_PCM_IOCTL_HW_REFINE to retrieve this flag, then decide candidates + of parameters and execute ioctl(2) with SNDRV_PCM_IOCTL_HW_PARAMS to + configure PCM substream. +``msbits`` + Read-only. This value represents available bit width in MSB side of + a PCM sample. When a parameter of SNDRV_PCM_HW_PARAM_SAMPLE_BITS was + decided as a fixed number, this value is also calculated according to + it. Else, zero. But this behaviour depends on implementations in driver + side. +``rate_num`` + Read-only. This value represents numerator of sampling rate in fraction + notation. Basically, when a parameter of SNDRV_PCM_HW_PARAM_RATE was + decided as a single value, this value is also calculated according to + it. Else, zero. But this behaviour depends on implementations in driver + side. +``rate_den`` + Read-only. This value represents denominator of sampling rate in + fraction notation. Basically, when a parameter of + SNDRV_PCM_HW_PARAM_RATE was decided as a single value, this value is + also calculated according to it. Else, zero. But this behaviour depends + on implementations in driver side. +``fifo_size`` + Read-only. This value represents the size of FIFO in serial sound + interface of hardware. Basically, each driver can assigns a proper + value to this parameter but some drivers intentionally set zero with + a care of hardware design or data transmission protocol. + +ALSA PCM core handles buffer of :c:type:`struct snd_pcm_hw_params` when +applications execute ioctl(2) with SNDRV_PCM_HW_REFINE or SNDRV_PCM_HW_PARAMS. +Parameters in the buffer are changed according to +:c:type:`struct snd_pcm_hardware` and rules of constraints in the runtime. The +structure describes capabilities of handled hardware. The rules describes +dependencies on which a parameter is decided according to several parameters. +A rule has a callback function, and drivers can register arbitrary functions +to compute the target parameter. ALSA PCM core registers some rules to the +runtime as a default. + +Each driver can join in the interaction as long as it prepared for two stuffs +in a callback of :c:type:`struct snd_pcm_ops.open`. + +1. In the callback, drivers are expected to change a member of + :c:type:`struct snd_pcm_hardware` type in the runtime, according to + capacities of corresponding hardware. +2. In the same callback, drivers are also expected to register additional rules + of constraints into the runtime when several parameters have dependencies + due to hardware design. + +The driver can refers to result of the interaction in a callback of +:c:type:`struct snd_pcm_ops.hw_params`, however it should not change the +content. + +Tracepoints in this category are designed to trace changes of the +mask/interval parameters. When ALSA PCM core changes them, ``hw_mask_param`` or +``hw_interval_param`` event is probed according to type of the changed parameter. + +ALSA PCM core also has a pretty print format for each of the tracepoints. Below +is an example for ``hw_mask_param``. + +:: + + hw_mask_param: pcmC0D0p 001/023 FORMAT 00000000000000000000001000000044 00000000000000000000001000000044 + + +Below is an example for ``hw_interval_param``. + +:: + + hw_interval_param: pcmC0D0p 000/023 BUFFER_SIZE 0 0 [0 4294967295] 0 1 [0 4294967295] + +The first three fields are common. They represent name of ALSA PCM character +device, rules of constraint and name of the changed parameter, in order. The +field for rules of constraint consists of two sub-fields; index of applied rule +and total number of rules added to the runtime. As an exception, the index 000 +means that the parameter is changed by ALSA PCM core, regardless of the rules. + +The rest of field represent state of the parameter before/after changing. These +fields are different according to type of the parameter. For parameters of mask +type, the fields represent hexadecimal dump of content of the parameter. For +parameters of interval type, the fields represent values of each member of +``empty``, ``integer``, ``openmin``, ``min``, ``max``, ``openmax`` in +:c:type:`struct snd_interval` in this order. + +Tracepoints in drivers +====================== + +Some drivers have tracepoints for developers' convenience. For them, please +refer to each documentation or implementation. diff --git a/Documentation/sound/kernel-api/writing-an-alsa-driver.rst b/Documentation/sound/kernel-api/writing-an-alsa-driver.rst index 95c5443eff38..58ffa3f5bda7 100644 --- a/Documentation/sound/kernel-api/writing-an-alsa-driver.rst +++ b/Documentation/sound/kernel-api/writing-an-alsa-driver.rst @@ -2080,8 +2080,8 @@ sleeping poll threads, etc. This callback is also atomic as default. -copy and silence callbacks -~~~~~~~~~~~~~~~~~~~~~~~~~~ +copy_user, copy_kernel and fill_silence ops +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ These callbacks are not mandatory, and can be omitted in most cases. These callbacks are used when the hardware buffer cannot be in the @@ -3532,8 +3532,9 @@ external hardware buffer in interrupts (or in tasklets, preferably). The first case works fine if the external hardware buffer is large enough. This method doesn't need any extra buffers and thus is more -effective. You need to define the ``copy`` and ``silence`` callbacks -for the data transfer. However, there is a drawback: it cannot be +effective. You need to define the ``copy_user`` and ``copy_kernel`` +callbacks for the data transfer, in addition to ``fill_silence`` +callback for playback. However, there is a drawback: it cannot be mmapped. The examples are GUS's GF1 PCM or emu8000's wavetable PCM. The second case allows for mmap on the buffer, although you have to @@ -3545,30 +3546,34 @@ Another case is when the chip uses a PCI memory-map region for the buffer instead of the host memory. In this case, mmap is available only on certain architectures like the Intel one. In non-mmap mode, the data cannot be transferred as in the normal way. Thus you need to define the -``copy`` and ``silence`` callbacks as well, as in the cases above. The -examples are found in ``rme32.c`` and ``rme96.c``. +``copy_user``, ``copy_kernel`` and ``fill_silence`` callbacks as well, +as in the cases above. The examples are found in ``rme32.c`` and +``rme96.c``. -The implementation of the ``copy`` and ``silence`` callbacks depends -upon whether the hardware supports interleaved or non-interleaved -samples. The ``copy`` callback is defined like below, a bit -differently depending whether the direction is playback or capture: +The implementation of the ``copy_user``, ``copy_kernel`` and +``silence`` callbacks depends upon whether the hardware supports +interleaved or non-interleaved samples. The ``copy_user`` callback is +defined like below, a bit differently depending whether the direction +is playback or capture: :: - static int playback_copy(struct snd_pcm_substream *substream, int channel, - snd_pcm_uframes_t pos, void *src, snd_pcm_uframes_t count); - static int capture_copy(struct snd_pcm_substream *substream, int channel, - snd_pcm_uframes_t pos, void *dst, snd_pcm_uframes_t count); + static int playback_copy_user(struct snd_pcm_substream *substream, + int channel, unsigned long pos, + void __user *src, unsigned long count); + static int capture_copy_user(struct snd_pcm_substream *substream, + int channel, unsigned long pos, + void __user *dst, unsigned long count); In the case of interleaved samples, the second argument (``channel``) is not used. The third argument (``pos``) points the current position -offset in frames. +offset in bytes. The meaning of the fourth argument is different between playback and capture. For playback, it holds the source data pointer, and for capture, it's the destination data pointer. -The last argument is the number of frames to be copied. +The last argument is the number of bytes to be copied. What you have to do in this callback is again different between playback and capture directions. In the playback case, you copy the given amount @@ -3578,8 +3583,7 @@ way, the copy would be like: :: - my_memcpy(my_buffer + frames_to_bytes(runtime, pos), src, - frames_to_bytes(runtime, count)); + my_memcpy_from_user(my_buffer + pos, src, count); For the capture direction, you copy the given amount of data (``count``) at the specified offset (``pos``) on the hardware buffer to the @@ -3587,31 +3591,68 @@ specified pointer (``dst``). :: - my_memcpy(dst, my_buffer + frames_to_bytes(runtime, pos), - frames_to_bytes(runtime, count)); + my_memcpy_to_user(dst, my_buffer + pos, count); + +Here the functions are named as ``from_user`` and ``to_user`` because +it's the user-space buffer that is passed to these callbacks. That +is, the callback is supposed to copy from/to the user-space data +directly to/from the hardware buffer. -Note that both the position and the amount of data are given in frames. +Careful readers might notice that these callbacks receive the +arguments in bytes, not in frames like other callbacks. It's because +it would make coding easier like the examples above, and also it makes +easier to unify both the interleaved and non-interleaved cases, as +explained in the following. In the case of non-interleaved samples, the implementation will be a bit -more complicated. +more complicated. The callback is called for each channel, passed by +the second argument, so totally it's called for N-channels times per +transfer. + +The meaning of other arguments are almost same as the interleaved +case. The callback is supposed to copy the data from/to the given +user-space buffer, but only for the given channel. For the detailed +implementations, please check ``isa/gus/gus_pcm.c`` or +"pci/rme9652/rme9652.c" as examples. + +The above callbacks are the copy from/to the user-space buffer. There +are some cases where we want copy from/to the kernel-space buffer +instead. In such a case, ``copy_kernel`` callback is called. It'd +look like: + +:: + + static int playback_copy_kernel(struct snd_pcm_substream *substream, + int channel, unsigned long pos, + void *src, unsigned long count); + static int capture_copy_kernel(struct snd_pcm_substream *substream, + int channel, unsigned long pos, + void *dst, unsigned long count); + +As found easily, the only difference is that the buffer pointer is +without ``__user`` prefix; that is, a kernel-buffer pointer is passed +in the fourth argument. Correspondingly, the implementation would be +a version without the user-copy, such as: -You need to check the channel argument, and if it's -1, copy the whole -channels. Otherwise, you have to copy only the specified channel. Please -check ``isa/gus/gus_pcm.c`` as an example. +:: + + my_memcpy(my_buffer + pos, src, count); -The ``silence`` callback is also implemented in a similar way +Usually for the playback, another callback ``fill_silence`` is +defined. It's implemented in a similar way as the copy callbacks +above: :: static int silence(struct snd_pcm_substream *substream, int channel, - snd_pcm_uframes_t pos, snd_pcm_uframes_t count); + unsigned long pos, unsigned long count); -The meanings of arguments are the same as in the ``copy`` callback, -although there is no ``src/dst`` argument. In the case of interleaved -samples, the channel argument has no meaning, as well as on ``copy`` -callback. +The meanings of arguments are the same as in the ``copy_user`` and +``copy_kernel`` callbacks, although there is no buffer pointer +argument. In the case of interleaved samples, the channel argument has +no meaning, as well as on ``copy_*`` callbacks. -The role of ``silence`` callback is to set the given amount +The role of ``fill_silence`` callback is to set the given amount (``count``) of silence data at the specified offset (``pos``) on the hardware buffer. Suppose that the data format is signed (that is, the silent-data is 0), and the implementation using a memset-like function @@ -3619,11 +3660,11 @@ would be like: :: - my_memcpy(my_buffer + frames_to_bytes(runtime, pos), 0, - frames_to_bytes(runtime, count)); + my_memset(my_buffer + pos, 0, count); In the case of non-interleaved samples, again, the implementation -becomes a bit more complicated. See, for example, ``isa/gus/gus_pcm.c``. +becomes a bit more complicated, as it's called N-times per transfer +for each channel. See, for example, ``isa/gus/gus_pcm.c``. Non-Contiguous Buffers ---------------------- diff --git a/Documentation/sound/soc/dapm.rst b/Documentation/sound/soc/dapm.rst index a27f42befa4d..8e44107933ab 100644 --- a/Documentation/sound/soc/dapm.rst +++ b/Documentation/sound/soc/dapm.rst @@ -105,6 +105,24 @@ Pre Special PRE widget (exec before all others) Post Special POST widget (exec after all others) +Buffer + Inter widget audio data buffer within a DSP. +Scheduler + DSP internal scheduler that schedules component/pipeline processing + work. +Effect + Widget that performs an audio processing effect. +SRC + Sample Rate Converter within DSP or CODEC +ASRC + Asynchronous Sample Rate Converter within DSP or CODEC +Encoder + Widget that encodes audio data from one format (usually PCM) to another + usually more compressed format. +Decoder + Widget that decodes audio data from a compressed format to an + uncompressed format like PCM. + (Widgets are defined in include/sound/soc-dapm.h) diff --git a/Documentation/sphinx/convert_template.sed b/Documentation/sphinx/convert_template.sed deleted file mode 100644 index c1503fcca4ec..000000000000 --- a/Documentation/sphinx/convert_template.sed +++ /dev/null @@ -1,18 +0,0 @@ -# -# Pandoc doesn't grok <function> or <structname>, so convert them -# ahead of time. -# -# Use the following escapes to pass through pandoc: -# $bq = "`" -# $lt = "<" -# $gt = ">" -# -s%<function>\([^<(]\+\)()</function>%:c:func:$bq\1()$bq%g -s%<function>\([^<(]\+\)</function>%:c:func:$bq\1()$bq%g -s%<structname>struct *\([^<]\+\)</structname>%:c:type:$bqstruct \1 $lt\1$gt$bq%g -s%struct <structname>\([^<]\+\)</structname>%:c:type:$bqstruct \1 $lt\1$gt$bq%g -s%<structname>\([^<]\+\)</structname>%:c:type:$bqstruct \1 $lt\1$gt$bq%g -# -# Wrap docproc directives in para and code blocks. -# -s%^\(!.*\)$%<para><code>DOCPROC: \1</code></para>% diff --git a/Documentation/sphinx/post_convert.sed b/Documentation/sphinx/post_convert.sed deleted file mode 100644 index 392770bac53b..000000000000 --- a/Documentation/sphinx/post_convert.sed +++ /dev/null @@ -1,23 +0,0 @@ -# -# Unescape. -# -s/$bq/`/g -s/$lt/</g -s/$gt/>/g -# -# pandoc thinks that both "_" needs to be escaped. Remove the extra -# backslashes. -# -s/\\_/_/g -# -# Unwrap docproc directives. -# -s/^``DOCPROC: !E\(.*\)``$/.. kernel-doc:: \1\n :export:/ -s/^``DOCPROC: !I\(.*\)``$/.. kernel-doc:: \1\n :internal:/ -s/^``DOCPROC: !F\([^ ]*\) \(.*\)``$/.. kernel-doc:: \1\n :functions: \2/ -s/^``DOCPROC: !P\([^ ]*\) \(.*\)``$/.. kernel-doc:: \1\n :doc: \2/ -s/^``DOCPROC: \(!.*\)``$/.. WARNING: DOCPROC directive not supported: \1/ -# -# Trim trailing whitespace. -# -s/[[:space:]]*$// diff --git a/Documentation/sphinx/tmplcvt b/Documentation/sphinx/tmplcvt deleted file mode 100755 index 6848f0a26fa5..000000000000 --- a/Documentation/sphinx/tmplcvt +++ /dev/null @@ -1,28 +0,0 @@ -#!/bin/bash -# -# Convert a template file into something like RST -# -# fix <function> -# feed to pandoc -# fix \_ -# title line? -# -set -eu - -if [ "$#" != "2" ]; then - echo "$0 <docbook file> <rst file>" - exit -fi - -DIR=$(dirname $0) - -in=$1 -rst=$2 -tmp=$rst.tmp - -cp $in $tmp -sed --in-place -f $DIR/convert_template.sed $tmp -pandoc -s -S -f docbook -t rst -o $rst $tmp -sed --in-place -f $DIR/post_convert.sed $rst -rm $tmp -echo "book writen to $rst" diff --git a/Documentation/spi/spi-summary b/Documentation/spi/spi-summary index d1824b399b2d..1721c1b570c3 100644 --- a/Documentation/spi/spi-summary +++ b/Documentation/spi/spi-summary @@ -62,8 +62,8 @@ chips described as using "three wire" signaling: SCK, data, nCSx. (That data line is sometimes called MOMI or SISO.) Microcontrollers often support both master and slave sides of the SPI -protocol. This document (and Linux) currently only supports the master -side of SPI interactions. +protocol. This document (and Linux) supports both the master and slave +sides of SPI interactions. Who uses it? On what kinds of systems? @@ -154,9 +154,8 @@ control audio interfaces, present touchscreen sensors as input interfaces, or monitor temperature and voltage levels during industrial processing. And those might all be sharing the same controller driver. -A "struct spi_device" encapsulates the master-side interface between -those two types of driver. At this writing, Linux has no slave side -programming interface. +A "struct spi_device" encapsulates the controller-side interface between +those two types of drivers. There is a minimal core of SPI programming interfaces, focussing on using the driver model to connect controller and protocol drivers using @@ -177,10 +176,24 @@ shows up in sysfs in several locations: /sys/bus/spi/drivers/D ... driver for one or more spi*.* devices /sys/class/spi_master/spiB ... symlink (or actual device node) to - a logical node which could hold class related state for the - controller managing bus "B". All spiB.* devices share one + a logical node which could hold class related state for the SPI + master controller managing bus "B". All spiB.* devices share one physical SPI bus segment, with SCLK, MOSI, and MISO. + /sys/devices/.../CTLR/slave ... virtual file for (un)registering the + slave device for an SPI slave controller. + Writing the driver name of an SPI slave handler to this file + registers the slave device; writing "(null)" unregisters the slave + device. + Reading from this file shows the name of the slave device ("(null)" + if not registered). + + /sys/class/spi_slave/spiB ... symlink (or actual device node) to + a logical node which could hold class related state for the SPI + slave controller on bus "B". When registered, a single spiB.* + device is present here, possible sharing the physical SPI bus + segment with other SPI slave devices. + Note that the actual location of the controller's class state depends on whether you enabled CONFIG_SYSFS_DEPRECATED or not. At this time, the only class-specific state is the bus number ("B" in "spiB"), so diff --git a/Documentation/static-keys.txt b/Documentation/static-keys.txt index ef419fd0897f..b83dfa1c0602 100644 --- a/Documentation/static-keys.txt +++ b/Documentation/static-keys.txt @@ -1,30 +1,34 @@ - Static Keys - ----------- +=========== +Static Keys +=========== -DEPRECATED API: +.. warning:: -The use of 'struct static_key' directly, is now DEPRECATED. In addition -static_key_{true,false}() is also DEPRECATED. IE DO NOT use the following: + DEPRECATED API: -struct static_key false = STATIC_KEY_INIT_FALSE; -struct static_key true = STATIC_KEY_INIT_TRUE; -static_key_true() -static_key_false() + The use of 'struct static_key' directly, is now DEPRECATED. In addition + static_key_{true,false}() is also DEPRECATED. IE DO NOT use the following:: -The updated API replacements are: + struct static_key false = STATIC_KEY_INIT_FALSE; + struct static_key true = STATIC_KEY_INIT_TRUE; + static_key_true() + static_key_false() -DEFINE_STATIC_KEY_TRUE(key); -DEFINE_STATIC_KEY_FALSE(key); -DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count); -DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count); -static_branch_likely() -static_branch_unlikely() + The updated API replacements are:: -0) Abstract + DEFINE_STATIC_KEY_TRUE(key); + DEFINE_STATIC_KEY_FALSE(key); + DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count); + DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count); + static_branch_likely() + static_branch_unlikely() + +Abstract +======== Static keys allows the inclusion of seldom used features in performance-sensitive fast-path kernel code, via a GCC feature and a code -patching technique. A quick example: +patching technique. A quick example:: DEFINE_STATIC_KEY_FALSE(key); @@ -45,7 +49,8 @@ The static_branch_unlikely() branch will be generated into the code with as litt impact to the likely code path as possible. -1) Motivation +Motivation +========== Currently, tracepoints are implemented using a conditional branch. The @@ -60,7 +65,8 @@ possible. Although tracepoints are the original motivation for this work, other kernel code paths should be able to make use of the static keys facility. -2) Solution +Solution +======== gcc (v4.5) adds a new 'asm goto' statement that allows branching to a label: @@ -71,7 +77,7 @@ Using the 'asm goto', we can create branches that are either taken or not taken by default, without the need to check memory. Then, at run-time, we can patch the branch site to change the branch direction. -For example, if we have a simple branch that is disabled by default: +For example, if we have a simple branch that is disabled by default:: if (static_branch_unlikely(&key)) printk("I am the true branch\n"); @@ -87,14 +93,15 @@ optimization. This lowlevel patching mechanism is called 'jump label patching', and it gives the basis for the static keys facility. -3) Static key label API, usage and examples: +Static key label API, usage and examples +======================================== -In order to make use of this optimization you must first define a key: +In order to make use of this optimization you must first define a key:: DEFINE_STATIC_KEY_TRUE(key); -or: +or:: DEFINE_STATIC_KEY_FALSE(key); @@ -102,14 +109,14 @@ or: The key must be global, that is, it can't be allocated on the stack or dynamically allocated at run-time. -The key is then used in code as: +The key is then used in code as:: if (static_branch_unlikely(&key)) do unlikely code else do likely code -Or: +Or:: if (static_branch_likely(&key)) do likely code @@ -120,15 +127,15 @@ Keys defined via DEFINE_STATIC_KEY_TRUE(), or DEFINE_STATIC_KEY_FALSE, may be used in either static_branch_likely() or static_branch_unlikely() statements. -Branch(es) can be set true via: +Branch(es) can be set true via:: -static_branch_enable(&key); + static_branch_enable(&key); -or false via: +or false via:: -static_branch_disable(&key); + static_branch_disable(&key); -The branch(es) can then be switched via reference counts: +The branch(es) can then be switched via reference counts:: static_branch_inc(&key); ... @@ -142,11 +149,11 @@ static_branch_inc(), will change the branch back to true. Likewise, if the key is initialized false, a 'static_branch_inc()', will change the branch to true. And then a 'static_branch_dec()', will again make the branch false. -Where an array of keys is required, it can be defined as: +Where an array of keys is required, it can be defined as:: DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count); -or: +or:: DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count); @@ -159,96 +166,98 @@ simply fall back to a traditional, load, test, and jump sequence. Also, the struct jump_entry table must be at least 4-byte aligned because the static_key->entry field makes use of the two least significant bits. -* select HAVE_ARCH_JUMP_LABEL, see: arch/x86/Kconfig - -* #define JUMP_LABEL_NOP_SIZE, see: arch/x86/include/asm/jump_label.h +* ``select HAVE_ARCH_JUMP_LABEL``, + see: arch/x86/Kconfig -* __always_inline bool arch_static_branch(struct static_key *key, bool branch), see: - arch/x86/include/asm/jump_label.h +* ``#define JUMP_LABEL_NOP_SIZE``, + see: arch/x86/include/asm/jump_label.h -* __always_inline bool arch_static_branch_jump(struct static_key *key, bool branch), - see: arch/x86/include/asm/jump_label.h +* ``__always_inline bool arch_static_branch(struct static_key *key, bool branch)``, + see: arch/x86/include/asm/jump_label.h -* void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type), - see: arch/x86/kernel/jump_label.c +* ``__always_inline bool arch_static_branch_jump(struct static_key *key, bool branch)``, + see: arch/x86/include/asm/jump_label.h -* __init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type), - see: arch/x86/kernel/jump_label.c +* ``void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type)``, + see: arch/x86/kernel/jump_label.c +* ``__init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type)``, + see: arch/x86/kernel/jump_label.c -* struct jump_entry, see: arch/x86/include/asm/jump_label.h +* ``struct jump_entry``, + see: arch/x86/include/asm/jump_label.h 5) Static keys / jump label analysis, results (x86_64): As an example, let's add the following branch to 'getppid()', such that the -system call now looks like: +system call now looks like:: -SYSCALL_DEFINE0(getppid) -{ + SYSCALL_DEFINE0(getppid) + { int pid; -+ if (static_branch_unlikely(&key)) -+ printk("I am the true branch\n"); + + if (static_branch_unlikely(&key)) + + printk("I am the true branch\n"); rcu_read_lock(); pid = task_tgid_vnr(rcu_dereference(current->real_parent)); rcu_read_unlock(); return pid; -} - -The resulting instructions with jump labels generated by GCC is: - -ffffffff81044290 <sys_getppid>: -ffffffff81044290: 55 push %rbp -ffffffff81044291: 48 89 e5 mov %rsp,%rbp -ffffffff81044294: e9 00 00 00 00 jmpq ffffffff81044299 <sys_getppid+0x9> -ffffffff81044299: 65 48 8b 04 25 c0 b6 mov %gs:0xb6c0,%rax -ffffffff810442a0: 00 00 -ffffffff810442a2: 48 8b 80 80 02 00 00 mov 0x280(%rax),%rax -ffffffff810442a9: 48 8b 80 b0 02 00 00 mov 0x2b0(%rax),%rax -ffffffff810442b0: 48 8b b8 e8 02 00 00 mov 0x2e8(%rax),%rdi -ffffffff810442b7: e8 f4 d9 00 00 callq ffffffff81051cb0 <pid_vnr> -ffffffff810442bc: 5d pop %rbp -ffffffff810442bd: 48 98 cltq -ffffffff810442bf: c3 retq -ffffffff810442c0: 48 c7 c7 e3 54 98 81 mov $0xffffffff819854e3,%rdi -ffffffff810442c7: 31 c0 xor %eax,%eax -ffffffff810442c9: e8 71 13 6d 00 callq ffffffff8171563f <printk> -ffffffff810442ce: eb c9 jmp ffffffff81044299 <sys_getppid+0x9> - -Without the jump label optimization it looks like: - -ffffffff810441f0 <sys_getppid>: -ffffffff810441f0: 8b 05 8a 52 d8 00 mov 0xd8528a(%rip),%eax # ffffffff81dc9480 <key> -ffffffff810441f6: 55 push %rbp -ffffffff810441f7: 48 89 e5 mov %rsp,%rbp -ffffffff810441fa: 85 c0 test %eax,%eax -ffffffff810441fc: 75 27 jne ffffffff81044225 <sys_getppid+0x35> -ffffffff810441fe: 65 48 8b 04 25 c0 b6 mov %gs:0xb6c0,%rax -ffffffff81044205: 00 00 -ffffffff81044207: 48 8b 80 80 02 00 00 mov 0x280(%rax),%rax -ffffffff8104420e: 48 8b 80 b0 02 00 00 mov 0x2b0(%rax),%rax -ffffffff81044215: 48 8b b8 e8 02 00 00 mov 0x2e8(%rax),%rdi -ffffffff8104421c: e8 2f da 00 00 callq ffffffff81051c50 <pid_vnr> -ffffffff81044221: 5d pop %rbp -ffffffff81044222: 48 98 cltq -ffffffff81044224: c3 retq -ffffffff81044225: 48 c7 c7 13 53 98 81 mov $0xffffffff81985313,%rdi -ffffffff8104422c: 31 c0 xor %eax,%eax -ffffffff8104422e: e8 60 0f 6d 00 callq ffffffff81715193 <printk> -ffffffff81044233: eb c9 jmp ffffffff810441fe <sys_getppid+0xe> -ffffffff81044235: 66 66 2e 0f 1f 84 00 data32 nopw %cs:0x0(%rax,%rax,1) -ffffffff8104423c: 00 00 00 00 + } + +The resulting instructions with jump labels generated by GCC is:: + + ffffffff81044290 <sys_getppid>: + ffffffff81044290: 55 push %rbp + ffffffff81044291: 48 89 e5 mov %rsp,%rbp + ffffffff81044294: e9 00 00 00 00 jmpq ffffffff81044299 <sys_getppid+0x9> + ffffffff81044299: 65 48 8b 04 25 c0 b6 mov %gs:0xb6c0,%rax + ffffffff810442a0: 00 00 + ffffffff810442a2: 48 8b 80 80 02 00 00 mov 0x280(%rax),%rax + ffffffff810442a9: 48 8b 80 b0 02 00 00 mov 0x2b0(%rax),%rax + ffffffff810442b0: 48 8b b8 e8 02 00 00 mov 0x2e8(%rax),%rdi + ffffffff810442b7: e8 f4 d9 00 00 callq ffffffff81051cb0 <pid_vnr> + ffffffff810442bc: 5d pop %rbp + ffffffff810442bd: 48 98 cltq + ffffffff810442bf: c3 retq + ffffffff810442c0: 48 c7 c7 e3 54 98 81 mov $0xffffffff819854e3,%rdi + ffffffff810442c7: 31 c0 xor %eax,%eax + ffffffff810442c9: e8 71 13 6d 00 callq ffffffff8171563f <printk> + ffffffff810442ce: eb c9 jmp ffffffff81044299 <sys_getppid+0x9> + +Without the jump label optimization it looks like:: + + ffffffff810441f0 <sys_getppid>: + ffffffff810441f0: 8b 05 8a 52 d8 00 mov 0xd8528a(%rip),%eax # ffffffff81dc9480 <key> + ffffffff810441f6: 55 push %rbp + ffffffff810441f7: 48 89 e5 mov %rsp,%rbp + ffffffff810441fa: 85 c0 test %eax,%eax + ffffffff810441fc: 75 27 jne ffffffff81044225 <sys_getppid+0x35> + ffffffff810441fe: 65 48 8b 04 25 c0 b6 mov %gs:0xb6c0,%rax + ffffffff81044205: 00 00 + ffffffff81044207: 48 8b 80 80 02 00 00 mov 0x280(%rax),%rax + ffffffff8104420e: 48 8b 80 b0 02 00 00 mov 0x2b0(%rax),%rax + ffffffff81044215: 48 8b b8 e8 02 00 00 mov 0x2e8(%rax),%rdi + ffffffff8104421c: e8 2f da 00 00 callq ffffffff81051c50 <pid_vnr> + ffffffff81044221: 5d pop %rbp + ffffffff81044222: 48 98 cltq + ffffffff81044224: c3 retq + ffffffff81044225: 48 c7 c7 13 53 98 81 mov $0xffffffff81985313,%rdi + ffffffff8104422c: 31 c0 xor %eax,%eax + ffffffff8104422e: e8 60 0f 6d 00 callq ffffffff81715193 <printk> + ffffffff81044233: eb c9 jmp ffffffff810441fe <sys_getppid+0xe> + ffffffff81044235: 66 66 2e 0f 1f 84 00 data32 nopw %cs:0x0(%rax,%rax,1) + ffffffff8104423c: 00 00 00 00 Thus, the disable jump label case adds a 'mov', 'test' and 'jne' instruction vs. the jump label case just has a 'no-op' or 'jmp 0'. (The jmp 0, is patched to a 5 byte atomic no-op instruction at boot-time.) Thus, the disabled jump -label case adds: +label case adds:: -6 (mov) + 2 (test) + 2 (jne) = 10 - 5 (5 byte jump 0) = 5 addition bytes. + 6 (mov) + 2 (test) + 2 (jne) = 10 - 5 (5 byte jump 0) = 5 addition bytes. If we then include the padding bytes, the jump label code saves, 16 total bytes of instruction memory for this small function. In this case the non-jump label @@ -262,7 +271,7 @@ Since there are a number of static key API uses in the scheduler paths, 'pipe-test' (also known as 'perf bench sched pipe') can be used to show the performance improvement. Testing done on 3.3.0-rc2: -jump label disabled: +jump label disabled:: Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs): @@ -279,7 +288,7 @@ jump label disabled: 1.601607384 seconds time elapsed ( +- 0.07% ) -jump label enabled: +jump label enabled:: Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs): diff --git a/Documentation/svga.txt b/Documentation/svga.txt index cd66ec836e4f..119f1515b1ac 100644 --- a/Documentation/svga.txt +++ b/Documentation/svga.txt @@ -1,24 +1,31 @@ - Video Mode Selection Support 2.13 - (c) 1995--1999 Martin Mares, <mj@ucw.cz> --------------------------------------------------------------------------------- +.. include:: <isonum.txt> -1. Intro -~~~~~~~~ - This small document describes the "Video Mode Selection" feature which +================================= +Video Mode Selection Support 2.13 +================================= + +:Copyright: |copy| 1995--1999 Martin Mares, <mj@ucw.cz> + +Intro +~~~~~ + +This small document describes the "Video Mode Selection" feature which allows the use of various special video modes supported by the video BIOS. Due to usage of the BIOS, the selection is limited to boot time (before the kernel decompression starts) and works only on 80X86 machines. - ** Short intro for the impatient: Just use vga=ask for the first time, - ** enter `scan' on the video mode prompt, pick the mode you want to use, - ** remember its mode ID (the four-digit hexadecimal number) and then - ** set the vga parameter to this number (converted to decimal first). +.. note:: - The video mode to be used is selected by a kernel parameter which can be + Short intro for the impatient: Just use vga=ask for the first time, + enter ``scan`` on the video mode prompt, pick the mode you want to use, + remember its mode ID (the four-digit hexadecimal number) and then + set the vga parameter to this number (converted to decimal first). + +The video mode to be used is selected by a kernel parameter which can be specified in the kernel Makefile (the SVGA_MODE=... line) or by the "vga=..." option of LILO (or some other boot loader you use) or by the "vidmode" utility (present in standard Linux utility packages). You can use the following values -of this parameter: +of this parameter:: NORMAL_VGA - Standard 80x25 mode available on all display adapters. @@ -37,77 +44,79 @@ of this parameter: for exact meaning of the ID). Warning: rdev and LILO don't support hexadecimal numbers -- you have to convert it to decimal manually. -2. Menu -~~~~~~~ - The ASK_VGA mode causes the kernel to offer a video mode menu upon +Menu +~~~~ + +The ASK_VGA mode causes the kernel to offer a video mode menu upon bootup. It displays a "Press <RETURN> to see video modes available, <SPACE> to continue or wait 30 secs" message. If you press <RETURN>, you enter the menu, if you press <SPACE> or wait 30 seconds, the kernel will boot up in the standard 80x25 mode. - The menu looks like: +The menu looks like:: -Video adapter: <name-of-detected-video-adapter> -Mode: COLSxROWS: -0 0F00 80x25 -1 0F01 80x50 -2 0F02 80x43 -3 0F03 80x26 -.... -Enter mode number or `scan': <flashing-cursor-here> + Video adapter: <name-of-detected-video-adapter> + Mode: COLSxROWS: + 0 0F00 80x25 + 1 0F01 80x50 + 2 0F02 80x43 + 3 0F03 80x26 + .... + Enter mode number or ``scan``: <flashing-cursor-here> - <name-of-detected-video-adapter> tells what video adapter did Linux detect +<name-of-detected-video-adapter> tells what video adapter did Linux detect -- it's either a generic adapter name (MDA, CGA, HGC, EGA, VGA, VESA VGA [a VGA with VESA-compliant BIOS]) or a chipset name (e.g., Trident). Direct detection of chipsets is turned off by default (see CONFIG_VIDEO_SVGA in chapter 4 to see how to enable it if you really want) as it's inherently unreliable due to absolutely insane PC design. - "0 0F00 80x25" means that the first menu item (the menu items are numbered +"0 0F00 80x25" means that the first menu item (the menu items are numbered from "0" to "9" and from "a" to "z") is a 80x25 mode with ID=0x0f00 (see the next section for a description of mode IDs). - <flashing-cursor-here> encourages you to enter the item number or mode ID +<flashing-cursor-here> encourages you to enter the item number or mode ID you wish to set and press <RETURN>. If the computer complains something about "Unknown mode ID", it is trying to tell you that it isn't possible to set such a mode. It's also possible to press only <RETURN> which leaves the current mode. - The mode list usually contains a few basic modes and some VESA modes. In +The mode list usually contains a few basic modes and some VESA modes. In case your chipset has been detected, some chipset-specific modes are shown as well (some of these might be missing or unusable on your machine as different BIOSes are often shipped with the same card and the mode numbers depend purely on the VGA BIOS). - The modes displayed on the menu are partially sorted: The list starts with +The modes displayed on the menu are partially sorted: The list starts with the standard modes (80x25 and 80x50) followed by "special" modes (80x28 and 80x43), local modes (if the local modes feature is enabled), VESA modes and finally SVGA modes for the auto-detected adapter. - If you are not happy with the mode list offered (e.g., if you think your card +If you are not happy with the mode list offered (e.g., if you think your card is able to do more), you can enter "scan" instead of item number / mode ID. The program will try to ask the BIOS for all possible video mode numbers and test what happens then. The screen will be probably flashing wildly for some time and strange noises will be heard from inside the monitor and so on and then, really all consistent video modes supported by your BIOS will appear (plus maybe some -`ghost modes'). If you are afraid this could damage your monitor, don't use this -function. +``ghost modes``). If you are afraid this could damage your monitor, don't use +this function. - After scanning, the mode ordering is a bit different: the auto-detected SVGA -modes are not listed at all and the modes revealed by `scan' are shown before +After scanning, the mode ordering is a bit different: the auto-detected SVGA +modes are not listed at all and the modes revealed by ``scan`` are shown before all VESA modes. -3. Mode IDs -~~~~~~~~~~~ - Because of the complexity of all the video stuff, the video mode IDs +Mode IDs +~~~~~~~~ + +Because of the complexity of all the video stuff, the video mode IDs used here are also a bit complex. A video mode ID is a 16-bit number usually expressed in a hexadecimal notation (starting with "0x"). You can set a mode by entering its mode directly if you know it even if it isn't shown on the menu. -The ID numbers can be divided to three regions: +The ID numbers can be divided to those regions:: 0x0000 to 0x00ff - menu item references. 0x0000 is the first item. Don't use outside the menu as this can change from boot to boot (especially if you - have used the `scan' feature). + have used the ``scan`` feature). 0x0100 to 0x017f - standard BIOS modes. The ID is a BIOS video mode number (as presented to INT 10, function 00) increased by 0x0100. @@ -142,53 +151,54 @@ The ID numbers can be divided to three regions: 0xffff equivalent to 0x0f00 (standard 80x25) 0xfffe equivalent to 0x0f01 (EGA 80x43 or VGA 80x50) - If you add 0x8000 to the mode ID, the program will try to recalculate +If you add 0x8000 to the mode ID, the program will try to recalculate vertical display timing according to mode parameters, which can be used to eliminate some annoying bugs of certain VGA BIOSes (usually those used for cards with S3 chipsets and old Cirrus Logic BIOSes) -- mainly extra lines at the end of the display. -4. Options -~~~~~~~~~~ - Some options can be set in the source text (in arch/i386/boot/video.S). +Options +~~~~~~~ + +Some options can be set in the source text (in arch/i386/boot/video.S). All of them are simple #define's -- change them to #undef's when you want to switch them off. Currently supported: - CONFIG_VIDEO_SVGA - enables autodetection of SVGA cards. This is switched +CONFIG_VIDEO_SVGA - enables autodetection of SVGA cards. This is switched off by default as it's a bit unreliable due to terribly bad PC design. If you -really want to have the adapter autodetected (maybe in case the `scan' feature +really want to have the adapter autodetected (maybe in case the ``scan`` feature doesn't work on your machine), switch this on and don't cry if the results are not completely sane. In case you really need this feature, please drop me a mail as I think of removing it some day. - CONFIG_VIDEO_VESA - enables autodetection of VESA modes. If it doesn't work +CONFIG_VIDEO_VESA - enables autodetection of VESA modes. If it doesn't work on your machine (or displays a "Error: Scanning of VESA modes failed" message), you can switch it off and report as a bug. - CONFIG_VIDEO_COMPACT - enables compacting of the video mode list. If there +CONFIG_VIDEO_COMPACT - enables compacting of the video mode list. If there are more modes with the same screen size, only the first one is kept (see above for more info on mode ordering). However, in very strange cases it's possible that the first "version" of the mode doesn't work although some of the others do -- in this case turn this switch off to see the rest. - CONFIG_VIDEO_RETAIN - enables retaining of screen contents when switching +CONFIG_VIDEO_RETAIN - enables retaining of screen contents when switching video modes. Works only with some boot loaders which leave enough room for the buffer. (If you have old LILO, you can adjust heap_end_ptr and loadflags in setup.S, but it's better to upgrade the boot loader...) - CONFIG_VIDEO_LOCAL - enables inclusion of "local modes" in the list. The +CONFIG_VIDEO_LOCAL - enables inclusion of "local modes" in the list. The local modes are added automatically to the beginning of the list not depending on hardware configuration. The local modes are listed in the source text after the "local_mode_table:" line. The comment before this line describes the format of the table (which also includes a video card name to be displayed on the top of the menu). - CONFIG_VIDEO_400_HACK - force setting of 400 scan lines for standard VGA +CONFIG_VIDEO_400_HACK - force setting of 400 scan lines for standard VGA modes. This option is intended to be used on certain buggy BIOSes which draw some useless logo using font download and then fail to reset the correct mode. Don't use unless needed as it forces resetting the video card. - CONFIG_VIDEO_GFX_HACK - includes special hack for setting of graphics modes +CONFIG_VIDEO_GFX_HACK - includes special hack for setting of graphics modes to be used later by special drivers (e.g., 800x600 on IBM ThinkPad -- see ftp://ftp.phys.keio.ac.jp/pub/XFree86/800x600/XF86Configs/XF86Config.IBM_TP560). Allows to set _any_ BIOS mode including graphic ones and forcing specific @@ -196,33 +206,36 @@ text screen resolution instead of peeking it from BIOS variables. Don't use unless you think you know what you're doing. To activate this setup, use mode number 0x0f08 (see section 3). -5. Still doesn't work? -~~~~~~~~~~~~~~~~~~~~~~ - When the mode detection doesn't work (e.g., the mode list is incorrect or +Still doesn't work? +~~~~~~~~~~~~~~~~~~~ + +When the mode detection doesn't work (e.g., the mode list is incorrect or the machine hangs instead of displaying the menu), try to switch off some of the configuration options listed in section 4. If it fails, you can still use your kernel with the video mode set directly via the kernel parameter. - In either case, please send me a bug report containing what _exactly_ +In either case, please send me a bug report containing what _exactly_ happens and how do the configuration switches affect the behaviour of the bug. - If you start Linux from M$-DOS, you might also use some DOS tools for +If you start Linux from M$-DOS, you might also use some DOS tools for video mode setting. In this case, you must specify the 0x0f04 mode ("leave current settings") to Linux, because if you don't and you use any non-standard mode, Linux will switch to 80x25 automatically. - If you set some extended mode and there's one or more extra lines on the +If you set some extended mode and there's one or more extra lines on the bottom of the display containing already scrolled-out text, your VGA BIOS contains the most common video BIOS bug called "incorrect vertical display end setting". Adding 0x8000 to the mode ID might fix the problem. Unfortunately, this must be done manually -- no autodetection mechanisms are available. - If you have a VGA card and your display still looks as on EGA, your BIOS +If you have a VGA card and your display still looks as on EGA, your BIOS is probably broken and you need to set the CONFIG_VIDEO_400_HACK switch to force setting of the correct mode. -6. History -~~~~~~~~~~ +History +~~~~~~~ + +=============== ================================================================ 1.0 (??-Nov-95) First version supporting all adapters supported by the old setup.S + Cirrus Logic 54XX. Present in some 1.3.4? kernels and then removed due to instability on some machines. @@ -260,17 +273,18 @@ force setting of the correct mode. original version written by hhanemaa@cs.ruu.nl, patched by Jeff Chua, rewritten by me). - Screen store/restore fixed. -2.8 (14-Apr-96) - Previous release was not compilable without CONFIG_VIDEO_SVGA. +2.8 (14-Apr-96) - Previous release was not compilable without CONFIG_VIDEO_SVGA. - Better recognition of text modes during mode scan. 2.9 (12-May-96) - Ignored VESA modes 0x80 - 0xff (more VESA BIOS bugs!) -2.10 (11-Nov-96)- The whole thing made optional. +2.10(11-Nov-96) - The whole thing made optional. - Added the CONFIG_VIDEO_400_HACK switch. - Added the CONFIG_VIDEO_GFX_HACK switch. - Code cleanup. -2.11 (03-May-97)- Yet another cleanup, now including also the documentation. - - Direct testing of SVGA adapters turned off by default, `scan' +2.11(03-May-97) - Yet another cleanup, now including also the documentation. + - Direct testing of SVGA adapters turned off by default, ``scan`` offered explicitly on the prompt line. - Removed the doc section describing adding of new probing functions as I try to get rid of _all_ hardware probing here. -2.12 (25-May-98)- Added support for VESA frame buffer graphics. -2.13 (14-May-99)- Minor documentation fixes. +2.12(25-May-98) Added support for VESA frame buffer graphics. +2.13(14-May-99) Minor documentation fixes. +=============== ================================================================ diff --git a/Documentation/sync_file.txt b/Documentation/sync_file.txt index c3d033a06e8d..496fb2c3b3e6 100644 --- a/Documentation/sync_file.txt +++ b/Documentation/sync_file.txt @@ -1,8 +1,8 @@ - Sync File API Guide - ~~~~~~~~~~~~~~~~~~~ +=================== +Sync File API Guide +=================== - Gustavo Padovan - <gustavo at padovan dot org> +:Author: Gustavo Padovan <gustavo at padovan dot org> This document serves as a guide for device drivers writers on what the sync_file API is, and how drivers can support it. Sync file is the carrier of @@ -46,16 +46,17 @@ Creating Sync Files When a driver needs to send an out-fence userspace it creates a sync_file. -Interface: +Interface:: + struct sync_file *sync_file_create(struct dma_fence *fence); The caller pass the out-fence and gets back the sync_file. That is just the first step, next it needs to install an fd on sync_file->file. So it gets an -fd: +fd:: fd = get_unused_fd_flags(O_CLOEXEC); -and installs it on sync_file->file: +and installs it on sync_file->file:: fd_install(fd, sync_file->file); @@ -71,7 +72,8 @@ When userspace needs to send an in-fence to the driver it passes file descriptor of the Sync File to the kernel. The kernel can then retrieve the fences from it. -Interface: +Interface:: + struct dma_fence *sync_file_get_fence(int fd); @@ -79,5 +81,6 @@ The returned reference is owned by the caller and must be disposed of afterwards using dma_fence_put(). In case of error, a NULL is returned instead. References: -[1] struct sync_file in include/linux/sync_file.h -[2] All interfaces mentioned above defined in include/linux/sync_file.h + +1. struct sync_file in include/linux/sync_file.h +2. All interfaces mentioned above defined in include/linux/sync_file.h diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index b4ad97f10b8e..48244c42ff52 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -240,6 +240,26 @@ fragmentation index is <= extfrag_threshold. The default value is 500. ============================================================== +highmem_is_dirtyable + +Available only for systems with CONFIG_HIGHMEM enabled (32b systems). + +This parameter controls whether the high memory is considered for dirty +writers throttling. This is not the case by default which means that +only the amount of memory directly visible/usable by the kernel can +be dirtied. As a result, on systems with a large amount of memory and +lowmem basically depleted writers might be throttled too early and +streaming writes can get very slow. + +Changing the value to non zero would allow more memory to be dirtied +and thus allow writers to write more data which can be flushed to the +storage more effectively. Note this also comes with a risk of pre-mature +OOM killer because some writers (e.g. direct block device writes) can +only use the low memory and they can fill it up with dirty data without +any throttling. + +============================================================== + hugepages_treat_as_movable This parameter controls whether we can allocate hugepages from ZONE_MOVABLE diff --git a/Documentation/tee.txt b/Documentation/tee.txt index 718599357596..56ea85ffebf2 100644 --- a/Documentation/tee.txt +++ b/Documentation/tee.txt @@ -1,4 +1,7 @@ +============= TEE subsystem +============= + This document describes the TEE subsystem in Linux. A TEE (Trusted Execution Environment) is a trusted OS running in some @@ -80,27 +83,27 @@ The GlobalPlatform TEE Client API [5] is implemented on top of the generic TEE API. Picture of the relationship between the different components in the -OP-TEE architecture. - - User space Kernel Secure world - ~~~~~~~~~~ ~~~~~~ ~~~~~~~~~~~~ - +--------+ +-------------+ - | Client | | Trusted | - +--------+ | Application | - /\ +-------------+ - || +----------+ /\ - || |tee- | || - || |supplicant| \/ - || +----------+ +-------------+ - \/ /\ | TEE Internal| - +-------+ || | API | - + TEE | || +--------+--------+ +-------------+ - | Client| || | TEE | OP-TEE | | OP-TEE | - | API | \/ | subsys | driver | | Trusted OS | - +-------+----------------+----+-------+----+-----------+-------------+ - | Generic TEE API | | OP-TEE MSG | - | IOCTL (TEE_IOC_*) | | SMCCC (OPTEE_SMC_CALL_*) | - +-----------------------------+ +------------------------------+ +OP-TEE architecture:: + + User space Kernel Secure world + ~~~~~~~~~~ ~~~~~~ ~~~~~~~~~~~~ + +--------+ +-------------+ + | Client | | Trusted | + +--------+ | Application | + /\ +-------------+ + || +----------+ /\ + || |tee- | || + || |supplicant| \/ + || +----------+ +-------------+ + \/ /\ | TEE Internal| + +-------+ || | API | + + TEE | || +--------+--------+ +-------------+ + | Client| || | TEE | OP-TEE | | OP-TEE | + | API | \/ | subsys | driver | | Trusted OS | + +-------+----------------+----+-------+----+-----------+-------------+ + | Generic TEE API | | OP-TEE MSG | + | IOCTL (TEE_IOC_*) | | SMCCC (OPTEE_SMC_CALL_*) | + +-----------------------------+ +------------------------------+ RPC (Remote Procedure Call) are requests from secure world to kernel driver or tee-supplicant. An RPC is identified by a special range of SMCCC return @@ -109,10 +112,16 @@ kernel are handled by the kernel driver. Other RPC messages will be forwarded to tee-supplicant without further involvement of the driver, except switching shared memory buffer representation. -References: +References +========== + [1] https://github.com/OP-TEE/optee_os + [2] http://infocenter.arm.com/help/topic/com.arm.doc.den0028a/index.html + [3] drivers/tee/optee/optee_smc.h + [4] drivers/tee/optee/optee_msg.h + [5] http://www.globalplatform.org/specificationsdevice.asp look for "TEE Client API Specification v1.0" and click download. diff --git a/Documentation/this_cpu_ops.txt b/Documentation/this_cpu_ops.txt index 2cbf71975381..5cb8b883ae83 100644 --- a/Documentation/this_cpu_ops.txt +++ b/Documentation/this_cpu_ops.txt @@ -1,5 +1,9 @@ +=================== this_cpu operations -------------------- +=================== + +:Author: Christoph Lameter, August 4th, 2014 +:Author: Pranith Kumar, Aug 2nd, 2014 this_cpu operations are a way of optimizing access to per cpu variables associated with the *currently* executing processor. This is @@ -39,7 +43,7 @@ operations. The following this_cpu() operations with implied preemption protection are defined. These operations can be used without worrying about -preemption and interrupts. +preemption and interrupts:: this_cpu_read(pcp) this_cpu_write(pcp, val) @@ -67,14 +71,14 @@ to relocate a per cpu relative address to the proper per cpu area for the processor. So the relocation to the per cpu base is encoded in the instruction via a segment register prefix. -For example: +For example:: DEFINE_PER_CPU(int, x); int z; z = this_cpu_read(x); -results in a single instruction +results in a single instruction:: mov ax, gs:[x] @@ -84,16 +88,16 @@ this_cpu_ops such sequence also required preempt disable/enable to prevent the kernel from moving the thread to a different processor while the calculation is performed. -Consider the following this_cpu operation: +Consider the following this_cpu operation:: this_cpu_inc(x) -The above results in the following single instruction (no lock prefix!) +The above results in the following single instruction (no lock prefix!):: inc gs:[x] instead of the following operations required if there is no segment -register: +register:: int *y; int cpu; @@ -121,8 +125,10 @@ has to be paid for this optimization is the need to add up the per cpu counters when the value of a counter is needed. -Special operations: -------------------- +Special operations +------------------ + +:: y = this_cpu_ptr(&x) @@ -153,11 +159,15 @@ Therefore the use of x or &x outside of the context of per cpu operations is invalid and will generally be treated like a NULL pointer dereference. +:: + DEFINE_PER_CPU(int, x); In the context of per cpu operations the above implies that x is a per cpu variable. Most this_cpu operations take a cpu variable. +:: + int __percpu *p = &x; &x and hence p is the *offset* of a per cpu variable. this_cpu_ptr() @@ -168,7 +178,7 @@ strange. Operations on a field of a per cpu structure -------------------------------------------- -Let's say we have a percpu structure +Let's say we have a percpu structure:: struct s { int n,m; @@ -177,14 +187,14 @@ Let's say we have a percpu structure DEFINE_PER_CPU(struct s, p); -Operations on these fields are straightforward +Operations on these fields are straightforward:: this_cpu_inc(p.m) z = this_cpu_cmpxchg(p.m, 0, 1); -If we have an offset to struct s: +If we have an offset to struct s:: struct s __percpu *ps = &p; @@ -194,7 +204,7 @@ If we have an offset to struct s: The calculation of the pointer may require the use of this_cpu_ptr() -if we do not make use of this_cpu ops later to manipulate fields: +if we do not make use of this_cpu ops later to manipulate fields:: struct s *pp; @@ -206,7 +216,7 @@ if we do not make use of this_cpu ops later to manipulate fields: Variants of this_cpu ops -------------------------- +------------------------ this_cpu ops are interrupt safe. Some architectures do not support these per cpu local operations. In that case the operation must be @@ -222,7 +232,7 @@ preemption. If a per cpu variable is not used in an interrupt context and the scheduler cannot preempt, then they are safe. If any interrupts still occur while an operation is in progress and if the interrupt too modifies the variable, then RMW actions can not be guaranteed to be -safe. +safe:: __this_cpu_read(pcp) __this_cpu_write(pcp, val) @@ -279,7 +289,7 @@ unless absolutely necessary. Please consider using an IPI to wake up the remote CPU and perform the update to its per cpu area. To access per-cpu data structure remotely, typically the per_cpu_ptr() -function is used: +function is used:: DEFINE_PER_CPU(struct data, datap); @@ -289,7 +299,7 @@ function is used: This makes it explicit that we are getting ready to access a percpu area remotely. -You can also do the following to convert the datap offset to an address +You can also do the following to convert the datap offset to an address:: struct data *p = this_cpu_ptr(&datap); @@ -305,7 +315,7 @@ the following scenario that occurs because two per cpu variables share a cache-line but the relaxed synchronization is applied to only one process updating the cache-line. -Consider the following example +Consider the following example:: struct test { @@ -327,6 +337,3 @@ mind that a remote write will evict the cache line from the processor that most likely will access it. If the processor wakes up and finds a missing local cache line of a per cpu area, its performance and hence the wake up times will be affected. - -Christoph Lameter, August 4th, 2014 -Pranith Kumar, Aug 2nd, 2014 diff --git a/Documentation/timers/NO_HZ.txt b/Documentation/timers/NO_HZ.txt index 6eaf576294f3..2dcaf9adb7a7 100644 --- a/Documentation/timers/NO_HZ.txt +++ b/Documentation/timers/NO_HZ.txt @@ -194,32 +194,9 @@ that the RCU callbacks are processed in a timely fashion. Another approach is to offload RCU callback processing to "rcuo" kthreads using the CONFIG_RCU_NOCB_CPU=y Kconfig option. The specific CPUs to -offload may be selected via several methods: - -1. One of three mutually exclusive Kconfig options specify a - build-time default for the CPUs to offload: - - a. The CONFIG_RCU_NOCB_CPU_NONE=y Kconfig option results in - no CPUs being offloaded. - - b. The CONFIG_RCU_NOCB_CPU_ZERO=y Kconfig option causes - CPU 0 to be offloaded. - - c. The CONFIG_RCU_NOCB_CPU_ALL=y Kconfig option causes all - CPUs to be offloaded. Note that the callbacks will be - offloaded to "rcuo" kthreads, and that those kthreads - will in fact run on some CPU. However, this approach - gives fine-grained control on exactly which CPUs the - callbacks run on, along with their scheduling priority - (including the default of SCHED_OTHER), and it further - allows this control to be varied dynamically at runtime. - -2. The "rcu_nocbs=" kernel boot parameter, which takes a comma-separated - list of CPUs and CPU ranges, for example, "1,3-5" selects CPUs 1, - 3, 4, and 5. The specified CPUs will be offloaded in addition to - any CPUs specified as offloaded by CONFIG_RCU_NOCB_CPU_ZERO=y or - CONFIG_RCU_NOCB_CPU_ALL=y. This means that the "rcu_nocbs=" boot - parameter has no effect for kernels built with RCU_NOCB_CPU_ALL=y. +offload may be selected using The "rcu_nocbs=" kernel boot parameter, +which takes a comma-separated list of CPUs and CPU ranges, for example, +"1,3-5" selects CPUs 1, 3, 4, and 5. The offloaded CPUs will never queue RCU callbacks, and therefore RCU never prevents offloaded CPUs from entering either dyntick-idle mode diff --git a/Documentation/trace/coresight-cpu-debug.txt b/Documentation/trace/coresight-cpu-debug.txt new file mode 100644 index 000000000000..b3da1f90b861 --- /dev/null +++ b/Documentation/trace/coresight-cpu-debug.txt @@ -0,0 +1,175 @@ + Coresight CPU Debug Module + ========================== + + Author: Leo Yan <leo.yan@linaro.org> + Date: April 5th, 2017 + +Introduction +------------ + +Coresight CPU debug module is defined in ARMv8-a architecture reference manual +(ARM DDI 0487A.k) Chapter 'Part H: External debug', the CPU can integrate +debug module and it is mainly used for two modes: self-hosted debug and +external debug. Usually the external debug mode is well known as the external +debugger connects with SoC from JTAG port; on the other hand the program can +explore debugging method which rely on self-hosted debug mode, this document +is to focus on this part. + +The debug module provides sample-based profiling extension, which can be used +to sample CPU program counter, secure state and exception level, etc; usually +every CPU has one dedicated debug module to be connected. Based on self-hosted +debug mechanism, Linux kernel can access these related registers from mmio +region when the kernel panic happens. The callback notifier for kernel panic +will dump related registers for every CPU; finally this is good for assistant +analysis for panic. + + +Implementation +-------------- + +- During driver registration, it uses EDDEVID and EDDEVID1 - two device ID + registers to decide if sample-based profiling is implemented or not. On some + platforms this hardware feature is fully or partially implemented; and if + this feature is not supported then registration will fail. + +- At the time this documentation was written, the debug driver mainly relies on + information gathered by the kernel panic callback notifier from three + sampling registers: EDPCSR, EDVIDSR and EDCIDSR: from EDPCSR we can get + program counter; EDVIDSR has information for secure state, exception level, + bit width, etc; EDCIDSR is context ID value which contains the sampled value + of CONTEXTIDR_EL1. + +- The driver supports a CPU running in either AArch64 or AArch32 mode. The + registers naming convention is a bit different between them, AArch64 uses + 'ED' for register prefix (ARM DDI 0487A.k, chapter H9.1) and AArch32 uses + 'DBG' as prefix (ARM DDI 0487A.k, chapter G5.1). The driver is unified to + use AArch64 naming convention. + +- ARMv8-a (ARM DDI 0487A.k) and ARMv7-a (ARM DDI 0406C.b) have different + register bits definition. So the driver consolidates two difference: + + If PCSROffset=0b0000, on ARMv8-a the feature of EDPCSR is not implemented; + but ARMv7-a defines "PCSR samples are offset by a value that depends on the + instruction set state". For ARMv7-a, the driver checks furthermore if CPU + runs with ARM or thumb instruction set and calibrate PCSR value, the + detailed description for offset is in ARMv7-a ARM (ARM DDI 0406C.b) chapter + C11.11.34 "DBGPCSR, Program Counter Sampling Register". + + If PCSROffset=0b0010, ARMv8-a defines "EDPCSR implemented, and samples have + no offset applied and do not sample the instruction set state in AArch32 + state". So on ARMv8 if EDDEVID1.PCSROffset is 0b0010 and the CPU operates + in AArch32 state, EDPCSR is not sampled; when the CPU operates in AArch64 + state EDPCSR is sampled and no offset are applied. + + +Clock and power domain +---------------------- + +Before accessing debug registers, we should ensure the clock and power domain +have been enabled properly. In ARMv8-a ARM (ARM DDI 0487A.k) chapter 'H9.1 +Debug registers', the debug registers are spread into two domains: the debug +domain and the CPU domain. + + +---------------+ + | | + | | + +----------+--+ | + dbg_clock -->| |**| |<-- cpu_clock + | Debug |**| CPU | + dbg_power_domain -->| |**| |<-- cpu_power_domain + +----------+--+ | + | | + | | + +---------------+ + +For debug domain, the user uses DT binding "clocks" and "power-domains" to +specify the corresponding clock source and power supply for the debug logic. +The driver calls the pm_runtime_{put|get} operations as needed to handle the +debug power domain. + +For CPU domain, the different SoC designs have different power management +schemes and finally this heavily impacts external debug module. So we can +divide into below cases: + +- On systems with a sane power controller which can behave correctly with + respect to CPU power domain, the CPU power domain can be controlled by + register EDPRCR in driver. The driver firstly writes bit EDPRCR.COREPURQ + to power up the CPU, and then writes bit EDPRCR.CORENPDRQ for emulation + of CPU power down. As result, this can ensure the CPU power domain is + powered on properly during the period when access debug related registers; + +- Some designs will power down an entire cluster if all CPUs on the cluster + are powered down - including the parts of the debug registers that should + remain powered in the debug power domain. The bits in EDPRCR are not + respected in these cases, so these designs do not support debug over + power down in the way that the CoreSight / Debug designers anticipated. + This means that even checking EDPRSR has the potential to cause a bus hang + if the target register is unpowered. + + In this case, accessing to the debug registers while they are not powered + is a recipe for disaster; so we need preventing CPU low power states at boot + time or when user enable module at the run time. Please see chapter + "How to use the module" for detailed usage info for this. + + +Device Tree Bindings +-------------------- + +See Documentation/devicetree/bindings/arm/coresight-cpu-debug.txt for details. + + +How to use the module +--------------------- + +If you want to enable debugging functionality at boot time, you can add +"coresight_cpu_debug.enable=1" to the kernel command line parameter. + +The driver also can work as module, so can enable the debugging when insmod +module: +# insmod coresight_cpu_debug.ko debug=1 + +When boot time or insmod module you have not enabled the debugging, the driver +uses the debugfs file system to provide a knob to dynamically enable or disable +debugging: + +To enable it, write a '1' into /sys/kernel/debug/coresight_cpu_debug/enable: +# echo 1 > /sys/kernel/debug/coresight_cpu_debug/enable + +To disable it, write a '0' into /sys/kernel/debug/coresight_cpu_debug/enable: +# echo 0 > /sys/kernel/debug/coresight_cpu_debug/enable + +As explained in chapter "Clock and power domain", if you are working on one +platform which has idle states to power off debug logic and the power +controller cannot work well for the request from EDPRCR, then you should +firstly constraint CPU idle states before enable CPU debugging feature; so can +ensure the accessing to debug logic. + +If you want to limit idle states at boot time, you can use "nohlt" or +"cpuidle.off=1" in the kernel command line. + +At the runtime you can disable idle states with below methods: + +Set latency request to /dev/cpu_dma_latency to disable all CPUs specific idle +states (if latency = 0uS then disable all idle states): +# echo "what_ever_latency_you_need_in_uS" > /dev/cpu_dma_latency + +Disable specific CPU's specific idle state: +# echo 1 > /sys/devices/system/cpu/cpu$cpu/cpuidle/state$state/disable + + +Output format +------------- + +Here is an example of the debugging output format: + +ARM external debug module: +coresight-cpu-debug 850000.debug: CPU[0]: +coresight-cpu-debug 850000.debug: EDPRSR: 00000001 (Power:On DLK:Unlock) +coresight-cpu-debug 850000.debug: EDPCSR: [<ffff00000808e9bc>] handle_IPI+0x174/0x1d8 +coresight-cpu-debug 850000.debug: EDCIDSR: 00000000 +coresight-cpu-debug 850000.debug: EDVIDSR: 90000000 (State:Non-secure Mode:EL1/0 Width:64bits VMID:0) +coresight-cpu-debug 852000.debug: CPU[1]: +coresight-cpu-debug 852000.debug: EDPRSR: 00000001 (Power:On DLK:Unlock) +coresight-cpu-debug 852000.debug: EDPCSR: [<ffff0000087fab34>] debug_notifier_call+0x23c/0x358 +coresight-cpu-debug 852000.debug: EDCIDSR: 00000000 +coresight-cpu-debug 852000.debug: EDVIDSR: 90000000 (State:Non-secure Mode:EL1/0 Width:64bits VMID:0) diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt index 94a987bd2bc5..d4601df6e72e 100644 --- a/Documentation/trace/ftrace.txt +++ b/Documentation/trace/ftrace.txt @@ -5,10 +5,11 @@ Copyright 2008 Red Hat Inc. Author: Steven Rostedt <srostedt@redhat.com> License: The GNU Free Documentation License, Version 1.2 (dual licensed under the GPL v2) -Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton, - John Kacur, and David Teigland. +Original Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton, + John Kacur, and David Teigland. Written for: 2.6.28-rc2 Updated for: 3.10 +Updated for: 4.13 - Copyright 2017 VMware Inc. Steven Rostedt Introduction ------------ @@ -26,9 +27,11 @@ a task is woken to the task is actually scheduled in. One of the most common uses of ftrace is the event tracing. Through out the kernel is hundreds of static event points that -can be enabled via the debugfs file system to see what is +can be enabled via the tracefs file system to see what is going on in certain parts of the kernel. +See events.txt for more information. + Implementation Details ---------------------- @@ -39,34 +42,47 @@ See ftrace-design.txt for details for arch porters and such. The File System --------------- -Ftrace uses the debugfs file system to hold the control files as +Ftrace uses the tracefs file system to hold the control files as well as the files to display output. -When debugfs is configured into the kernel (which selecting any ftrace -option will do) the directory /sys/kernel/debug will be created. To mount +When tracefs is configured into the kernel (which selecting any ftrace +option will do) the directory /sys/kernel/tracing will be created. To mount this directory, you can add to your /etc/fstab file: - debugfs /sys/kernel/debug debugfs defaults 0 0 + tracefs /sys/kernel/tracing tracefs defaults 0 0 Or you can mount it at run time with: - mount -t debugfs nodev /sys/kernel/debug + mount -t tracefs nodev /sys/kernel/tracing For quicker access to that directory you may want to make a soft link to it: - ln -s /sys/kernel/debug /debug + ln -s /sys/kernel/tracing /tracing + + *** NOTICE *** + +Before 4.1, all ftrace tracing control files were within the debugfs +file system, which is typically located at /sys/kernel/debug/tracing. +For backward compatibility, when mounting the debugfs file system, +the tracefs file system will be automatically mounted at: + + /sys/kernel/debug/tracing -Any selected ftrace option will also create a directory called tracing -within the debugfs. The rest of the document will assume that you are in -the ftrace directory (cd /sys/kernel/debug/tracing) and will only concentrate -on the files within that directory and not distract from the content with -the extended "/sys/kernel/debug/tracing" path name. +All files located in the tracefs file system will be located in that +debugfs file system directory as well. + + *** NOTICE *** + +Any selected ftrace option will also create the tracefs file system. +The rest of the document will assume that you are in the ftrace directory +(cd /sys/kernel/tracing) and will only concentrate on the files within that +directory and not distract from the content with the extended +"/sys/kernel/tracing" path name. That's it! (assuming that you have ftrace configured into your kernel) -After mounting debugfs, you can see a directory called -"tracing". This directory contains the control and output files +After mounting tracefs you will have access to the control and output files of ftrace. Here is a list of some of the key files: @@ -92,10 +108,20 @@ of ftrace. Here is a list of some of the key files: writing to the ring buffer, the tracing overhead may still be occurring. + The kernel function tracing_off() can be used within the + kernel to disable writing to the ring buffer, which will + set this file to "0". User space can re-enable tracing by + echoing "1" into the file. + + Note, the function and event trigger "traceoff" will also + set this file to zero and stop tracing. Which can also + be re-enabled by user space using this file. + trace: This file holds the output of the trace in a human - readable format (described below). + readable format (described below). Note, tracing is temporarily + disabled while this file is being read (opened). trace_pipe: @@ -109,7 +135,8 @@ of ftrace. Here is a list of some of the key files: will not be read again with a sequential read. The "trace" file is static, and if the tracer is not adding more data, it will display the same - information every time it is read. + information every time it is read. This file will not + disable tracing while being read. trace_options: @@ -128,12 +155,14 @@ of ftrace. Here is a list of some of the key files: tracing_max_latency: Some of the tracers record the max latency. - For example, the time interrupts are disabled. - This time is saved in this file. The max trace - will also be stored, and displayed by "trace". - A new max trace will only be recorded if the - latency is greater than the value in this - file. (in microseconds) + For example, the maximum time that interrupts are disabled. + The maximum time is saved in this file. The max trace will also be + stored, and displayed by "trace". A new max trace will only be + recorded if the latency is greater than the value in this file + (in microseconds). + + By echoing in a time into this file, no latency will be recorded + unless it is greater than the time in this file. tracing_thresh: @@ -152,32 +181,34 @@ of ftrace. Here is a list of some of the key files: that the kernel uses for allocation, usually 4 KB in size). If the last page allocated has room for more bytes than requested, the rest of the page will be used, - making the actual allocation bigger than requested. + making the actual allocation bigger than requested or shown. ( Note, the size may not be a multiple of the page size due to buffer management meta-data. ) + Buffer sizes for individual CPUs may vary + (see "per_cpu/cpu0/buffer_size_kb" below), and if they do + this file will show "X". + buffer_total_size_kb: This displays the total combined size of all the trace buffers. free_buffer: - If a process is performing the tracing, and the ring buffer - should be shrunk "freed" when the process is finished, even - if it were to be killed by a signal, this file can be used - for that purpose. On close of this file, the ring buffer will - be resized to its minimum size. Having a process that is tracing - also open this file, when the process exits its file descriptor - for this file will be closed, and in doing so, the ring buffer - will be "freed". + If a process is performing tracing, and the ring buffer should be + shrunk "freed" when the process is finished, even if it were to be + killed by a signal, this file can be used for that purpose. On close + of this file, the ring buffer will be resized to its minimum size. + Having a process that is tracing also open this file, when the process + exits its file descriptor for this file will be closed, and in doing so, + the ring buffer will be "freed". It may also stop tracing if disable_on_free option is set. tracing_cpumask: - This is a mask that lets the user only trace - on specified CPUs. The format is a hex string - representing the CPUs. + This is a mask that lets the user only trace on specified CPUs. + The format is a hex string representing the CPUs. set_ftrace_filter: @@ -190,6 +221,9 @@ of ftrace. Here is a list of some of the key files: to be traced. Echoing names of functions into this file will limit the trace to only those functions. + The functions listed in "available_filter_functions" are what + can be written into this file. + This interface also allows for commands to be used. See the "Filter commands" section for more details. @@ -202,7 +236,14 @@ of ftrace. Here is a list of some of the key files: set_ftrace_pid: - Have the function tracer only trace a single thread. + Have the function tracer only trace the threads whose PID are + listed in this file. + + If the "function-fork" option is set, then when a task whose + PID is listed in this file forks, the child's PID will + automatically be added to this file, and the child will be + traced by the function tracer as well. This option will also + cause PIDs of tasks that exit to be removed from the file. set_event_pid: @@ -217,17 +258,28 @@ of ftrace. Here is a list of some of the key files: set_graph_function: - Set a "trigger" function where tracing should start - with the function graph tracer (See the section - "dynamic ftrace" for more details). + Functions listed in this file will cause the function graph + tracer to only trace these functions and the functions that + they call. (See the section "dynamic ftrace" for more details). + + set_graph_notrace: + + Similar to set_graph_function, but will disable function graph + tracing when the function is hit until it exits the function. + This makes it possible to ignore tracing functions that are called + by a specific function. available_filter_functions: - This lists the functions that ftrace - has processed and can trace. These are the function - names that you can pass to "set_ftrace_filter" or - "set_ftrace_notrace". (See the section "dynamic ftrace" - below for more details.) + This lists the functions that ftrace has processed and can trace. + These are the function names that you can pass to + "set_ftrace_filter" or "set_ftrace_notrace". + (See the section "dynamic ftrace" below for more details.) + + dyn_ftrace_total_info: + + This file is for debugging purposes. The number of functions that + have been converted to nops and are available to be traced. enabled_functions: @@ -250,12 +302,21 @@ of ftrace. Here is a list of some of the key files: an 'I' will be displayed on the same line as the function that can be overridden. + If the architecture supports it, it will also show what callback + is being directly called by the function. If the count is greater + than 1 it most likely will be ftrace_ops_list_func(). + + If the callback of the function jumps to a trampoline that is + specific to a the callback and not the standard trampoline, + its address will be printed as well as the function that the + trampoline calls. + function_profile_enabled: When set it will enable all functions with either the function - tracer, or if enabled, the function graph tracer. It will + tracer, or if configured, the function graph tracer. It will keep a histogram of the number of functions that were called - and if run with the function graph tracer, it will also keep + and if the function graph tracer was configured, it will also keep track of the time spent in those functions. The histogram content can be displayed in the files: @@ -283,12 +344,11 @@ of ftrace. Here is a list of some of the key files: printk_formats: This is for tools that read the raw format files. If an event in - the ring buffer references a string (currently only trace_printk() - does this), only a pointer to the string is recorded into the buffer - and not the string itself. This prevents tools from knowing what - that string was. This file displays the string and address for - the string allowing tools to map the pointers to what the - strings were. + the ring buffer references a string, only a pointer to the string + is recorded into the buffer and not the string itself. This prevents + tools from knowing what that string was. This file displays the string + and address for the string allowing tools to map the pointers to what + the strings were. saved_cmdlines: @@ -298,6 +358,22 @@ of ftrace. Here is a list of some of the key files: comms for events. If a pid for a comm is not listed, then "<...>" is displayed in the output. + If the option "record-cmd" is set to "0", then comms of tasks + will not be saved during recording. By default, it is enabled. + + saved_cmdlines_size: + + By default, 128 comms are saved (see "saved_cmdlines" above). To + increase or decrease the amount of comms that are cached, echo + in a the number of comms to cache, into this file. + + saved_tgids: + + If the option "record-tgid" is set, on each scheduling context switch + the Task Group ID of a task is saved in a table mapping the PID of + the thread to its TGID. By default, the "record-tgid" option is + disabled. + snapshot: This displays the "snapshot" buffer and also lets the user @@ -336,6 +412,9 @@ of ftrace. Here is a list of some of the key files: # cat trace_clock [local] global counter x86-tsc + The clock with the square brackets around it is the one + in effect. + local: Default clock, but may not be in sync across CPUs global: This clock is in sync with all CPUs but may @@ -448,6 +527,23 @@ of ftrace. Here is a list of some of the key files: See events.txt for more information. + set_event: + + By echoing in the event into this file, will enable that event. + + See events.txt for more information. + + available_events: + + A list of events that can be enabled in tracing. + + See events.txt for more information. + + hwlat_detector: + + Directory for the Hardware Latency Detector. + See "Hardware Latency Detector" section below. + per_cpu: This is a directory that contains the trace per_cpu information. @@ -539,13 +635,25 @@ Here is the list of current tracers that may be configured. to draw a graph of function calls similar to C code source. + "blk" + + The block tracer. The tracer used by the blktrace user + application. + + "hwlat" + + The Hardware Latency tracer is used to detect if the hardware + produces any latency. See "Hardware Latency Detector" section + below. + "irqsoff" Traces the areas that disable interrupts and saves the trace with the longest max latency. See tracing_max_latency. When a new max is recorded, it replaces the old trace. It is best to view this - trace with the latency-format option enabled. + trace with the latency-format option enabled, which + happens automatically when the tracer is selected. "preemptoff" @@ -571,6 +679,26 @@ Here is the list of current tracers that may be configured. RT tasks (as the current "wakeup" does). This is useful for those interested in wake up timings of RT tasks. + "wakeup_dl" + + Traces and records the max latency that it takes for + a SCHED_DEADLINE task to be woken (as the "wakeup" and + "wakeup_rt" does). + + "mmiotrace" + + A special tracer that is used to trace binary module. + It will trace all the calls that a module makes to the + hardware. Everything it writes and reads from the I/O + as well. + + "branch" + + This tracer can be configured when tracing likely/unlikely + calls within the kernel. It will trace when a likely and + unlikely branch is hit and if it was correct in its prediction + of being correct. + "nop" This is the "trace nothing" tracer. To remove all @@ -582,7 +710,7 @@ Examples of using the tracer ---------------------------- Here are typical examples of using the tracers when controlling -them only with the debugfs interface (without using any +them only with the tracefs interface (without using any user-land utilities). Output format: @@ -674,7 +802,7 @@ why a latency happened. Here is a typical trace. This shows that the current tracer is "irqsoff" tracing the time for which interrupts were disabled. It gives the trace version (which never changes) and the version of the kernel upon which this was executed on -(3.10). Then it displays the max latency in microseconds (259 us). The number +(3.8). Then it displays the max latency in microseconds (259 us). The number of trace entries displayed and the total number (both are four: #4/4). VP, KP, SP, and HP are always zero and are reserved for later use. #P is the number of online CPUs (#P:4). @@ -709,6 +837,8 @@ explains which is which. '.' otherwise. hardirq/softirq: + 'Z' - NMI occurred inside a hardirq + 'z' - NMI is running 'H' - hard irq occurred inside a softirq. 'h' - hard irq is running 's' - soft irq is running @@ -757,24 +887,24 @@ nohex nobin noblock trace_printk -nobranch annotate nouserstacktrace nosym-userobj noprintk-msg-only context-info nolatency-format -sleep-time -graph-time record-cmd +norecord-tgid overwrite nodisable_on_free irq-info markers noevent-fork function-trace +nofunction-fork nodisplay-graph nostacktrace +nobranch To disable one of the options, echo in the option prepended with "no". @@ -830,8 +960,6 @@ Here are the available options: trace_printk - Can disable trace_printk() from writing into the buffer. - branch - Enable branch tracing with the tracer. - annotate - It is sometimes confusing when the CPU buffers are full and one CPU buffer had a lot of events recently, thus a shorter time frame, were another CPU may have only had @@ -850,7 +978,8 @@ Here are the available options: <idle>-0 [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock userstacktrace - This option changes the trace. It records a - stacktrace of the current userspace thread. + stacktrace of the current user space thread after + each trace event. sym-userobj - when user stacktrace are enabled, look up which object the address belongs to, and print a @@ -873,29 +1002,21 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] context-info - Show only the event data. Hides the comm, PID, timestamp, CPU, and other useful data. - latency-format - This option changes the trace. When - it is enabled, the trace displays - additional information about the - latencies, as described in "Latency - trace format". - - sleep-time - When running function graph tracer, to include - the time a task schedules out in its function. - When enabled, it will account time the task has been - scheduled out as part of the function call. - - graph-time - When running function profiler with function graph tracer, - to include the time to call nested functions. When this is - not set, the time reported for the function will only - include the time the function itself executed for, not the - time for functions that it called. + latency-format - This option changes the trace output. When it is enabled, + the trace displays additional information about the + latency, as described in "Latency trace format". record-cmd - When any event or tracer is enabled, a hook is enabled - in the sched_switch trace point to fill comm cache + in the sched_switch trace point to fill comm cache with mapped pids and comms. But this may cause some overhead, and if you only care about pids, and not the name of the task, disabling this option can lower the - impact of tracing. + impact of tracing. See "saved_cmdlines". + + record-tgid - When any event or tracer is enabled, a hook is enabled + in the sched_switch trace point to fill the cache of + mapped Thread Group IDs (TGID) mapping to pids. See + "saved_tgids". overwrite - This controls what happens when the trace buffer is full. If "1" (default), the oldest events are @@ -935,19 +1056,98 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] functions. This keeps the overhead of the tracer down when performing latency tests. + function-fork - When set, tasks with PIDs listed in set_ftrace_pid will + have the PIDs of their children added to set_ftrace_pid + when those tasks fork. Also, when tasks with PIDs in + set_ftrace_pid exit, their PIDs will be removed from the + file. + display-graph - When set, the latency tracers (irqsoff, wakeup, etc) will use function graph tracing instead of function tracing. - stacktrace - This is one of the options that changes the trace - itself. When a trace is recorded, so is the stack - of functions. This allows for back traces of - trace sites. + stacktrace - When set, a stack trace is recorded after any trace event + is recorded. + + branch - Enable branch tracing with the tracer. This enables branch + tracer along with the currently set tracer. Enabling this + with the "nop" tracer is the same as just enabling the + "branch" tracer. Note: Some tracers have their own options. They only appear in this file when the tracer is active. They always appear in the options directory. +Here are the per tracer options: + +Options for function tracer: + + func_stack_trace - When set, a stack trace is recorded after every + function that is recorded. NOTE! Limit the functions + that are recorded before enabling this, with + "set_ftrace_filter" otherwise the system performance + will be critically degraded. Remember to disable + this option before clearing the function filter. + +Options for function_graph tracer: + + Since the function_graph tracer has a slightly different output + it has its own options to control what is displayed. + + funcgraph-overrun - When set, the "overrun" of the graph stack is + displayed after each function traced. The + overrun, is when the stack depth of the calls + is greater than what is reserved for each task. + Each task has a fixed array of functions to + trace in the call graph. If the depth of the + calls exceeds that, the function is not traced. + The overrun is the number of functions missed + due to exceeding this array. + + funcgraph-cpu - When set, the CPU number of the CPU where the trace + occurred is displayed. + + funcgraph-overhead - When set, if the function takes longer than + A certain amount, then a delay marker is + displayed. See "delay" above, under the + header description. + + funcgraph-proc - Unlike other tracers, the process' command line + is not displayed by default, but instead only + when a task is traced in and out during a context + switch. Enabling this options has the command + of each process displayed at every line. + + funcgraph-duration - At the end of each function (the return) + the duration of the amount of time in the + function is displayed in microseconds. + + funcgraph-abstime - When set, the timestamp is displayed at each + line. + + funcgraph-irqs - When disabled, functions that happen inside an + interrupt will not be traced. + + funcgraph-tail - When set, the return event will include the function + that it represents. By default this is off, and + only a closing curly bracket "}" is displayed for + the return of a function. + + sleep-time - When running function graph tracer, to include + the time a task schedules out in its function. + When enabled, it will account time the task has been + scheduled out as part of the function call. + + graph-time - When running function profiler with function graph tracer, + to include the time to call nested functions. When this is + not set, the time reported for the function will only + include the time the function itself executed for, not the + time for functions that it called. + +Options for blk tracer: + + blk_classic - Shows a more minimalistic output. + irqsoff ------- @@ -1609,7 +1809,7 @@ Doing the same with chrt -r 5 and function-trace set. <idle>-0 3dN.2 14us : sched_avg_update <-__cpu_load_update <idle>-0 3dN.2 14us : _raw_spin_unlock <-cpu_load_update_nohz <idle>-0 3dN.2 14us : sub_preempt_count <-_raw_spin_unlock - <idle>-0 3dN.1 15us : calc_load_exit_idle <-tick_nohz_idle_exit + <idle>-0 3dN.1 15us : calc_load_nohz_stop <-tick_nohz_idle_exit <idle>-0 3dN.1 15us : touch_softlockup_watchdog <-tick_nohz_idle_exit <idle>-0 3dN.1 15us : hrtimer_cancel <-tick_nohz_idle_exit <idle>-0 3dN.1 15us : hrtimer_try_to_cancel <-hrtimer_cancel @@ -1711,6 +1911,85 @@ events. <idle>-0 2d..3 6us : 0:120:R ==> [002] 5882: 94:R sleep +Hardware Latency Detector +------------------------- + +The hardware latency detector is executed by enabling the "hwlat" tracer. + +NOTE, this tracer will affect the performance of the system as it will +periodically make a CPU constantly busy with interrupts disabled. + + # echo hwlat > current_tracer + # sleep 100 + # cat trace +# tracer: hwlat +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + <...>-3638 [001] d... 19452.055471: #1 inner/outer(us): 12/14 ts:1499801089.066141940 + <...>-3638 [003] d... 19454.071354: #2 inner/outer(us): 11/9 ts:1499801091.082164365 + <...>-3638 [002] dn.. 19461.126852: #3 inner/outer(us): 12/9 ts:1499801098.138150062 + <...>-3638 [001] d... 19488.340960: #4 inner/outer(us): 8/12 ts:1499801125.354139633 + <...>-3638 [003] d... 19494.388553: #5 inner/outer(us): 8/12 ts:1499801131.402150961 + <...>-3638 [003] d... 19501.283419: #6 inner/outer(us): 0/12 ts:1499801138.297435289 nmi-total:4 nmi-count:1 + + +The above output is somewhat the same in the header. All events will have +interrupts disabled 'd'. Under the FUNCTION title there is: + + #1 - This is the count of events recorded that were greater than the + tracing_threshold (See below). + + inner/outer(us): 12/14 + + This shows two numbers as "inner latency" and "outer latency". The test + runs in a loop checking a timestamp twice. The latency detected within + the two timestamps is the "inner latency" and the latency detected + after the previous timestamp and the next timestamp in the loop is + the "outer latency". + + ts:1499801089.066141940 + + The absolute timestamp that the event happened. + + nmi-total:4 nmi-count:1 + + On architectures that support it, if an NMI comes in during the + test, the time spent in NMI is reported in "nmi-total" (in + microseconds). + + All architectures that have NMIs will show the "nmi-count" if an + NMI comes in during the test. + +hwlat files: + + tracing_threshold - This gets automatically set to "10" to represent 10 + microseconds. This is the threshold of latency that + needs to be detected before the trace will be recorded. + + Note, when hwlat tracer is finished (another tracer is + written into "current_tracer"), the original value for + tracing_threshold is placed back into this file. + + hwlat_detector/width - The length of time the test runs with interrupts + disabled. + + hwlat_detector/window - The length of time of the window which the test + runs. That is, the test will run for "width" + microseconds per "window" microseconds + + tracing_cpumask - When the test is started. A kernel thread is created that + runs the test. This thread will alternate between CPUs + listed in the tracing_cpumask between each period + (one "window"). To limit the test to specific CPUs + set the mask in this file to only the CPUs that the test + should run on. + function -------- @@ -1821,15 +2100,15 @@ something like this simple program: #define STR(x) _STR(x) #define MAX_PATH 256 -const char *find_debugfs(void) +const char *find_tracefs(void) { - static char debugfs[MAX_PATH+1]; - static int debugfs_found; + static char tracefs[MAX_PATH+1]; + static int tracefs_found; char type[100]; FILE *fp; - if (debugfs_found) - return debugfs; + if (tracefs_found) + return tracefs; if ((fp = fopen("/proc/mounts","r")) == NULL) { perror("/proc/mounts"); @@ -1839,27 +2118,27 @@ const char *find_debugfs(void) while (fscanf(fp, "%*s %" STR(MAX_PATH) "s %99s %*s %*d %*d\n", - debugfs, type) == 2) { - if (strcmp(type, "debugfs") == 0) + tracefs, type) == 2) { + if (strcmp(type, "tracefs") == 0) break; } fclose(fp); - if (strcmp(type, "debugfs") != 0) { - fprintf(stderr, "debugfs not mounted"); + if (strcmp(type, "tracefs") != 0) { + fprintf(stderr, "tracefs not mounted"); return NULL; } - strcat(debugfs, "/tracing/"); - debugfs_found = 1; + strcat(tracefs, "/tracing/"); + tracefs_found = 1; - return debugfs; + return tracefs; } const char *tracing_file(const char *file_name) { static char trace_file[MAX_PATH+1]; - snprintf(trace_file, MAX_PATH, "%s/%s", find_debugfs(), file_name); + snprintf(trace_file, MAX_PATH, "%s/%s", find_tracefs(), file_name); return trace_file; } @@ -1898,12 +2177,12 @@ Or this simple script! ------ #!/bin/bash -debugfs=`sed -ne 's/^debugfs \(.*\) debugfs.*/\1/p' /proc/mounts` -echo nop > $debugfs/tracing/current_tracer -echo 0 > $debugfs/tracing/tracing_on -echo $$ > $debugfs/tracing/set_ftrace_pid -echo function > $debugfs/tracing/current_tracer -echo 1 > $debugfs/tracing/tracing_on +tracefs=`sed -ne 's/^tracefs \(.*\) tracefs.*/\1/p' /proc/mounts` +echo nop > $tracefs/tracing/current_tracer +echo 0 > $tracefs/tracing/tracing_on +echo $$ > $tracefs/tracing/set_ftrace_pid +echo function > $tracefs/tracing/current_tracer +echo 1 > $tracefs/tracing/tracing_on exec "$@" ------ @@ -2145,13 +2424,18 @@ include the -pg switch in the compiling of the kernel.) At compile time every C file object is run through the recordmcount program (located in the scripts directory). This program will parse the ELF headers in the C object to find all -the locations in the .text section that call mcount. (Note, only -white listed .text sections are processed, since processing other -sections like .init.text may cause races due to those sections -being freed unexpectedly). - -A new section called "__mcount_loc" is created that holds -references to all the mcount call sites in the .text section. +the locations in the .text section that call mcount. Starting +with gcc verson 4.6, the -mfentry has been added for x86, which +calls "__fentry__" instead of "mcount". Which is called before +the creation of the stack frame. + +Note, not all sections are traced. They may be prevented by either +a notrace, or blocked another way and all inline functions are not +traced. Check the "available_filter_functions" file to see what functions +can be traced. + +A section called "__mcount_loc" is created that holds +references to all the mcount/fentry call sites in the .text section. The recordmcount program re-links this section back into the original object. The final linking stage of the kernel will add all these references into a single table. @@ -2679,7 +2963,7 @@ in time without stopping tracing. Ftrace swaps the current buffer with a spare buffer, and tracing continues in the new current (=previous spare) buffer. -The following debugfs files in "tracing" are related to this +The following tracefs files in "tracing" are related to this feature: snapshot: @@ -2752,7 +3036,7 @@ cat: snapshot: Device or resource busy Instances --------- -In the debugfs tracing directory is a directory called "instances". +In the tracefs tracing directory is a directory called "instances". This directory can have new directories created inside of it using mkdir, and removing directories with rmdir. The directory created with mkdir in this directory will already contain files and other diff --git a/Documentation/translations/ja_JP/howto.rst b/Documentation/translations/ja_JP/howto.rst index 4511eed0fabb..8d7ed0cbbf5f 100644 --- a/Documentation/translations/ja_JP/howto.rst +++ b/Documentation/translations/ja_JP/howto.rst @@ -197,13 +197,6 @@ ReSTใใผใฏใขใใใไฝฟใฃใใใญใฅใกใณใใฏ Documentation/outputใซ็ make latexdocs make epubdocs -็พๅจใๅนพใคใใฎ DocBookๅฝขๅผใงๆธใใใใใญใฅใกใณใใฏ ReSTๅฝขๅผใซ่ปขๆไธญใง -ใใใใใใฎใใญใฅใกใณใใฏDocumentation/DocBook ใใฃใฌใฏใใชใซ็ๆใใใ -Postscript ใพใใฏ man ใใผใธใฎๅฝขๅผใ็ๆใใใซใฏไปฅไธใฎใใใซใใพใ - :: - - make psdocs - make mandocs - ใซใผใใซ้็บ่
ใซใชใใซใฏ ------------------------ diff --git a/Documentation/translations/ko_KR/howto.rst b/Documentation/translations/ko_KR/howto.rst index 2333697251dd..624654bdcd8a 100644 --- a/Documentation/translations/ko_KR/howto.rst +++ b/Documentation/translations/ko_KR/howto.rst @@ -191,13 +191,6 @@ ReST ๋งํฌ์
์ ์ฌ์ฉํ๋ ๋ฌธ์๋ค์ Documentation/output ์ ์์ฑ๋๋ make latexdocs make epubdocs -ํ์ฌ, ReST ๋ก์ ๋ณํ์ด ์งํ์ค์ธ, DocBook ์ผ๋ก ์ฐ์ธ ๋ฌธ์๋ค์ด ์กด์ฌํ๋ค. ๊ทธ๋ฐ -๋ฌธ์๋ค์ Documentation/DocBook/ ๋๋ ํ ๋ฆฌ ์์ ์์ฑ๋ ๊ฒ์ด๊ณ ๋ค์ ์ปค๋งจ๋๋ฅผ ํตํด -Postscript ๋ man page ๋ก๋ ๋ง๋ค์ด์ง ์ ์๋ค:: - - make psdocs - make mandocs - ์ปค๋ ๊ฐ๋ฐ์๊ฐ ๋๋ ๊ฒ --------------------- @@ -270,15 +263,17 @@ pub/linux/kernel/v4.x/ ๋๋ ํ ๋ฆฌ์์ ์ฐธ์กฐ๋ ์ ์๋ค.๊ฐ๋ฐ ํ๋ก์ธ์ ์ ํธ๋๋ ๋ฐฉ๋ฒ์ git(์ปค๋์ ์์ค ๊ด๋ฆฌ ํด, ๋ ๋ง์ ์ ๋ณด๋ค์ https://git-scm.com/ ์์ ์ฐธ์กฐํ ์ ์๋ค)๋ฅผ ์ฌ์ฉํ๋ ๊ฒ์ด์ง๋ง ์์ํ ํจ์นํ์ผ์ ํ์์ผ๋ก ๋ณด๋ด๋ ๊ฒ๋ ๋ฌด๊ดํ๋ค. - - 2์ฃผ ํ์ -rc1 ์ปค๋์ด ๋ฐฐํฌ๋๋ฉฐ ์ง๊ธ๋ถํฐ๋ ์ ์ฒด ์ปค๋์ ์์ ์ฑ์ ์ํฅ์ - ๋ฏธ์น ์ ์๋ ์๋ก์ด ๊ธฐ๋ฅ๋ค์ ํฌํจํ์ง ์๋ ํจ์น๋ค๋ง์ด ์ถ๊ฐ๋ ์ ์๋ค. - ์์ ํ ์๋ก์ด ๋๋ผ์ด๋ฒ(ํน์ ํ์ผ์์คํ
)๋ -rc1 ์ดํ์๋ง ๋ฐ์๋ค์ฌ์ง๋ค๋ - ๊ฒ์ ๊ธฐ์ตํด๋ผ. ์๋ํ๋ฉด ๋ณ๊ฒฝ์ด ์์ฒด๋ด์์๋ง ๋ฐ์ํ๊ณ ์ถ๊ฐ๋ ์ฝ๋๊ฐ - ๋๋ผ์ด๋ฒ ์ธ๋ถ์ ๋ค๋ฅธ ๋ถ๋ถ์๋ ์ํฅ์ ์ฃผ์ง ์์ผ๋ฏ๋ก ๊ทธ๋ฐ ๋ณ๊ฒฝ์ + - 2์ฃผ ํ์ -rc1 ์ปค๋์ด ๋ฆด๋ฆฌ์ฆ๋๋ฉฐ ์ฌ๊ธฐ์๋ถํฐ์ ์ฃผ์์ ์ ์๋ก์ด ์ปค๋์ + ๊ฐ๋ฅํํ ์์ ๋๊ฒ ํ๋ ๊ฒ์ด๋ค. ์ด ์์ ์์์ ๋๋ถ๋ถ์ ํจ์น๋ค์ ํ๊ท(์ญ์์ฃผ: ์ด์ ์๋ ์กด์ฌํ์ง ์์์ง๋ง ์๋ก์ด ๊ธฐ๋ฅ์ถ๊ฐ๋ ๋ณ๊ฒฝ์ผ๋ก ์ธํด - ์๊ฒจ๋ ๋ฒ๊ทธ)๋ฅผ ์ผ์ผํฌ ๋งํ ์ํ์ ๊ฐ์ง๊ณ ์์ง ์๊ธฐ ๋๋ฌธ์ด๋ค. -rc1์ด - ๋ฐฐํฌ๋ ์ดํ์ git๋ฅผ ์ฌ์ฉํ์ฌ ํจ์น๋ค์ Linus์๊ฒ ๋ณด๋ผ์ ์์ง๋ง ํจ์น๋ค์ - ๊ณต์์ ์ธ ๋ฉ์ผ๋ง ๋ฆฌ์คํธ๋ก ๋ณด๋ด์ ๊ฒํ ๋ฅผ ๋ฐ์ ํ์๊ฐ ์๋ค. + ์๊ฒจ๋ ๋ฒ๊ทธ)๋ฅผ ๊ณ ์ณ์ผ ํ๋ค. ์ด์ ๋ถํฐ ์กด์ฌํ ๋ฒ๊ทธ๋ ํ๊ท๊ฐ ์๋๋ฏ๋ก, ๊ทธ๋ฐ + ๋ฒ๊ทธ์ ๋ํ ์์ ์ฌํญ์ ์ค์ํ ๊ฒฝ์ฐ์๋ง ๋ณด๋ด์ ธ์ผ ํ๋ค. ์์ ํ ์๋ก์ด + ๋๋ผ์ด๋ฒ(ํน์ ํ์ผ์์คํ
)๋ -rc1 ์ดํ์๋ง ๋ฐ์๋ค์ฌ์ง๋ค๋ ๊ฒ์ ๊ธฐ์ตํด๋ผ. + ์๋ํ๋ฉด ๋ณ๊ฒฝ์ด ์์ฒด๋ด์์๋ง ๋ฐ์ํ๊ณ ์ถ๊ฐ๋ ์ฝ๋๊ฐ ๋๋ผ์ด๋ฒ ์ธ๋ถ์ ๋ค๋ฅธ + ๋ถ๋ถ์๋ ์ํฅ์ ์ฃผ์ง ์์ผ๋ฏ๋ก ๊ทธ๋ฐ ๋ณ๊ฒฝ์ ํ๊ท๋ฅผ ์ผ์ผํฌ ๋งํ ์ํ์ ๊ฐ์ง๊ณ + ์์ง ์๊ธฐ ๋๋ฌธ์ด๋ค. -rc1์ด ๋ฐฐํฌ๋ ์ดํ์ git๋ฅผ ์ฌ์ฉํ์ฌ ํจ์น๋ค์ Linus์๊ฒ + ๋ณด๋ผ์ ์์ง๋ง ํจ์น๋ค์ ๊ณต์์ ์ธ ๋ฉ์ผ๋ง ๋ฆฌ์คํธ๋ก ๋ณด๋ด์ ๊ฒํ ๋ฅผ ๋ฐ์ ํ์๊ฐ + ์๋ค. - ์๋ก์ด -rc๋ Linus๊ฐ ํ์ฌ git tree๊ฐ ํ
์คํธ ํ๊ธฐ์ ์ถฉ๋ถํ ์์ ๋ ์ํ์ ์๋ค๊ณ ํ๋จ๋ ๋๋ง๋ค ๋ฐฐํฌ๋๋ค. ๋ชฉํ๋ ์๋ก์ด -rc ์ปค๋์ ๋งค์ฃผ ๋ฐฐํฌํ๋ ๊ฒ์ด๋ค. @@ -359,7 +354,7 @@ http://patchwork.ozlabs.org/ ์ ๋์ด๋์ด ์๋ค. ๋ฒ๊ทธ ๋ณด๊ณ --------- -https://bugzilla.kernel.org๋ ๋ฆฌ๋
์ค ์ปค๋ ๊ฐ๋ฐ์๋ค์ด ์ปค๋์ ๋ฒ๊ทธ๋ฅผ ์ถ์ ํ๋ +https://bugzilla.kernel.org ๋ ๋ฆฌ๋
์ค ์ปค๋ ๊ฐ๋ฐ์๋ค์ด ์ปค๋์ ๋ฒ๊ทธ๋ฅผ ์ถ์ ํ๋ ๊ณณ์ด๋ค. ์ฌ์ฉ์๋ค์ ๋ฐ๊ฒฌํ ๋ชจ๋ ๋ฒ๊ทธ๋ค์ ๋ณด๊ณ ํ๊ธฐ ์ํ์ฌ ์ด ํด์ ์ฌ์ฉํ ๊ฒ์ ๊ถ์ฅํ๋ค. kernel bugzilla๋ฅผ ์ฌ์ฉํ๋ ์์ธํ ๋ฐฉ๋ฒ์ ๋ค์์ ์ฐธ์กฐํ๋ผ. diff --git a/Documentation/translations/ko_KR/memory-barriers.txt b/Documentation/translations/ko_KR/memory-barriers.txt index d05d4c54e8f7..38310dcd6620 100644 --- a/Documentation/translations/ko_KR/memory-barriers.txt +++ b/Documentation/translations/ko_KR/memory-barriers.txt @@ -523,11 +523,11 @@ CPU ์๊ฒ ๊ธฐ๋ํ ์ ์๋ ์ต์ํ์ ๋ณด์ฅ์ฌํญ ๋ช๊ฐ์ง๊ฐ ์์ต๋ ์ฆ, ACQUIRE ๋ ์ต์ํ์ "์ทจ๋" ๋์์ฒ๋ผ, ๊ทธ๋ฆฌ๊ณ RELEASE ๋ ์ต์ํ์ "๊ณต๊ฐ" ์ฒ๋ผ ๋์ํ๋ค๋ ์๋ฏธ์
๋๋ค. -atomic_ops.txt ์์ ์ค๋ช
๋๋ ์ดํ ๋ฏน ์คํผ๋ ์ด์
๋ค ์ค์๋ ์์ ํ ์์์กํ ๊ฒ๋ค๊ณผ -(๋ฐฐ๋ฆฌ์ด๋ฅผ ์ฌ์ฉํ์ง ์๋) ์ํ๋ ์์์ ๊ฒ๋ค ์ธ์ ACQUIRE ์ RELEASE ๋ถ๋ฅ์ -๊ฒ๋ค๋ ์กด์ฌํฉ๋๋ค. ๋ก๋์ ์คํ ์ด๋ฅผ ๋ชจ๋ ์ํํ๋ ์กฐํฉ๋ ์ดํ ๋ฏน ์คํผ๋ ์ด์
์์, -ACQUIRE ๋ ํด๋น ์คํผ๋ ์ด์
์ ๋ก๋ ๋ถ๋ถ์๋ง ์ ์ฉ๋๊ณ RELEASE ๋ ํด๋น -์คํผ๋ ์ด์
์ ์คํ ์ด ๋ถ๋ถ์๋ง ์ ์ฉ๋ฉ๋๋ค. +core-api/atomic_ops.rst ์์ ์ค๋ช
๋๋ ์ดํ ๋ฏน ์คํผ๋ ์ด์
๋ค ์ค์๋ ์์ ํ +์์์กํ ๊ฒ๋ค๊ณผ (๋ฐฐ๋ฆฌ์ด๋ฅผ ์ฌ์ฉํ์ง ์๋) ์ํ๋ ์์์ ๊ฒ๋ค ์ธ์ ACQUIRE ์ +RELEASE ๋ถ๋ฅ์ ๊ฒ๋ค๋ ์กด์ฌํฉ๋๋ค. ๋ก๋์ ์คํ ์ด๋ฅผ ๋ชจ๋ ์ํํ๋ ์กฐํฉ๋ ์ดํ ๋ฏน +์คํผ๋ ์ด์
์์, ACQUIRE ๋ ํด๋น ์คํผ๋ ์ด์
์ ๋ก๋ ๋ถ๋ถ์๋ง ์ ์ฉ๋๊ณ RELEASE ๋ +ํด๋น ์คํผ๋ ์ด์
์ ์คํ ์ด ๋ถ๋ถ์๋ง ์ ์ฉ๋ฉ๋๋ค. ๋ฉ๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์ด๋ค์ ๋ CPU ๊ฐ, ๋๋ CPU ์ ๋๋ฐ์ด์ค ๊ฐ์ ์ํธ์์ฉ์ ๊ฐ๋ฅ์ฑ์ด ์์ ๋์๋ง ํ์ํฉ๋๋ค. ๋ง์ฝ ์ด๋ค ์ฝ๋์ ๊ทธ๋ฐ ์ํธ์์ฉ์ด ์์ ๊ฒ์ด ๋ณด์ฅ๋๋ค๋ฉด, ํด๋น @@ -786,7 +786,7 @@ CPU ๋ b ๋ก๋ถํฐ์ ๋ก๋ ์คํผ๋ ์ด์
์ด a ๋ก๋ถํฐ์ ๋ก๋ ์คํผ๋ ์์ ์ฝ๋๋ฅผ ์๋์ ๊ฐ์ด ๋ฐ๊ฟ๋ฒ๋ฆด ์ ์์ต๋๋ค: q = READ_ONCE(a); - WRITE_ONCE(b, 1); + WRITE_ONCE(b, 2); do_something_else(); ์ด๋ ๊ฒ ๋๋ฉด, CPU ๋ ๋ณ์ 'a' ๋ก๋ถํฐ์ ๋ก๋์ ๋ณ์ 'b' ๋ก์ ์คํ ์ด ์ฌ์ด์ ์์๋ฅผ @@ -1848,7 +1848,7 @@ Mandatory ๋ฐฐ๋ฆฌ์ด๋ค์ SMP ์์คํ
์์๋ UP ์์คํ
์์๋ SMP ํจ๊ณ ์ด ์ฝ๋๋ ๊ฐ์ฒด์ ์
๋ฐ์ดํธ๋ death ๋งํฌ๊ฐ ๋ ํผ๋ฐ์ค ์นด์ดํฐ ๊ฐ์ ๋์ *์ ์* ๋ณด์ผ ๊ฒ์ ๋ณด์ฅํฉ๋๋ค. - ๋ ๋ง์ ์ ๋ณด๋ฅผ ์ํด์ Documentation/atomic_ops.txt ๋ฌธ์๋ฅผ ์ฐธ๊ณ ํ์ธ์. + ๋ ๋ง์ ์ ๋ณด๋ฅผ ์ํด์ Documentation/core-api/atomic_ops.rst ๋ฌธ์๋ฅผ ์ฐธ๊ณ ํ์ธ์. ์ด๋์ ์ด๊ฒ๋ค์ ์ฌ์ฉํด์ผ ํ ์ง ๊ถ๊ธํ๋ค๋ฉด "์ดํ ๋ฏน ์คํผ๋ ์ด์
" ์๋ธ์น์
์ ์ฐธ๊ณ ํ์ธ์. @@ -2550,7 +2550,7 @@ CPU ์์๋ ์ฌ์ฉ๋๋ ์ดํ ๋ฏน ์ธ์คํธ๋ญ์
์์ฒด์ ๋ฉ๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์ ์๋๋ฐ, ๊ทธ๋ฐ ๊ฒฝ์ฐ์ ์ด ํน์ ๋ฉ๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์ด ๋๊ตฌ๋ค์ no-op ์ด ๋์ด ์ค์ง์ ์ผ๋ก ์๋ฌด์ผ๋ ํ์ง ์์ต๋๋ค. -๋ ๋ง์ ๋ด์ฉ์ ์ํด์ Documentation/atomic_ops.txt ๋ฅผ ์ฐธ๊ณ ํ์ธ์. +๋ ๋ง์ ๋ด์ฉ์ ์ํด์ Documentation/core-api/atomic_ops.rst ๋ฅผ ์ฐธ๊ณ ํ์ธ์. ๋๋ฐ์ด์ค ์ก์ธ์ค diff --git a/Documentation/unaligned-memory-access.txt b/Documentation/unaligned-memory-access.txt index 3f76c0c37920..51b4ff031586 100644 --- a/Documentation/unaligned-memory-access.txt +++ b/Documentation/unaligned-memory-access.txt @@ -1,6 +1,15 @@ +========================= UNALIGNED MEMORY ACCESSES ========================= +:Author: Daniel Drake <dsd@gentoo.org>, +:Author: Johannes Berg <johannes@sipsolutions.net> + +:With help from: Alan Cox, Avuton Olrich, Heikki Orsila, Jan Engelhardt, + Kyle McMartin, Kyle Moffett, Randy Dunlap, Robert Hancock, Uli Kunitz, + Vadim Lobanov + + Linux runs on a wide variety of architectures which have varying behaviour when it comes to memory access. This document presents some details about unaligned accesses, why you need to write code that doesn't cause them, @@ -73,7 +82,7 @@ memory addresses of certain variables, etc. Fortunately things are not too complex, as in most cases, the compiler ensures that things will work for you. For example, take the following -structure: +structure:: struct foo { u16 field1; @@ -106,7 +115,7 @@ On a related topic, with the above considerations in mind you may observe that you could reorder the fields in the structure in order to place fields where padding would otherwise be inserted, and hence reduce the overall resident memory size of structure instances. The optimal layout of the -above example is: +above example is:: struct foo { u32 field2; @@ -139,21 +148,21 @@ Code that causes unaligned access With the above in mind, let's move onto a real life example of a function that can cause an unaligned memory access. The following function taken from include/linux/etherdevice.h is an optimized routine to compare two -ethernet MAC addresses for equality. +ethernet MAC addresses for equality:: -bool ether_addr_equal(const u8 *addr1, const u8 *addr2) -{ -#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS + bool ether_addr_equal(const u8 *addr1, const u8 *addr2) + { + #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS u32 fold = ((*(const u32 *)addr1) ^ (*(const u32 *)addr2)) | ((*(const u16 *)(addr1 + 4)) ^ (*(const u16 *)(addr2 + 4))); return fold == 0; -#else + #else const u16 *a = (const u16 *)addr1; const u16 *b = (const u16 *)addr2; return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) == 0; -#endif -} + #endif + } In the above function, when the hardware has efficient unaligned access capability, there is no issue with this code. But when the hardware isn't @@ -171,7 +180,8 @@ as it is a decent optimization for the cases when you can ensure alignment, which is true almost all of the time in ethernet networking context. -Here is another example of some code that could cause unaligned accesses: +Here is another example of some code that could cause unaligned accesses:: + void myfunc(u8 *data, u32 value) { [...] @@ -184,6 +194,7 @@ to an address that is not evenly divisible by 4. In summary, the 2 main scenarios where you may run into unaligned access problems involve: + 1. Casting variables to types of different lengths 2. Pointer arithmetic followed by access to at least 2 bytes of data @@ -195,7 +206,7 @@ The easiest way to avoid unaligned access is to use the get_unaligned() and put_unaligned() macros provided by the <asm/unaligned.h> header file. Going back to an earlier example of code that potentially causes unaligned -access: +access:: void myfunc(u8 *data, u32 value) { @@ -204,7 +215,7 @@ access: [...] } -To avoid the unaligned memory access, you would rewrite it as follows: +To avoid the unaligned memory access, you would rewrite it as follows:: void myfunc(u8 *data, u32 value) { @@ -215,7 +226,7 @@ To avoid the unaligned memory access, you would rewrite it as follows: } The get_unaligned() macro works similarly. Assuming 'data' is a pointer to -memory and you wish to avoid unaligned access, its usage is as follows: +memory and you wish to avoid unaligned access, its usage is as follows:: u32 value = get_unaligned((u32 *) data); @@ -245,18 +256,10 @@ For some ethernet hardware that cannot DMA to unaligned addresses like 4*n+2 or non-ethernet hardware, this can be a problem, and it is then required to copy the incoming frame into an aligned buffer. Because this is unnecessary on architectures that can do unaligned accesses, the code can be -made dependent on CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS like so: - -#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS - skb = original skb -#else - skb = copy skb -#endif - --- -Authors: Daniel Drake <dsd@gentoo.org>, - Johannes Berg <johannes@sipsolutions.net> -With help from: Alan Cox, Avuton Olrich, Heikki Orsila, Jan Engelhardt, -Kyle McMartin, Kyle Moffett, Randy Dunlap, Robert Hancock, Uli Kunitz, -Vadim Lobanov +made dependent on CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS like so:: + #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS + skb = original skb + #else + skb = copy skb + #endif diff --git a/Documentation/usb/gadget-testing.txt b/Documentation/usb/gadget-testing.txt index fb0cc4df1765..fbc397d17e98 100644 --- a/Documentation/usb/gadget-testing.txt +++ b/Documentation/usb/gadget-testing.txt @@ -16,10 +16,11 @@ provided by gadgets. 13. RNDIS function 14. SERIAL function 15. SOURCESINK function -16. UAC1 function +16. UAC1 function (legacy implementation) 17. UAC2 function 18. UVC function 19. PRINTER function +20. UAC1 function (new API) 1. ACM function @@ -589,15 +590,16 @@ device: run the gadget host: test-usb (tools/usb/testusb.c) -16. UAC1 function +16. UAC1 function (legacy implementation) ================= -The function is provided by usb_f_uac1.ko module. +The function is provided by usb_f_uac1_legacy.ko module. Function-specific configfs interface ------------------------------------ -The function name to use when creating the function directory is "uac1". +The function name to use when creating the function directory +is "uac1_legacy". The uac1 function provides these attributes in its function directory: audio_buf_size - audio buffer size @@ -772,3 +774,46 @@ host: More advanced testing can be done with the prn_example described in Documentation/usb/gadget-printer.txt. + + +20. UAC1 function (virtual ALSA card, using u_audio API) +================= + +The function is provided by usb_f_uac1.ko module. +It will create a virtual ALSA card and the audio streams are simply +sinked to and sourced from it. + +Function-specific configfs interface +------------------------------------ + +The function name to use when creating the function directory is "uac1". +The uac1 function provides these attributes in its function directory: + + c_chmask - capture channel mask + c_srate - capture sampling rate + c_ssize - capture sample size (bytes) + p_chmask - playback channel mask + p_srate - playback sampling rate + p_ssize - playback sample size (bytes) + req_number - the number of pre-allocated request for both capture + and playback + +The attributes have sane default values. + +Testing the UAC1 function +------------------------- + +device: run the gadget +host: aplay -l # should list our USB Audio Gadget + +This function does not require real hardware support, it just +sends a stream of audio data to/from the host. In order to +actually hear something at the device side, a command similar +to this must be used at the device side: + +$ arecord -f dat -t wav -D hw:2,0 | aplay -D hw:0,0 & + +e.g.: + +$ arecord -f dat -t wav -D hw:CARD=UAC1Gadget,DEV=0 | \ +aplay -D default:CARD=OdroidU3 diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst index a9d01b44a659..7b2eb1b7d4ca 100644 --- a/Documentation/userspace-api/index.rst +++ b/Documentation/userspace-api/index.rst @@ -16,6 +16,8 @@ place where this information is gathered. .. toctree:: :maxdepth: 2 + no_new_privs + seccomp_filter unshare .. only:: subproject and html diff --git a/Documentation/prctl/no_new_privs.txt b/Documentation/userspace-api/no_new_privs.rst index f7be84fba910..d060ea217ea1 100644 --- a/Documentation/prctl/no_new_privs.txt +++ b/Documentation/userspace-api/no_new_privs.rst @@ -1,3 +1,7 @@ +====================== +No New Privileges Flag +====================== + The execve system call can grant a newly-started program privileges that its parent did not have. The most obvious examples are setuid/setgid programs and file capabilities. To prevent the parent program from @@ -5,53 +9,55 @@ gaining these privileges as well, the kernel and user code must be careful to prevent the parent from doing anything that could subvert the child. For example: - - The dynamic loader handles LD_* environment variables differently if + - The dynamic loader handles ``LD_*`` environment variables differently if a program is setuid. - chroot is disallowed to unprivileged processes, since it would allow - /etc/passwd to be replaced from the point of view of a process that + ``/etc/passwd`` to be replaced from the point of view of a process that inherited chroot. - The exec code has special handling for ptrace. -These are all ad-hoc fixes. The no_new_privs bit (since Linux 3.5) is a +These are all ad-hoc fixes. The ``no_new_privs`` bit (since Linux 3.5) is a new, generic mechanism to make it safe for a process to modify its execution environment in a manner that persists across execve. Any task -can set no_new_privs. Once the bit is set, it is inherited across fork, -clone, and execve and cannot be unset. With no_new_privs set, execve +can set ``no_new_privs``. Once the bit is set, it is inherited across fork, +clone, and execve and cannot be unset. With ``no_new_privs`` set, ``execve()`` promises not to grant the privilege to do anything that could not have been done without the execve call. For example, the setuid and setgid bits will no longer change the uid or gid; file capabilities will not add to the permitted set, and LSMs will not relax constraints after execve. -To set no_new_privs, use prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0). +To set ``no_new_privs``, use:: + + prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); Be careful, though: LSMs might also not tighten constraints on exec -in no_new_privs mode. (This means that setting up a general-purpose -service launcher to set no_new_privs before execing daemons may +in ``no_new_privs`` mode. (This means that setting up a general-purpose +service launcher to set ``no_new_privs`` before execing daemons may interfere with LSM-based sandboxing.) -Note that no_new_privs does not prevent privilege changes that do not -involve execve. An appropriately privileged task can still call -setuid(2) and receive SCM_RIGHTS datagrams. +Note that ``no_new_privs`` does not prevent privilege changes that do not +involve ``execve()``. An appropriately privileged task can still call +``setuid(2)`` and receive SCM_RIGHTS datagrams. -There are two main use cases for no_new_privs so far: +There are two main use cases for ``no_new_privs`` so far: - Filters installed for the seccomp mode 2 sandbox persist across execve and can change the behavior of newly-executed programs. Unprivileged users are therefore only allowed to install such filters - if no_new_privs is set. + if ``no_new_privs`` is set. - - By itself, no_new_privs can be used to reduce the attack surface + - By itself, ``no_new_privs`` can be used to reduce the attack surface available to an unprivileged user. If everything running with a - given uid has no_new_privs set, then that uid will be unable to + given uid has ``no_new_privs`` set, then that uid will be unable to escalate its privileges by directly attacking setuid, setgid, and fcap-using binaries; it will need to compromise something without the - no_new_privs bit set first. + ``no_new_privs`` bit set first. In the future, other potentially dangerous kernel features could become -available to unprivileged tasks if no_new_privs is set. In principle, -several options to unshare(2) and clone(2) would be safe when -no_new_privs is set, and no_new_privs + chroot is considerable less +available to unprivileged tasks if ``no_new_privs`` is set. In principle, +several options to ``unshare(2)`` and ``clone(2)`` would be safe when +``no_new_privs`` is set, and ``no_new_privs`` + ``chroot`` is considerable less dangerous than chroot by itself. diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/userspace-api/seccomp_filter.rst index 1e469ef75778..f71eb5ef1f2d 100644 --- a/Documentation/prctl/seccomp_filter.txt +++ b/Documentation/userspace-api/seccomp_filter.rst @@ -1,8 +1,9 @@ - SECure COMPuting with filters - ============================= +=========================================== +Seccomp BPF (SECure COMPuting with filters) +=========================================== Introduction ------------- +============ A large number of system calls are exposed to every userland process with many of them going unused for the entire lifetime of the process. @@ -27,7 +28,7 @@ pointers which constrains all filters to solely evaluating the system call arguments directly. What it isn't -------------- +============= System call filtering isn't a sandbox. It provides a clearly defined mechanism for minimizing the exposed kernel surface. It is meant to be @@ -40,13 +41,13 @@ system calls in socketcall() is allowed, for instance) which could be construed, incorrectly, as a more complete sandboxing solution. Usage ------ +===== An additional seccomp mode is added and is enabled using the same prctl(2) call as the strict seccomp. If the architecture has -CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below: +``CONFIG_HAVE_ARCH_SECCOMP_FILTER``, then filters may be added as below: -PR_SET_SECCOMP: +``PR_SET_SECCOMP``: Now takes an additional argument which specifies a new filter using a BPF program. The BPF program will be executed over struct seccomp_data @@ -55,24 +56,25 @@ PR_SET_SECCOMP: acceptable values to inform the kernel which action should be taken. - Usage: + Usage:: + prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog); The 'prog' argument is a pointer to a struct sock_fprog which will contain the filter program. If the program is invalid, the - call will return -1 and set errno to EINVAL. + call will return -1 and set errno to ``EINVAL``. - If fork/clone and execve are allowed by @prog, any child + If ``fork``/``clone`` and ``execve`` are allowed by @prog, any child processes will be constrained to the same filters and system call ABI as the parent. - Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or - run with CAP_SYS_ADMIN privileges in its namespace. If these are not - true, -EACCES will be returned. This requirement ensures that filter + Prior to use, the task must call ``prctl(PR_SET_NO_NEW_PRIVS, 1)`` or + run with ``CAP_SYS_ADMIN`` privileges in its namespace. If these are not + true, ``-EACCES`` will be returned. This requirement ensures that filter programs cannot be applied to child processes with greater privileges than the task that installed them. - Additionally, if prctl(2) is allowed by the attached filter, + Additionally, if ``prctl(2)`` is allowed by the attached filter, additional filters may be layered on which will increase evaluation time, but allow for further decreasing the attack surface during execution of a process. @@ -80,51 +82,52 @@ PR_SET_SECCOMP: The above call returns 0 on success and non-zero on error. Return values -------------- +============= + A seccomp filter may return any of the following values. If multiple filters exist, the return value for the evaluation of a given system call will always use the highest precedent value. (For example, -SECCOMP_RET_KILL will always take precedence.) +``SECCOMP_RET_KILL`` will always take precedence.) In precedence order, they are: -SECCOMP_RET_KILL: +``SECCOMP_RET_KILL``: Results in the task exiting immediately without executing the - system call. The exit status of the task (status & 0x7f) will - be SIGSYS, not SIGKILL. + system call. The exit status of the task (``status & 0x7f``) will + be ``SIGSYS``, not ``SIGKILL``. -SECCOMP_RET_TRAP: - Results in the kernel sending a SIGSYS signal to the triggering - task without executing the system call. siginfo->si_call_addr +``SECCOMP_RET_TRAP``: + Results in the kernel sending a ``SIGSYS`` signal to the triggering + task without executing the system call. ``siginfo->si_call_addr`` will show the address of the system call instruction, and - siginfo->si_syscall and siginfo->si_arch will indicate which + ``siginfo->si_syscall`` and ``siginfo->si_arch`` will indicate which syscall was attempted. The program counter will be as though the syscall happened (i.e. it will not point to the syscall instruction). The return value register will contain an arch- dependent value -- if resuming execution, set it to something sensible. (The architecture dependency is because replacing - it with -ENOSYS could overwrite some useful information.) + it with ``-ENOSYS`` could overwrite some useful information.) - The SECCOMP_RET_DATA portion of the return value will be passed - as si_errno. + The ``SECCOMP_RET_DATA`` portion of the return value will be passed + as ``si_errno``. - SIGSYS triggered by seccomp will have a si_code of SYS_SECCOMP. + ``SIGSYS`` triggered by seccomp will have a si_code of ``SYS_SECCOMP``. -SECCOMP_RET_ERRNO: +``SECCOMP_RET_ERRNO``: Results in the lower 16-bits of the return value being passed to userland as the errno without executing the system call. -SECCOMP_RET_TRACE: +``SECCOMP_RET_TRACE``: When returned, this value will cause the kernel to attempt to - notify a ptrace()-based tracer prior to executing the system - call. If there is no tracer present, -ENOSYS is returned to + notify a ``ptrace()``-based tracer prior to executing the system + call. If there is no tracer present, ``-ENOSYS`` is returned to userland and the system call is not executed. - A tracer will be notified if it requests PTRACE_O_TRACESECCOMP - using ptrace(PTRACE_SETOPTIONS). The tracer will be notified - of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of + A tracer will be notified if it requests ``PTRACE_O_TRACESECCOM``P + using ``ptrace(PTRACE_SETOPTIONS)``. The tracer will be notified + of a ``PTRACE_EVENT_SECCOMP`` and the ``SECCOMP_RET_DATA`` portion of the BPF program return value will be available to the tracer - via PTRACE_GETEVENTMSG. + via ``PTRACE_GETEVENTMSG``. The tracer can skip the system call by changing the syscall number to -1. Alternatively, the tracer can change the system call @@ -138,19 +141,19 @@ SECCOMP_RET_TRACE: allow use of ptrace, even of other sandboxed processes, without extreme care; ptracers can use this mechanism to escape.) -SECCOMP_RET_ALLOW: +``SECCOMP_RET_ALLOW``: Results in the system call being executed. If multiple filters exist, the return value for the evaluation of a given system call will always use the highest precedent value. -Precedence is only determined using the SECCOMP_RET_ACTION mask. When +Precedence is only determined using the ``SECCOMP_RET_ACTION`` mask. When multiple filters return values of the same precedence, only the -SECCOMP_RET_DATA from the most recently installed filter will be +``SECCOMP_RET_DATA`` from the most recently installed filter will be returned. Pitfalls --------- +======== The biggest pitfall to avoid during use is filtering on system call number without checking the architecture value. Why? On any @@ -160,39 +163,40 @@ the numbers in the different calling conventions overlap, then checks in the filters may be abused. Always check the arch value! Example -------- +======= -The samples/seccomp/ directory contains both an x86-specific example +The ``samples/seccomp/`` directory contains both an x86-specific example and a more generic example of a higher level macro interface for BPF program generation. Adding architecture support ------------------------ +=========================== -See arch/Kconfig for the authoritative requirements. In general, if an +See ``arch/Kconfig`` for the authoritative requirements. In general, if an architecture supports both ptrace_event and seccomp, it will be able to -support seccomp filter with minor fixup: SIGSYS support and seccomp return -value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER +support seccomp filter with minor fixup: ``SIGSYS`` support and seccomp return +value checking. Then it must just add ``CONFIG_HAVE_ARCH_SECCOMP_FILTER`` to its arch-specific Kconfig. Caveats -------- +======= The vDSO can cause some system calls to run entirely in userspace, leading to surprises when you run programs on different machines that fall back to real syscalls. To minimize these surprises on x86, make sure you test with -/sys/devices/system/clocksource/clocksource0/current_clocksource set to -something like acpi_pm. +``/sys/devices/system/clocksource/clocksource0/current_clocksource`` set to +something like ``acpi_pm``. On x86-64, vsyscall emulation is enabled by default. (vsyscalls are -legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccomp, with a few oddities: +legacy variants on vDSO calls.) Currently, emulated vsyscalls will +honor seccomp, with a few oddities: -- A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to +- A return value of ``SECCOMP_RET_TRAP`` will set a ``si_call_addr`` pointing to the vsyscall entry for the given call and not the address after the 'syscall' instruction. Any code which wants to restart the call should be aware that (a) a ret instruction has been emulated and (b) @@ -200,7 +204,7 @@ legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccom emulation security checks, making resuming the syscall mostly pointless. -- A return value of SECCOMP_RET_TRACE will signal the tracer as usual, +- A return value of ``SECCOMP_RET_TRACE`` will signal the tracer as usual, but the syscall may not be changed to another system call using the orig_rax register. It may only be changed to -1 order to skip the currently emulated call. Any other change MAY terminate the process. @@ -209,14 +213,14 @@ legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccom rip or rsp. (Do not rely on other changes terminating the process. They might work. For example, on some kernels, choosing a syscall that only exists in future kernels will be correctly emulated (by - returning -ENOSYS). + returning ``-ENOSYS``). -To detect this quirky behavior, check for addr & ~0x0C00 == -0xFFFFFFFFFF600000. (For SECCOMP_RET_TRACE, use rip. For -SECCOMP_RET_TRAP, use siginfo->si_call_addr.) Do not check any other +To detect this quirky behavior, check for ``addr & ~0x0C00 == +0xFFFFFFFFFF600000``. (For ``SECCOMP_RET_TRACE``, use rip. For +``SECCOMP_RET_TRAP``, use ``siginfo->si_call_addr``.) Do not check any other condition: future kernels may improve vsyscall emulation and current kernels in vsyscall=native mode will behave differently, but the -instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these +instructions at ``0xF...F600{0,4,8,C}00`` will not be system calls in these cases. Note that modern systems are unlikely to use vsyscalls at all -- they diff --git a/Documentation/userspace-api/unshare.rst b/Documentation/userspace-api/unshare.rst index 737c192cf4e7..877e90a35238 100644 --- a/Documentation/userspace-api/unshare.rst +++ b/Documentation/userspace-api/unshare.rst @@ -107,7 +107,7 @@ the benefits of this new feature can exceed its cost. unshare() reverses sharing that was done using clone(2) system call, so unshare() should have a similar interface as clone(2). That is, -since flags in clone(int flags, void *stack) specifies what should +since flags in clone(int flags, void \*stack) specifies what should be shared, similar flags in unshare(int flags) should specify what should be unshared. Unfortunately, this may appear to invert the meaning of the flags from the way they are used in clone(2). diff --git a/Documentation/vfio-mediated-device.txt b/Documentation/vfio-mediated-device.txt index e5e57b40f8af..1b3950346532 100644 --- a/Documentation/vfio-mediated-device.txt +++ b/Documentation/vfio-mediated-device.txt @@ -1,14 +1,17 @@ -/* - * VFIO Mediated devices - * - * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved. - * Author: Neo Jia <cjia@nvidia.com> - * Kirti Wankhede <kwankhede@nvidia.com> - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - */ +.. include:: <isonum.txt> + +===================== +VFIO Mediated devices +===================== + +:Copyright: |copy| 2016, NVIDIA CORPORATION. All rights reserved. +:Author: Neo Jia <cjia@nvidia.com> +:Author: Kirti Wankhede <kwankhede@nvidia.com> + +This program is free software; you can redistribute it and/or modify +it under the terms of the GNU General Public License version 2 as +published by the Free Software Foundation. + Virtual Function I/O (VFIO) Mediated devices[1] =============================================== @@ -42,7 +45,7 @@ removes it from a VFIO group. The following high-level block diagram shows the main components and interfaces in the VFIO mediated driver framework. The diagram shows NVIDIA, Intel, and IBM -devices as examples, as these devices are the first devices to use this module. +devices as examples, as these devices are the first devices to use this module:: +---------------+ | | @@ -91,7 +94,7 @@ Registration Interface for a Mediated Bus Driver ------------------------------------------------ The registration interface for a mediated bus driver provides the following -structure to represent a mediated device's driver: +structure to represent a mediated device's driver:: /* * struct mdev_driver [2] - Mediated device's driver @@ -110,14 +113,14 @@ structure to represent a mediated device's driver: A mediated bus driver for mdev should use this structure in the function calls to register and unregister itself with the core driver: -* Register: +* Register:: - extern int mdev_register_driver(struct mdev_driver *drv, + extern int mdev_register_driver(struct mdev_driver *drv, struct module *owner); -* Unregister: +* Unregister:: - extern void mdev_unregister_driver(struct mdev_driver *drv); + extern void mdev_unregister_driver(struct mdev_driver *drv); The mediated bus driver is responsible for adding mediated devices to the VFIO group when devices are bound to the driver and removing mediated devices from @@ -152,15 +155,15 @@ The callbacks in the mdev_parent_ops structure are as follows: * mmap: mmap emulation callback A driver should use the mdev_parent_ops structure in the function call to -register itself with the mdev core driver: +register itself with the mdev core driver:: -extern int mdev_register_device(struct device *dev, - const struct mdev_parent_ops *ops); + extern int mdev_register_device(struct device *dev, + const struct mdev_parent_ops *ops); However, the mdev_parent_ops structure is not required in the function call -that a driver should use to unregister itself with the mdev core driver: +that a driver should use to unregister itself with the mdev core driver:: -extern void mdev_unregister_device(struct device *dev); + extern void mdev_unregister_device(struct device *dev); Mediated Device Management Interface Through sysfs @@ -183,30 +186,32 @@ with the mdev core driver. Directories and files under the sysfs for Each Physical Device -------------------------------------------------------------- -|- [parent physical device] -|--- Vendor-specific-attributes [optional] -|--- [mdev_supported_types] -| |--- [<type-id>] -| | |--- create -| | |--- name -| | |--- available_instances -| | |--- device_api -| | |--- description -| | |--- [devices] -| |--- [<type-id>] -| | |--- create -| | |--- name -| | |--- available_instances -| | |--- device_api -| | |--- description -| | |--- [devices] -| |--- [<type-id>] -| |--- create -| |--- name -| |--- available_instances -| |--- device_api -| |--- description -| |--- [devices] +:: + + |- [parent physical device] + |--- Vendor-specific-attributes [optional] + |--- [mdev_supported_types] + | |--- [<type-id>] + | | |--- create + | | |--- name + | | |--- available_instances + | | |--- device_api + | | |--- description + | | |--- [devices] + | |--- [<type-id>] + | | |--- create + | | |--- name + | | |--- available_instances + | | |--- device_api + | | |--- description + | | |--- [devices] + | |--- [<type-id>] + | |--- create + | |--- name + | |--- available_instances + | |--- device_api + | |--- description + | |--- [devices] * [mdev_supported_types] @@ -219,12 +224,12 @@ Directories and files under the sysfs for Each Physical Device The [<type-id>] name is created by adding the device driver string as a prefix to the string provided by the vendor driver. This format of this name is as - follows: + follows:: sprintf(buf, "%s-%s", dev_driver_string(parent->dev), group->name); (or using mdev_parent_dev(mdev) to arrive at the parent device outside - of the core mdev code) + of the core mdev code) * device_api @@ -239,7 +244,7 @@ Directories and files under the sysfs for Each Physical Device * [device] This directory contains links to the devices of type <type-id> that have been -created. + created. * name @@ -253,21 +258,25 @@ created. Directories and Files Under the sysfs for Each mdev Device ---------------------------------------------------------- -|- [parent phy device] -|--- [$MDEV_UUID] +:: + + |- [parent phy device] + |--- [$MDEV_UUID] |--- remove |--- mdev_type {link to its type} |--- vendor-specific-attributes [optional] * remove (write only) + Writing '1' to the 'remove' file destroys the mdev device. The vendor driver can fail the remove() callback if that device is active and the vendor driver doesn't support hot unplug. -Example: +Example:: + # echo 1 > /sys/bus/mdev/devices/$mdev_UUID/remove -Mediated device Hot plug: +Mediated device Hot plug ------------------------ Mediated devices can be created and assigned at runtime. The procedure to hot @@ -277,13 +286,13 @@ Translation APIs for Mediated Devices ===================================== The following APIs are provided for translating user pfn to host pfn in a VFIO -driver: +driver:: -extern int vfio_pin_pages(struct device *dev, unsigned long *user_pfn, - int npage, int prot, unsigned long *phys_pfn); + extern int vfio_pin_pages(struct device *dev, unsigned long *user_pfn, + int npage, int prot, unsigned long *phys_pfn); -extern int vfio_unpin_pages(struct device *dev, unsigned long *user_pfn, - int npage); + extern int vfio_unpin_pages(struct device *dev, unsigned long *user_pfn, + int npage); These functions call back into the back-end IOMMU module by using the pin_pages and unpin_pages callbacks of the struct vfio_iommu_driver_ops[4]. Currently @@ -304,81 +313,80 @@ card. This step creates a dummy device, /sys/devices/virtual/mtty/mtty/ - Files in this device directory in sysfs are similar to the following: - - # tree /sys/devices/virtual/mtty/mtty/ - /sys/devices/virtual/mtty/mtty/ - |-- mdev_supported_types - | |-- mtty-1 - | | |-- available_instances - | | |-- create - | | |-- device_api - | | |-- devices - | | `-- name - | `-- mtty-2 - | |-- available_instances - | |-- create - | |-- device_api - | |-- devices - | `-- name - |-- mtty_dev - | `-- sample_mtty_dev - |-- power - | |-- autosuspend_delay_ms - | |-- control - | |-- runtime_active_time - | |-- runtime_status - | `-- runtime_suspended_time - |-- subsystem -> ../../../../class/mtty - `-- uevent + Files in this device directory in sysfs are similar to the following:: + + # tree /sys/devices/virtual/mtty/mtty/ + /sys/devices/virtual/mtty/mtty/ + |-- mdev_supported_types + | |-- mtty-1 + | | |-- available_instances + | | |-- create + | | |-- device_api + | | |-- devices + | | `-- name + | `-- mtty-2 + | |-- available_instances + | |-- create + | |-- device_api + | |-- devices + | `-- name + |-- mtty_dev + | `-- sample_mtty_dev + |-- power + | |-- autosuspend_delay_ms + | |-- control + | |-- runtime_active_time + | |-- runtime_status + | `-- runtime_suspended_time + |-- subsystem -> ../../../../class/mtty + `-- uevent 2. Create a mediated device by using the dummy device that you created in the - previous step. + previous step:: - # echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" > \ + # echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" > \ /sys/devices/virtual/mtty/mtty/mdev_supported_types/mtty-2/create -3. Add parameters to qemu-kvm. +3. Add parameters to qemu-kvm:: - -device vfio-pci,\ - sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 + -device vfio-pci,\ + sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 4. Boot the VM. In the Linux guest VM, with no hardware on the host, the device appears - as follows: - - # lspci -s 00:05.0 -xxvv - 00:05.0 Serial controller: Device 4348:3253 (rev 10) (prog-if 02 [16550]) - Subsystem: Device 4348:3253 - Physical Slot: 5 - Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- - Stepping- SERR- FastB2B- DisINTx- - Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- - <TAbort- <MAbort- >SERR- <PERR- INTx- - Interrupt: pin A routed to IRQ 10 - Region 0: I/O ports at c150 [size=8] - Region 1: I/O ports at c158 [size=8] - Kernel driver in use: serial - 00: 48 43 53 32 01 00 00 02 10 02 00 07 00 00 00 00 - 10: 51 c1 00 00 59 c1 00 00 00 00 00 00 00 00 00 00 - 20: 00 00 00 00 00 00 00 00 00 00 00 00 48 43 53 32 - 30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 01 00 00 - - In the Linux guest VM, dmesg output for the device is as follows: - - serial 0000:00:05.0: PCI INT A -> Link[LNKA] -> GSI 10 (level, high) -> IRQ -10 - 0000:00:05.0: ttyS1 at I/O 0xc150 (irq = 10) is a 16550A - 0000:00:05.0: ttyS2 at I/O 0xc158 (irq = 10) is a 16550A - - -5. In the Linux guest VM, check the serial ports. - - # setserial -g /dev/ttyS* - /dev/ttyS0, UART: 16550A, Port: 0x03f8, IRQ: 4 - /dev/ttyS1, UART: 16550A, Port: 0xc150, IRQ: 10 - /dev/ttyS2, UART: 16550A, Port: 0xc158, IRQ: 10 + as follows:: + + # lspci -s 00:05.0 -xxvv + 00:05.0 Serial controller: Device 4348:3253 (rev 10) (prog-if 02 [16550]) + Subsystem: Device 4348:3253 + Physical Slot: 5 + Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- + Stepping- SERR- FastB2B- DisINTx- + Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- + <TAbort- <MAbort- >SERR- <PERR- INTx- + Interrupt: pin A routed to IRQ 10 + Region 0: I/O ports at c150 [size=8] + Region 1: I/O ports at c158 [size=8] + Kernel driver in use: serial + 00: 48 43 53 32 01 00 00 02 10 02 00 07 00 00 00 00 + 10: 51 c1 00 00 59 c1 00 00 00 00 00 00 00 00 00 00 + 20: 00 00 00 00 00 00 00 00 00 00 00 00 48 43 53 32 + 30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 01 00 00 + + In the Linux guest VM, dmesg output for the device is as follows: + + serial 0000:00:05.0: PCI INT A -> Link[LNKA] -> GSI 10 (level, high) -> IRQ 10 + 0000:00:05.0: ttyS1 at I/O 0xc150 (irq = 10) is a 16550A + 0000:00:05.0: ttyS2 at I/O 0xc158 (irq = 10) is a 16550A + + +5. In the Linux guest VM, check the serial ports:: + + # setserial -g /dev/ttyS* + /dev/ttyS0, UART: 16550A, Port: 0x03f8, IRQ: 4 + /dev/ttyS1, UART: 16550A, Port: 0xc150, IRQ: 10 + /dev/ttyS2, UART: 16550A, Port: 0xc158, IRQ: 10 6. Using minicom or any terminal emulation program, open port /dev/ttyS1 or /dev/ttyS2 with hardware flow control disabled. @@ -388,14 +396,14 @@ card. Data is loop backed from hosts mtty driver. -8. Destroy the mediated device that you created. +8. Destroy the mediated device that you created:: - # echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001/remove + # echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001/remove References ========== -[1] See Documentation/vfio.txt for more information on VFIO. -[2] struct mdev_driver in include/linux/mdev.h -[3] struct mdev_parent_ops in include/linux/mdev.h -[4] struct vfio_iommu_driver_ops in include/linux/vfio.h +1. See Documentation/vfio.txt for more information on VFIO. +2. struct mdev_driver in include/linux/mdev.h +3. struct mdev_parent_ops in include/linux/mdev.h +4. struct vfio_iommu_driver_ops in include/linux/vfio.h diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index 1dd3fddfd3a1..ef6a5111eaa1 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -1,5 +1,7 @@ -VFIO - "Virtual Function I/O"[1] -------------------------------------------------------------------------------- +================================== +VFIO - "Virtual Function I/O" [1]_ +================================== + Many modern system now provide DMA and interrupt remapping facilities to help ensure I/O devices behave within the boundaries they've been allotted. This includes x86 hardware with AMD-Vi and Intel VT-d, @@ -7,14 +9,14 @@ POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC systems such as Freescale PAMU. The VFIO driver is an IOMMU/device agnostic framework for exposing direct device access to userspace, in a secure, IOMMU protected environment. In other words, this allows -safe[2], non-privileged, userspace drivers. +safe [2]_, non-privileged, userspace drivers. Why do we want that? Virtual machines often make use of direct device access ("device assignment") when configured for the highest possible I/O performance. From a device and host perspective, this simply turns the VM into a userspace driver, with the benefits of significantly reduced latency, higher bandwidth, and direct use of -bare-metal device drivers[3]. +bare-metal device drivers [3]_. Some applications, particularly in the high performance computing field, also benefit from low-overhead, direct device access from @@ -31,7 +33,7 @@ KVM PCI specific device assignment code as well as provide a more secure, more featureful userspace driver environment than UIO. Groups, Devices, and IOMMUs -------------------------------------------------------------------------------- +--------------------------- Devices are the main target of any I/O driver. Devices typically create a programming interface made up of I/O access, interrupts, @@ -114,40 +116,40 @@ well as mechanisms for describing and registering interrupt notifications. VFIO Usage Example -------------------------------------------------------------------------------- +------------------ -Assume user wants to access PCI device 0000:06:0d.0 +Assume user wants to access PCI device 0000:06:0d.0:: -$ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group -../../../../kernel/iommu_groups/26 + $ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group + ../../../../kernel/iommu_groups/26 This device is therefore in IOMMU group 26. This device is on the pci bus, therefore the user will make use of vfio-pci to manage the -group: +group:: -# modprobe vfio-pci + # modprobe vfio-pci Binding this device to the vfio-pci driver creates the VFIO group -character devices for this group: +character devices for this group:: -$ lspci -n -s 0000:06:0d.0 -06:0d.0 0401: 1102:0002 (rev 08) -# echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind -# echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id + $ lspci -n -s 0000:06:0d.0 + 06:0d.0 0401: 1102:0002 (rev 08) + # echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind + # echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id Now we need to look at what other devices are in the group to free -it for use by VFIO: - -$ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices -total 0 -lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 -> - ../../../../devices/pci0000:00/0000:00:1e.0 -lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 -> - ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0 -lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 -> - ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1 - -This device is behind a PCIe-to-PCI bridge[4], therefore we also +it for use by VFIO:: + + $ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices + total 0 + lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 -> + ../../../../devices/pci0000:00/0000:00:1e.0 + lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 -> + ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0 + lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 -> + ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1 + +This device is behind a PCIe-to-PCI bridge [4]_, therefore we also need to add device 0000:06:0d.1 to the group following the same procedure as above. Device 0000:00:1e.0 is a bridge that does not currently have a host driver, therefore it's not required to @@ -157,12 +159,12 @@ support PCI bridges). The final step is to provide the user with access to the group if unprivileged operation is desired (note that /dev/vfio/vfio provides no capabilities on its own and is therefore expected to be set to -mode 0666 by the system). +mode 0666 by the system):: -# chown user:user /dev/vfio/26 + # chown user:user /dev/vfio/26 The user now has full access to all the devices and the iommu for this -group and can access them as follows: +group and can access them as follows:: int container, group, device, i; struct vfio_group_status group_status = @@ -248,31 +250,31 @@ VFIO bus driver API VFIO bus drivers, such as vfio-pci make use of only a few interfaces into VFIO core. When devices are bound and unbound to the driver, the driver should call vfio_add_group_dev() and vfio_del_group_dev() -respectively: +respectively:: -extern int vfio_add_group_dev(struct iommu_group *iommu_group, - struct device *dev, - const struct vfio_device_ops *ops, - void *device_data); + extern int vfio_add_group_dev(struct iommu_group *iommu_group, + struct device *dev, + const struct vfio_device_ops *ops, + void *device_data); -extern void *vfio_del_group_dev(struct device *dev); + extern void *vfio_del_group_dev(struct device *dev); vfio_add_group_dev() indicates to the core to begin tracking the specified iommu_group and register the specified dev as owned by a VFIO bus driver. The driver provides an ops structure for callbacks -similar to a file operations structure: - -struct vfio_device_ops { - int (*open)(void *device_data); - void (*release)(void *device_data); - ssize_t (*read)(void *device_data, char __user *buf, - size_t count, loff_t *ppos); - ssize_t (*write)(void *device_data, const char __user *buf, - size_t size, loff_t *ppos); - long (*ioctl)(void *device_data, unsigned int cmd, - unsigned long arg); - int (*mmap)(void *device_data, struct vm_area_struct *vma); -}; +similar to a file operations structure:: + + struct vfio_device_ops { + int (*open)(void *device_data); + void (*release)(void *device_data); + ssize_t (*read)(void *device_data, char __user *buf, + size_t count, loff_t *ppos); + ssize_t (*write)(void *device_data, const char __user *buf, + size_t size, loff_t *ppos); + long (*ioctl)(void *device_data, unsigned int cmd, + unsigned long arg); + int (*mmap)(void *device_data, struct vm_area_struct *vma); + }; Each function is passed the device_data that was originally registered in the vfio_add_group_dev() call above. This allows the bus driver @@ -285,50 +287,55 @@ own VFIO_DEVICE_GET_REGION_INFO ioctl. PPC64 sPAPR implementation note -------------------------------------------------------------------------------- +------------------------------- This implementation has some specifics: 1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per -container is supported as an IOMMU table is allocated at the boot time, -one table per a IOMMU group which is a Partitionable Endpoint (PE) -(PE is often a PCI domain but not always). -Newer systems (POWER8 with IODA2) have improved hardware design which allows -to remove this limitation and have multiple IOMMU groups per a VFIO container. + container is supported as an IOMMU table is allocated at the boot time, + one table per a IOMMU group which is a Partitionable Endpoint (PE) + (PE is often a PCI domain but not always). + + Newer systems (POWER8 with IODA2) have improved hardware design which allows + to remove this limitation and have multiple IOMMU groups per a VFIO + container. 2) The hardware supports so called DMA windows - the PCI address range -within which DMA transfer is allowed, any attempt to access address space -out of the window leads to the whole PE isolation. + within which DMA transfer is allowed, any attempt to access address space + out of the window leads to the whole PE isolation. 3) PPC64 guests are paravirtualized but not fully emulated. There is an API -to map/unmap pages for DMA, and it normally maps 1..32 pages per call and -currently there is no way to reduce the number of calls. In order to make things -faster, the map/unmap handling has been implemented in real mode which provides -an excellent performance which has limitations such as inability to do -locked pages accounting in real time. + to map/unmap pages for DMA, and it normally maps 1..32 pages per call and + currently there is no way to reduce the number of calls. In order to make + things faster, the map/unmap handling has been implemented in real mode + which provides an excellent performance which has limitations such as + inability to do locked pages accounting in real time. 4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O -subtree that can be treated as a unit for the purposes of partitioning and -error recovery. A PE may be a single or multi-function IOA (IO Adapter), a -function of a multi-function IOA, or multiple IOAs (possibly including switch -and bridge structures above the multiple IOAs). PPC64 guests detect PCI errors -and recover from them via EEH RTAS services, which works on the basis of -additional ioctl commands. + subtree that can be treated as a unit for the purposes of partitioning and + error recovery. A PE may be a single or multi-function IOA (IO Adapter), a + function of a multi-function IOA, or multiple IOAs (possibly including + switch and bridge structures above the multiple IOAs). PPC64 guests detect + PCI errors and recover from them via EEH RTAS services, which works on the + basis of additional ioctl commands. -So 4 additional ioctls have been added: + So 4 additional ioctls have been added: - VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start - of the DMA window on the PCI bus. + VFIO_IOMMU_SPAPR_TCE_GET_INFO + returns the size and the start of the DMA window on the PCI bus. - VFIO_IOMMU_ENABLE - enables the container. The locked pages accounting + VFIO_IOMMU_ENABLE + enables the container. The locked pages accounting is done at this point. This lets user first to know what the DMA window is and adjust rlimit before doing any real job. - VFIO_IOMMU_DISABLE - disables the container. + VFIO_IOMMU_DISABLE + disables the container. - VFIO_EEH_PE_OP - provides an API for EEH setup, error detection and recovery. + VFIO_EEH_PE_OP + provides an API for EEH setup, error detection and recovery. -The code flow from the example above should be slightly changed: + The code flow from the example above should be slightly changed:: struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), .flags = 0 }; @@ -442,73 +449,73 @@ The code flow from the example above should be slightly changed: .... 5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/ -VFIO_IOMMU_DISABLE and implements 2 new ioctls: -VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -(which are unsupported in v1 IOMMU). + VFIO_IOMMU_DISABLE and implements 2 new ioctls: + VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY + (which are unsupported in v1 IOMMU). -PPC64 paravirtualized guests generate a lot of map/unmap requests, -and the handling of those includes pinning/unpinning pages and updating -mm::locked_vm counter to make sure we do not exceed the rlimit. -The v2 IOMMU splits accounting and pinning into separate operations: + PPC64 paravirtualized guests generate a lot of map/unmap requests, + and the handling of those includes pinning/unpinning pages and updating + mm::locked_vm counter to make sure we do not exceed the rlimit. + The v2 IOMMU splits accounting and pinning into separate operations: -- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls -receive a user space address and size of the block to be pinned. -Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to -be called with the exact address and size used for registering -the memory block. The userspace is not expected to call these often. -The ranges are stored in a linked list in a VFIO container. + - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls + receive a user space address and size of the block to be pinned. + Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to + be called with the exact address and size used for registering + the memory block. The userspace is not expected to call these often. + The ranges are stored in a linked list in a VFIO container. -- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual -IOMMU table and do not do pinning; instead these check that the userspace -address is from pre-registered range. + - VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual + IOMMU table and do not do pinning; instead these check that the userspace + address is from pre-registered range. -This separation helps in optimizing DMA for guests. + This separation helps in optimizing DMA for guests. 6) sPAPR specification allows guests to have an additional DMA window(s) on -a PCI bus with a variable page size. Two ioctls have been added to support -this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE. -The platform has to support the functionality or error will be returned to -the userspace. The existing hardware supports up to 2 DMA windows, one is -2GB long, uses 4K pages and called "default 32bit window"; the other can -be as big as entire RAM, use different page size, it is optional - guests -create those in run-time if the guest driver supports 64bit DMA. - -VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and -a number of TCE table levels (if a TCE table is going to be big enough and -the kernel may not be able to allocate enough of physically contiguous memory). -It creates a new window in the available slot and returns the bus address where -the new window starts. Due to hardware limitation, the user space cannot choose -the location of DMA windows. - -VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window -and removes it. + a PCI bus with a variable page size. Two ioctls have been added to support + this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE. + The platform has to support the functionality or error will be returned to + the userspace. The existing hardware supports up to 2 DMA windows, one is + 2GB long, uses 4K pages and called "default 32bit window"; the other can + be as big as entire RAM, use different page size, it is optional - guests + create those in run-time if the guest driver supports 64bit DMA. + + VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and + a number of TCE table levels (if a TCE table is going to be big enough and + the kernel may not be able to allocate enough of physically contiguous + memory). It creates a new window in the available slot and returns the bus + address where the new window starts. Due to hardware limitation, the user + space cannot choose the location of DMA windows. + + VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window + and removes it. ------------------------------------------------------------------------------- -[1] VFIO was originally an acronym for "Virtual Function I/O" in its -initial implementation by Tom Lyon while as Cisco. We've since -outgrown the acronym, but it's catchy. - -[2] "safe" also depends upon a device being "well behaved". It's -possible for multi-function devices to have backdoors between -functions and even for single function devices to have alternative -access to things like PCI config space through MMIO registers. To -guard against the former we can include additional precautions in the -IOMMU driver to group multi-function PCI devices together -(iommu=group_mf). The latter we can't prevent, but the IOMMU should -still provide isolation. For PCI, SR-IOV Virtual Functions are the -best indicator of "well behaved", as these are designed for -virtualization usage models. - -[3] As always there are trade-offs to virtual machine device -assignment that are beyond the scope of VFIO. It's expected that -future IOMMU technologies will reduce some, but maybe not all, of -these trade-offs. - -[4] In this case the device is below a PCI bridge, so transactions -from either function of the device are indistinguishable to the iommu: - --[0000:00]-+-1e.0-[06]--+-0d.0 - \-0d.1 - -00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90) +.. [1] VFIO was originally an acronym for "Virtual Function I/O" in its + initial implementation by Tom Lyon while as Cisco. We've since + outgrown the acronym, but it's catchy. + +.. [2] "safe" also depends upon a device being "well behaved". It's + possible for multi-function devices to have backdoors between + functions and even for single function devices to have alternative + access to things like PCI config space through MMIO registers. To + guard against the former we can include additional precautions in the + IOMMU driver to group multi-function PCI devices together + (iommu=group_mf). The latter we can't prevent, but the IOMMU should + still provide isolation. For PCI, SR-IOV Virtual Functions are the + best indicator of "well behaved", as these are designed for + virtualization usage models. + +.. [3] As always there are trade-offs to virtual machine device + assignment that are beyond the scope of VFIO. It's expected that + future IOMMU technologies will reduce some, but maybe not all, of + these trade-offs. + +.. [4] In this case the device is below a PCI bridge, so transactions + from either function of the device are indistinguishable to the iommu:: + + -[0000:00]-+-1e.0-[06]--+-0d.0 + \-0d.1 + + 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 4029943887a3..e63a35fafef0 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -3255,6 +3255,141 @@ Otherwise, if the MCE is a corrected error, KVM will just store it in the corresponding bank (provided this bank is not holding a previously reported uncorrected error). +4.107 KVM_S390_GET_CMMA_BITS + +Capability: KVM_CAP_S390_CMMA_MIGRATION +Architectures: s390 +Type: vm ioctl +Parameters: struct kvm_s390_cmma_log (in, out) +Returns: 0 on success, a negative value on error + +This ioctl is used to get the values of the CMMA bits on the s390 +architecture. It is meant to be used in two scenarios: +- During live migration to save the CMMA values. Live migration needs + to be enabled via the KVM_REQ_START_MIGRATION VM property. +- To non-destructively peek at the CMMA values, with the flag + KVM_S390_CMMA_PEEK set. + +The ioctl takes parameters via the kvm_s390_cmma_log struct. The desired +values are written to a buffer whose location is indicated via the "values" +member in the kvm_s390_cmma_log struct. The values in the input struct are +also updated as needed. +Each CMMA value takes up one byte. + +struct kvm_s390_cmma_log { + __u64 start_gfn; + __u32 count; + __u32 flags; + union { + __u64 remaining; + __u64 mask; + }; + __u64 values; +}; + +start_gfn is the number of the first guest frame whose CMMA values are +to be retrieved, + +count is the length of the buffer in bytes, + +values points to the buffer where the result will be written to. + +If count is greater than KVM_S390_SKEYS_MAX, then it is considered to be +KVM_S390_SKEYS_MAX. KVM_S390_SKEYS_MAX is re-used for consistency with +other ioctls. + +The result is written in the buffer pointed to by the field values, and +the values of the input parameter are updated as follows. + +Depending on the flags, different actions are performed. The only +supported flag so far is KVM_S390_CMMA_PEEK. + +The default behaviour if KVM_S390_CMMA_PEEK is not set is: +start_gfn will indicate the first page frame whose CMMA bits were dirty. +It is not necessarily the same as the one passed as input, as clean pages +are skipped. + +count will indicate the number of bytes actually written in the buffer. +It can (and very often will) be smaller than the input value, since the +buffer is only filled until 16 bytes of clean values are found (which +are then not copied in the buffer). Since a CMMA migration block needs +the base address and the length, for a total of 16 bytes, we will send +back some clean data if there is some dirty data afterwards, as long as +the size of the clean data does not exceed the size of the header. This +allows to minimize the amount of data to be saved or transferred over +the network at the expense of more roundtrips to userspace. The next +invocation of the ioctl will skip over all the clean values, saving +potentially more than just the 16 bytes we found. + +If KVM_S390_CMMA_PEEK is set: +the existing storage attributes are read even when not in migration +mode, and no other action is performed; + +the output start_gfn will be equal to the input start_gfn, + +the output count will be equal to the input count, except if the end of +memory has been reached. + +In both cases: +the field "remaining" will indicate the total number of dirty CMMA values +still remaining, or 0 if KVM_S390_CMMA_PEEK is set and migration mode is +not enabled. + +mask is unused. + +values points to the userspace buffer where the result will be stored. + +This ioctl can fail with -ENOMEM if not enough memory can be allocated to +complete the task, with -ENXIO if CMMA is not enabled, with -EINVAL if +KVM_S390_CMMA_PEEK is not set but migration mode was not enabled, with +-EFAULT if the userspace address is invalid or if no page table is +present for the addresses (e.g. when using hugepages). + +4.108 KVM_S390_SET_CMMA_BITS + +Capability: KVM_CAP_S390_CMMA_MIGRATION +Architectures: s390 +Type: vm ioctl +Parameters: struct kvm_s390_cmma_log (in) +Returns: 0 on success, a negative value on error + +This ioctl is used to set the values of the CMMA bits on the s390 +architecture. It is meant to be used during live migration to restore +the CMMA values, but there are no restrictions on its use. +The ioctl takes parameters via the kvm_s390_cmma_values struct. +Each CMMA value takes up one byte. + +struct kvm_s390_cmma_log { + __u64 start_gfn; + __u32 count; + __u32 flags; + union { + __u64 remaining; + __u64 mask; + }; + __u64 values; +}; + +start_gfn indicates the starting guest frame number, + +count indicates how many values are to be considered in the buffer, + +flags is not used and must be 0. + +mask indicates which PGSTE bits are to be considered. + +remaining is not used. + +values points to the buffer in userspace where to store the values. + +This ioctl can fail with -ENOMEM if not enough memory can be allocated to +complete the task, with -ENXIO if CMMA is not enabled, with -EINVAL if +the count field is too large (e.g. more than KVM_S390_CMMA_SIZE_MAX) or +if the flags field was not 0, with -EFAULT if the userspace address is +invalid, if invalid pages are written to (e.g. after the end of memory) +or if no page table is present for the addresses (e.g. when using +hugepages). + 5. The kvm_run structure ------------------------ @@ -3996,6 +4131,34 @@ Parameters: none Allow use of adapter-interruption suppression. Returns: 0 on success; -EBUSY if a VCPU has already been created. +7.11 KVM_CAP_PPC_SMT + +Architectures: ppc +Parameters: vsmt_mode, flags + +Enabling this capability on a VM provides userspace with a way to set +the desired virtual SMT mode (i.e. the number of virtual CPUs per +virtual core). The virtual SMT mode, vsmt_mode, must be a power of 2 +between 1 and 8. On POWER8, vsmt_mode must also be no greater than +the number of threads per subcore for the host. Currently flags must +be 0. A successful call to enable this capability will result in +vsmt_mode being returned when the KVM_CAP_PPC_SMT capability is +subsequently queried for the VM. This capability is only supported by +HV KVM, and can only be set before any VCPUs have been created. +The KVM_CAP_PPC_SMT_POSSIBLE capability indicates which virtual SMT +modes are available. + +7.12 KVM_CAP_PPC_FWNMI + +Architectures: ppc +Parameters: none + +With this capability a machine check exception in the guest address +space will cause KVM to exit the guest with NMI exit reason. This +enables QEMU to build error log and branch to guest kernel registered +machine check handling routine. Without this capability KVM will +branch to guests' 0x200 interrupt vector. + 8. Other capabilities. ---------------------- @@ -4157,3 +4320,30 @@ Currently the following bits are defined for the device_irq_level bitmap: Future versions of kvm may implement additional events. These will get indicated by returning a higher number from KVM_CHECK_EXTENSION and will be listed above. + +8.10 KVM_CAP_PPC_SMT_POSSIBLE + +Architectures: ppc + +Querying this capability returns a bitmap indicating the possible +virtual SMT modes that can be set using KVM_CAP_PPC_SMT. If bit N +(counting from the right) is set, then a virtual SMT mode of 2^N is +available. + +8.11 KVM_CAP_HYPERV_SYNIC2 + +Architectures: x86 + +This capability enables a newer version of Hyper-V Synthetic interrupt +controller (SynIC). The only difference with KVM_CAP_HYPERV_SYNIC is that KVM +doesn't clear SynIC message and event flags pages when they are enabled by +writing to the respective MSRs. + +8.12 KVM_CAP_HYPERV_VP_INDEX + +Architectures: x86 + +This capability indicates that userspace can load HV_X64_MSR_VP_INDEX msr. Its +value is used to denote the target vcpu for a SynIC interrupt. For +compatibilty, KVM initializes this msr to KVM's internal vcpu index. When this +capability is absent, userspace can still query this msr's value. diff --git a/Documentation/virtual/kvm/devices/s390_flic.txt b/Documentation/virtual/kvm/devices/s390_flic.txt index c2518cea8ab4..2f1cbf1301d2 100644 --- a/Documentation/virtual/kvm/devices/s390_flic.txt +++ b/Documentation/virtual/kvm/devices/s390_flic.txt @@ -16,6 +16,7 @@ FLIC provides support to - register and modify adapter interrupt sources (KVM_DEV_FLIC_ADAPTER_*) - modify AIS (adapter-interruption-suppression) mode state (KVM_DEV_FLIC_AISM) - inject adapter interrupts on a specified adapter (KVM_DEV_FLIC_AIRQ_INJECT) +- get/set all AIS mode states (KVM_DEV_FLIC_AISM_ALL) Groups: KVM_DEV_FLIC_ENQUEUE @@ -136,6 +137,20 @@ struct kvm_s390_ais_req { an isc according to the adapter-interruption-suppression mode on condition that the AIS capability is enabled. + KVM_DEV_FLIC_AISM_ALL + Gets or sets the adapter-interruption-suppression mode for all ISCs. Takes + a kvm_s390_ais_all describing: + +struct kvm_s390_ais_all { + __u8 simm; /* Single-Interruption-Mode mask */ + __u8 nimm; /* No-Interruption-Mode mask * +}; + + simm contains Single-Interruption-Mode mask for all ISCs, nimm contains + No-Interruption-Mode mask for all ISCs. Each bit in simm and nimm corresponds + to an ISC (MSB0 bit 0 to ISC 0 and so on). The combination of simm bit and + nimm bit presents AIS mode for a ISC. + Note: The KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR device ioctls executed on FLIC with an unknown group or attribute gives the error code EINVAL (instead of ENXIO, as specified in the API documentation). It is not possible to conclude diff --git a/Documentation/virtual/kvm/devices/vcpu.txt b/Documentation/virtual/kvm/devices/vcpu.txt index 02f50686c418..2b5dab16c4f2 100644 --- a/Documentation/virtual/kvm/devices/vcpu.txt +++ b/Documentation/virtual/kvm/devices/vcpu.txt @@ -16,7 +16,9 @@ Parameters: in kvm_device_attr.addr the address for PMU overflow interrupt is a Returns: -EBUSY: The PMU overflow interrupt is already set -ENXIO: The overflow interrupt not set when attempting to get it -ENODEV: PMUv3 not supported - -EINVAL: Invalid PMU overflow interrupt number supplied + -EINVAL: Invalid PMU overflow interrupt number supplied or + trying to set the IRQ number without using an in-kernel + irqchip. A value describing the PMUv3 (Performance Monitor Unit v3) overflow interrupt number for this vcpu. This interrupt could be a PPI or SPI, but the interrupt @@ -25,11 +27,36 @@ all vcpus, while as an SPI it must be a separate number per vcpu. 1.2 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_INIT Parameters: no additional parameter in kvm_device_attr.addr -Returns: -ENODEV: PMUv3 not supported - -ENXIO: PMUv3 not properly configured as required prior to calling this - attribute +Returns: -ENODEV: PMUv3 not supported or GIC not initialized + -ENXIO: PMUv3 not properly configured or in-kernel irqchip not + configured as required prior to calling this attribute -EBUSY: PMUv3 already initialized -Request the initialization of the PMUv3. This must be done after creating the -in-kernel irqchip. Creating a PMU with a userspace irqchip is currently not -supported. +Request the initialization of the PMUv3. If using the PMUv3 with an in-kernel +virtual GIC implementation, this must be done after initializing the in-kernel +irqchip. + + +2. GROUP: KVM_ARM_VCPU_TIMER_CTRL +Architectures: ARM,ARM64 + +2.1. ATTRIBUTE: KVM_ARM_VCPU_TIMER_IRQ_VTIMER +2.2. ATTRIBUTE: KVM_ARM_VCPU_TIMER_IRQ_PTIMER +Parameters: in kvm_device_attr.addr the address for the timer interrupt is a + pointer to an int +Returns: -EINVAL: Invalid timer interrupt number + -EBUSY: One or more VCPUs has already run + +A value describing the architected timer interrupt number when connected to an +in-kernel virtual GIC. These must be a PPI (16 <= intid < 32). Setting the +attribute overrides the default values (see below). + +KVM_ARM_VCPU_TIMER_IRQ_VTIMER: The EL1 virtual timer intid (default: 27) +KVM_ARM_VCPU_TIMER_IRQ_PTIMER: The EL1 physical timer intid (default: 30) + +Setting the same PPI for different timers will prevent the VCPUs from running. +Setting the interrupt number on a VCPU configures all VCPUs created at that +time to use the number provided for a given timer, overwriting any previously +configured values on other VCPUs. Userspace should configure the interrupt +numbers on at least one VCPU after creating all VCPUs and before running any +VCPUs. diff --git a/Documentation/virtual/kvm/devices/vm.txt b/Documentation/virtual/kvm/devices/vm.txt index 575ccb022aac..903fc926860b 100644 --- a/Documentation/virtual/kvm/devices/vm.txt +++ b/Documentation/virtual/kvm/devices/vm.txt @@ -222,3 +222,36 @@ Allows user space to disable dea key wrapping, clearing the wrapping key. Parameters: none Returns: 0 + +5. GROUP: KVM_S390_VM_MIGRATION +Architectures: s390 + +5.1. ATTRIBUTE: KVM_S390_VM_MIGRATION_STOP (w/o) + +Allows userspace to stop migration mode, needed for PGSTE migration. +Setting this attribute when migration mode is not active will have no +effects. + +Parameters: none +Returns: 0 + +5.2. ATTRIBUTE: KVM_S390_VM_MIGRATION_START (w/o) + +Allows userspace to start migration mode, needed for PGSTE migration. +Setting this attribute when migration mode is already active will have +no effects. + +Parameters: none +Returns: -ENOMEM if there is not enough free memory to start migration mode + -EINVAL if the state of the VM is invalid (e.g. no memory defined) + 0 in case of success. + +5.3. ATTRIBUTE: KVM_S390_VM_MIGRATION_STATUS (r/o) + +Allows userspace to query the status of migration mode. + +Parameters: address of a buffer in user space to store the data (u64) to; + the data itself is either 0 if migration mode is disabled or 1 + if it is enabled +Returns: -EFAULT if the given address is not accessible from kernel space + 0 in case of success. diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt index 481b6a9c25d5..f50d45b1e967 100644 --- a/Documentation/virtual/kvm/mmu.txt +++ b/Documentation/virtual/kvm/mmu.txt @@ -179,6 +179,10 @@ Shadow pages contain the following information: shadow page; it is also used to go back from a struct kvm_mmu_page to a memslot, through the kvm_memslots_for_spte_role macro and __gfn_to_memslot. + role.ad_disabled: + Is 1 if the MMU instance cannot use A/D bits. EPT did not have A/D + bits before Haswell; shadow EPT page tables also cannot use A/D bits + if the L1 hypervisor does not enable them. gfn: Either the guest page table containing the translations shadowed by this page, or the base page frame for linear translations. See role.direct. diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt index 0a9ea515512a..1ebecc115dc6 100644 --- a/Documentation/virtual/kvm/msr.txt +++ b/Documentation/virtual/kvm/msr.txt @@ -166,10 +166,11 @@ MSR_KVM_SYSTEM_TIME: 0x12 MSR_KVM_ASYNC_PF_EN: 0x4b564d02 data: Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area which must be in guest RAM and must be - zeroed. Bits 5-2 are reserved and should be zero. Bit 0 is 1 + zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1 when asynchronous page faults are enabled on the vcpu 0 when disabled. Bit 1 is 1 if asynchronous page faults can be injected - when vcpu is in cpl == 0. + when vcpu is in cpl == 0. Bit 2 is 1 if asynchronous page faults + are delivered to L1 as #PF vmexits. First 4 byte of 64 byte memory location will be written to by the hypervisor at the time of asynchronous page fault (APF) diff --git a/Documentation/virtual/kvm/vcpu-requests.rst b/Documentation/virtual/kvm/vcpu-requests.rst new file mode 100644 index 000000000000..5feb3706a7ae --- /dev/null +++ b/Documentation/virtual/kvm/vcpu-requests.rst @@ -0,0 +1,307 @@ +================= +KVM VCPU Requests +================= + +Overview +======== + +KVM supports an internal API enabling threads to request a VCPU thread to +perform some activity. For example, a thread may request a VCPU to flush +its TLB with a VCPU request. The API consists of the following functions:: + + /* Check if any requests are pending for VCPU @vcpu. */ + bool kvm_request_pending(struct kvm_vcpu *vcpu); + + /* Check if VCPU @vcpu has request @req pending. */ + bool kvm_test_request(int req, struct kvm_vcpu *vcpu); + + /* Clear request @req for VCPU @vcpu. */ + void kvm_clear_request(int req, struct kvm_vcpu *vcpu); + + /* + * Check if VCPU @vcpu has request @req pending. When the request is + * pending it will be cleared and a memory barrier, which pairs with + * another in kvm_make_request(), will be issued. + */ + bool kvm_check_request(int req, struct kvm_vcpu *vcpu); + + /* + * Make request @req of VCPU @vcpu. Issues a memory barrier, which pairs + * with another in kvm_check_request(), prior to setting the request. + */ + void kvm_make_request(int req, struct kvm_vcpu *vcpu); + + /* Make request @req of all VCPUs of the VM with struct kvm @kvm. */ + bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req); + +Typically a requester wants the VCPU to perform the activity as soon +as possible after making the request. This means most requests +(kvm_make_request() calls) are followed by a call to kvm_vcpu_kick(), +and kvm_make_all_cpus_request() has the kicking of all VCPUs built +into it. + +VCPU Kicks +---------- + +The goal of a VCPU kick is to bring a VCPU thread out of guest mode in +order to perform some KVM maintenance. To do so, an IPI is sent, forcing +a guest mode exit. However, a VCPU thread may not be in guest mode at the +time of the kick. Therefore, depending on the mode and state of the VCPU +thread, there are two other actions a kick may take. All three actions +are listed below: + +1) Send an IPI. This forces a guest mode exit. +2) Waking a sleeping VCPU. Sleeping VCPUs are VCPU threads outside guest + mode that wait on waitqueues. Waking them removes the threads from + the waitqueues, allowing the threads to run again. This behavior + may be suppressed, see KVM_REQUEST_NO_WAKEUP below. +3) Nothing. When the VCPU is not in guest mode and the VCPU thread is not + sleeping, then there is nothing to do. + +VCPU Mode +--------- + +VCPUs have a mode state, ``vcpu->mode``, that is used to track whether the +guest is running in guest mode or not, as well as some specific +outside guest mode states. The architecture may use ``vcpu->mode`` to +ensure VCPU requests are seen by VCPUs (see "Ensuring Requests Are Seen"), +as well as to avoid sending unnecessary IPIs (see "IPI Reduction"), and +even to ensure IPI acknowledgements are waited upon (see "Waiting for +Acknowledgements"). The following modes are defined: + +OUTSIDE_GUEST_MODE + + The VCPU thread is outside guest mode. + +IN_GUEST_MODE + + The VCPU thread is in guest mode. + +EXITING_GUEST_MODE + + The VCPU thread is transitioning from IN_GUEST_MODE to + OUTSIDE_GUEST_MODE. + +READING_SHADOW_PAGE_TABLES + + The VCPU thread is outside guest mode, but it wants the sender of + certain VCPU requests, namely KVM_REQ_TLB_FLUSH, to wait until the VCPU + thread is done reading the page tables. + +VCPU Request Internals +====================== + +VCPU requests are simply bit indices of the ``vcpu->requests`` bitmap. +This means general bitops, like those documented in [atomic-ops]_ could +also be used, e.g. :: + + clear_bit(KVM_REQ_UNHALT & KVM_REQUEST_MASK, &vcpu->requests); + +However, VCPU request users should refrain from doing so, as it would +break the abstraction. The first 8 bits are reserved for architecture +independent requests, all additional bits are available for architecture +dependent requests. + +Architecture Independent Requests +--------------------------------- + +KVM_REQ_TLB_FLUSH + + KVM's common MMU notifier may need to flush all of a guest's TLB + entries, calling kvm_flush_remote_tlbs() to do so. Architectures that + choose to use the common kvm_flush_remote_tlbs() implementation will + need to handle this VCPU request. + +KVM_REQ_MMU_RELOAD + + When shadow page tables are used and memory slots are removed it's + necessary to inform each VCPU to completely refresh the tables. This + request is used for that. + +KVM_REQ_PENDING_TIMER + + This request may be made from a timer handler run on the host on behalf + of a VCPU. It informs the VCPU thread to inject a timer interrupt. + +KVM_REQ_UNHALT + + This request may be made from the KVM common function kvm_vcpu_block(), + which is used to emulate an instruction that causes a CPU to halt until + one of an architectural specific set of events and/or interrupts is + received (determined by checking kvm_arch_vcpu_runnable()). When that + event or interrupt arrives kvm_vcpu_block() makes the request. This is + in contrast to when kvm_vcpu_block() returns due to any other reason, + such as a pending signal, which does not indicate the VCPU's halt + emulation should stop, and therefore does not make the request. + +KVM_REQUEST_MASK +---------------- + +VCPU requests should be masked by KVM_REQUEST_MASK before using them with +bitops. This is because only the lower 8 bits are used to represent the +request's number. The upper bits are used as flags. Currently only two +flags are defined. + +VCPU Request Flags +------------------ + +KVM_REQUEST_NO_WAKEUP + + This flag is applied to requests that only need immediate attention + from VCPUs running in guest mode. That is, sleeping VCPUs do not need + to be awaken for these requests. Sleeping VCPUs will handle the + requests when they are awaken later for some other reason. + +KVM_REQUEST_WAIT + + When requests with this flag are made with kvm_make_all_cpus_request(), + then the caller will wait for each VCPU to acknowledge its IPI before + proceeding. This flag only applies to VCPUs that would receive IPIs. + If, for example, the VCPU is sleeping, so no IPI is necessary, then + the requesting thread does not wait. This means that this flag may be + safely combined with KVM_REQUEST_NO_WAKEUP. See "Waiting for + Acknowledgements" for more information about requests with + KVM_REQUEST_WAIT. + +VCPU Requests with Associated State +=================================== + +Requesters that want the receiving VCPU to handle new state need to ensure +the newly written state is observable to the receiving VCPU thread's CPU +by the time it observes the request. This means a write memory barrier +must be inserted after writing the new state and before setting the VCPU +request bit. Additionally, on the receiving VCPU thread's side, a +corresponding read barrier must be inserted after reading the request bit +and before proceeding to read the new state associated with it. See +scenario 3, Message and Flag, of [lwn-mb]_ and the kernel documentation +[memory-barriers]_. + +The pair of functions, kvm_check_request() and kvm_make_request(), provide +the memory barriers, allowing this requirement to be handled internally by +the API. + +Ensuring Requests Are Seen +========================== + +When making requests to VCPUs, we want to avoid the receiving VCPU +executing in guest mode for an arbitrary long time without handling the +request. We can be sure this won't happen as long as we ensure the VCPU +thread checks kvm_request_pending() before entering guest mode and that a +kick will send an IPI to force an exit from guest mode when necessary. +Extra care must be taken to cover the period after the VCPU thread's last +kvm_request_pending() check and before it has entered guest mode, as kick +IPIs will only trigger guest mode exits for VCPU threads that are in guest +mode or at least have already disabled interrupts in order to prepare to +enter guest mode. This means that an optimized implementation (see "IPI +Reduction") must be certain when it's safe to not send the IPI. One +solution, which all architectures except s390 apply, is to: + +- set ``vcpu->mode`` to IN_GUEST_MODE between disabling the interrupts and + the last kvm_request_pending() check; +- enable interrupts atomically when entering the guest. + +This solution also requires memory barriers to be placed carefully in both +the requesting thread and the receiving VCPU. With the memory barriers we +can exclude the possibility of a VCPU thread observing +!kvm_request_pending() on its last check and then not receiving an IPI for +the next request made of it, even if the request is made immediately after +the check. This is done by way of the Dekker memory barrier pattern +(scenario 10 of [lwn-mb]_). As the Dekker pattern requires two variables, +this solution pairs ``vcpu->mode`` with ``vcpu->requests``. Substituting +them into the pattern gives:: + + CPU1 CPU2 + ================= ================= + local_irq_disable(); + WRITE_ONCE(vcpu->mode, IN_GUEST_MODE); kvm_make_request(REQ, vcpu); + smp_mb(); smp_mb(); + if (kvm_request_pending(vcpu)) { if (READ_ONCE(vcpu->mode) == + IN_GUEST_MODE) { + ...abort guest entry... ...send IPI... + } } + +As stated above, the IPI is only useful for VCPU threads in guest mode or +that have already disabled interrupts. This is why this specific case of +the Dekker pattern has been extended to disable interrupts before setting +``vcpu->mode`` to IN_GUEST_MODE. WRITE_ONCE() and READ_ONCE() are used to +pedantically implement the memory barrier pattern, guaranteeing the +compiler doesn't interfere with ``vcpu->mode``'s carefully planned +accesses. + +IPI Reduction +------------- + +As only one IPI is needed to get a VCPU to check for any/all requests, +then they may be coalesced. This is easily done by having the first IPI +sending kick also change the VCPU mode to something !IN_GUEST_MODE. The +transitional state, EXITING_GUEST_MODE, is used for this purpose. + +Waiting for Acknowledgements +---------------------------- + +Some requests, those with the KVM_REQUEST_WAIT flag set, require IPIs to +be sent, and the acknowledgements to be waited upon, even when the target +VCPU threads are in modes other than IN_GUEST_MODE. For example, one case +is when a target VCPU thread is in READING_SHADOW_PAGE_TABLES mode, which +is set after disabling interrupts. To support these cases, the +KVM_REQUEST_WAIT flag changes the condition for sending an IPI from +checking that the VCPU is IN_GUEST_MODE to checking that it is not +OUTSIDE_GUEST_MODE. + +Request-less VCPU Kicks +----------------------- + +As the determination of whether or not to send an IPI depends on the +two-variable Dekker memory barrier pattern, then it's clear that +request-less VCPU kicks are almost never correct. Without the assurance +that a non-IPI generating kick will still result in an action by the +receiving VCPU, as the final kvm_request_pending() check does for +request-accompanying kicks, then the kick may not do anything useful at +all. If, for instance, a request-less kick was made to a VCPU that was +just about to set its mode to IN_GUEST_MODE, meaning no IPI is sent, then +the VCPU thread may continue its entry without actually having done +whatever it was the kick was meant to initiate. + +One exception is x86's posted interrupt mechanism. In this case, however, +even the request-less VCPU kick is coupled with the same +local_irq_disable() + smp_mb() pattern described above; the ON bit +(Outstanding Notification) in the posted interrupt descriptor takes the +role of ``vcpu->requests``. When sending a posted interrupt, PIR.ON is +set before reading ``vcpu->mode``; dually, in the VCPU thread, +vmx_sync_pir_to_irr() reads PIR after setting ``vcpu->mode`` to +IN_GUEST_MODE. + +Additional Considerations +========================= + +Sleeping VCPUs +-------------- + +VCPU threads may need to consider requests before and/or after calling +functions that may put them to sleep, e.g. kvm_vcpu_block(). Whether they +do or not, and, if they do, which requests need consideration, is +architecture dependent. kvm_vcpu_block() calls kvm_arch_vcpu_runnable() +to check if it should awaken. One reason to do so is to provide +architectures a function where requests may be checked if necessary. + +Clearing Requests +----------------- + +Generally it only makes sense for the receiving VCPU thread to clear a +request. However, in some circumstances, such as when the requesting +thread and the receiving VCPU thread are executed serially, such as when +they are the same thread, or when they are using some form of concurrency +control to temporarily execute synchronously, then it's possible to know +that the request may be cleared immediately, rather than waiting for the +receiving VCPU thread to handle the request in VCPU RUN. The only current +examples of this are kvm_vcpu_block() calls made by VCPUs to block +themselves. A possible side-effect of that call is to make the +KVM_REQ_UNHALT request, which may then be cleared immediately when the +VCPU returns from the call. + +References +========== + +.. [atomic-ops] Documentation/core-api/atomic_ops.rst +.. [memory-barriers] Documentation/memory-barriers.txt +.. [lwn-mb] https://lwn.net/Articles/573436/ diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt index 6b0ca7feb135..6686bd267dc9 100644 --- a/Documentation/vm/ksm.txt +++ b/Documentation/vm/ksm.txt @@ -98,6 +98,50 @@ use_zero_pages - specifies whether empty pages (i.e. allocated pages it is only effective for pages merged after the change. Default: 0 (normal KSM behaviour as in earlier releases) +max_page_sharing - Maximum sharing allowed for each KSM page. This + enforces a deduplication limit to avoid the virtual + memory rmap lists to grow too large. The minimum + value is 2 as a newly created KSM page will have at + least two sharers. The rmap walk has O(N) + complexity where N is the number of rmap_items + (i.e. virtual mappings) that are sharing the page, + which is in turn capped by max_page_sharing. So + this effectively spread the the linear O(N) + computational complexity from rmap walk context + over different KSM pages. The ksmd walk over the + stable_node "chains" is also O(N), but N is the + number of stable_node "dups", not the number of + rmap_items, so it has not a significant impact on + ksmd performance. In practice the best stable_node + "dup" candidate will be kept and found at the head + of the "dups" list. The higher this value the + faster KSM will merge the memory (because there + will be fewer stable_node dups queued into the + stable_node chain->hlist to check for pruning) and + the higher the deduplication factor will be, but + the slowest the worst case rmap walk could be for + any given KSM page. Slowing down the rmap_walk + means there will be higher latency for certain + virtual memory operations happening during + swapping, compaction, NUMA balancing and page + migration, in turn decreasing responsiveness for + the caller of those virtual memory operations. The + scheduler latency of other tasks not involved with + the VM operations doing the rmap walk is not + affected by this parameter as the rmap walks are + always schedule friendly themselves. + +stable_node_chains_prune_millisecs - How frequently to walk the whole + list of stable_node "dups" linked in the + stable_node "chains" in order to prune stale + stable_nodes. Smaller milllisecs values will free + up the KSM metadata with lower latency, but they + will make ksmd use more CPU during the scan. This + only applies to the stable_node chains so it's a + noop if not a single KSM page hit the + max_page_sharing yet (there would be no stable_node + chains in such case). + The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/: pages_shared - how many shared pages are being used @@ -106,10 +150,29 @@ pages_unshared - how many pages unique but repeatedly checked for merging pages_volatile - how many pages changing too fast to be placed in a tree full_scans - how many times all mergeable areas have been scanned +stable_node_chains - number of stable node chains allocated, this is + effectively the number of KSM pages that hit the + max_page_sharing limit +stable_node_dups - number of stable node dups queued into the + stable_node chains + A high ratio of pages_sharing to pages_shared indicates good sharing, but a high ratio of pages_unshared to pages_sharing indicates wasted effort. pages_volatile embraces several different kinds of activity, but a high proportion there would also indicate poor use of madvise MADV_MERGEABLE. +The maximum possible page_sharing/page_shared ratio is limited by the +max_page_sharing tunable. To increase the ratio max_page_sharing must +be increased accordingly. + +The stable_node_dups/stable_node_chains ratio is also affected by the +max_page_sharing tunable, and an high ratio may indicate fragmentation +in the stable_node dups, which could be solved by introducing +fragmentation algorithms in ksmd which would refile rmap_items from +one stable_node dup to another stable_node dup, in order to freeup +stable_node "dups" with few rmap_items in them, but that may increase +the ksmd CPU usage and possibly slowdown the readonly computations on +the KSM pages of the applications. + Izik Eidus, Hugh Dickins, 17 Nov 2009 diff --git a/Documentation/watchdog/watchdog-parameters.txt b/Documentation/watchdog/watchdog-parameters.txt index 914518aeb972..b3526365ea8e 100644 --- a/Documentation/watchdog/watchdog-parameters.txt +++ b/Documentation/watchdog/watchdog-parameters.txt @@ -369,6 +369,12 @@ timeout: Watchdog timeout in seconds. (0<timeout<N, default=60) nowayout: Watchdog cannot be stopped once started (default=kernel config parameter) ------------------------------------------------- +uniphier_wdt: +timeout: Watchdog timeout in power of two seconds. + (1 <= timeout <= 128, default=64) +nowayout: Watchdog cannot be stopped once started + (default=kernel config parameter) +------------------------------------------------- w83627hf_wdt: wdt_io: w83627hf/thf WDT io port (default 0x2E) timeout: Watchdog timeout in seconds. 1 <= timeout <= 255, default=60. diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt index 61b611e9eeaf..b297c48389b9 100644 --- a/Documentation/x86/x86_64/boot-options.txt +++ b/Documentation/x86/x86_64/boot-options.txt @@ -36,7 +36,8 @@ Machine check to broadcast MCEs. mce=bootlog Enable logging of machine checks left over from booting. - Disabled by default on AMD because some BIOS leave bogus ones. + Disabled by default on AMD Fam10h and older because some BIOS + leave bogus ones. If your BIOS doesn't do that it's a good idea to enable though to make sure you log even machine check events that result in a reboot. On Intel systems it is enabled by default. diff --git a/Documentation/xillybus.txt b/Documentation/xillybus.txt index 1660145b9969..2446ee303c09 100644 --- a/Documentation/xillybus.txt +++ b/Documentation/xillybus.txt @@ -1,12 +1,11 @@ +========================================== +Xillybus driver for generic FPGA interface +========================================== - ========================================== - Xillybus driver for generic FPGA interface - ========================================== +:Author: Eli Billauer, Xillybus Ltd. (http://xillybus.com) +:Email: eli.billauer@gmail.com or as advertised on Xillybus' site. -Author: Eli Billauer, Xillybus Ltd. (http://xillybus.com) -Email: eli.billauer@gmail.com or as advertised on Xillybus' site. - -Contents: +.. Contents: - Introduction -- Background @@ -17,7 +16,7 @@ Contents: -- Synchronization -- Seekable pipes -- Internals + - Internals -- Source code organization -- Pipe attributes -- Host never reads from the FPGA @@ -29,7 +28,7 @@ Contents: -- The "nonempty" message (supporting poll) -INTRODUCTION +Introduction ============ Background @@ -105,7 +104,7 @@ driver is used to work out of the box with any Xillybus IP core. The data structure just mentioned should not be confused with PCI's configuration space or the Flattened Device Tree. -USAGE +Usage ===== User interface @@ -117,11 +116,11 @@ names of these files depend on the IP core that is loaded in the FPGA (see Probing below). To communicate with the FPGA, open the device file that corresponds to the hardware FIFO you want to send data or receive data from, and use plain write() or read() calls, just like with a regular pipe. In -particular, it makes perfect sense to go: +particular, it makes perfect sense to go:: -$ cat mydata > /dev/xillybus_thisfifo + $ cat mydata > /dev/xillybus_thisfifo -$ cat /dev/xillybus_thatfifo > hisdata + $ cat /dev/xillybus_thatfifo > hisdata possibly pressing CTRL-C as some stage, even though the xillybus_* pipes have the capability to send an EOF (but may not use it). @@ -178,7 +177,7 @@ the attached memory is done by seeking to the desired address, and calling read() or write() as required. -INTERNALS +Internals ========= Source code organization @@ -365,7 +364,7 @@ into that page. It can be shown that all pages requested from the kernel (except possibly for the last) are 100% utilized this way. The "nonempty" message (supporting poll) ---------------------------------------- +---------------------------------------- In order to support the "poll" method (and hence select() ), there is a small catch regarding the FPGA to host direction: The FPGA may have filled a DMA diff --git a/Documentation/xtensa/mmu.txt b/Documentation/xtensa/mmu.txt index 222a2c6748e6..5de8715d5bec 100644 --- a/Documentation/xtensa/mmu.txt +++ b/Documentation/xtensa/mmu.txt @@ -41,9 +41,9 @@ The scheme below assumes that the kernel is loaded below 0x40000000. 00..1F -> 00 -> 00 -> 00 The default location of IO peripherals is above 0xf0000000. This may be changed -using a "ranges" property in a device tree simple-bus node. See ePAPR 1.1, ยง6.5 -for details on the syntax and semantic of simple-bus nodes. The following -limitations apply: +using a "ranges" property in a device tree simple-bus node. See the Devicetree +Specification, section 4.5 for details on the syntax and semantics of +simple-bus nodes. The following limitations apply: 1. Only top level simple-bus nodes are considered diff --git a/Documentation/xz.txt b/Documentation/xz.txt index 2cf3e2608de3..b2220d03aa50 100644 --- a/Documentation/xz.txt +++ b/Documentation/xz.txt @@ -1,121 +1,127 @@ - +============================ XZ data compression in Linux ============================ Introduction +============ - XZ is a general purpose data compression format with high compression - ratio and relatively fast decompression. The primary compression - algorithm (filter) is LZMA2. Additional filters can be used to improve - compression ratio even further. E.g. Branch/Call/Jump (BCJ) filters - improve compression ratio of executable data. +XZ is a general purpose data compression format with high compression +ratio and relatively fast decompression. The primary compression +algorithm (filter) is LZMA2. Additional filters can be used to improve +compression ratio even further. E.g. Branch/Call/Jump (BCJ) filters +improve compression ratio of executable data. - The XZ decompressor in Linux is called XZ Embedded. It supports - the LZMA2 filter and optionally also BCJ filters. CRC32 is supported - for integrity checking. The home page of XZ Embedded is at - <http://tukaani.org/xz/embedded.html>, where you can find the - latest version and also information about using the code outside - the Linux kernel. +The XZ decompressor in Linux is called XZ Embedded. It supports +the LZMA2 filter and optionally also BCJ filters. CRC32 is supported +for integrity checking. The home page of XZ Embedded is at +<http://tukaani.org/xz/embedded.html>, where you can find the +latest version and also information about using the code outside +the Linux kernel. - For userspace, XZ Utils provide a zlib-like compression library - and a gzip-like command line tool. XZ Utils can be downloaded from - <http://tukaani.org/xz/>. +For userspace, XZ Utils provide a zlib-like compression library +and a gzip-like command line tool. XZ Utils can be downloaded from +<http://tukaani.org/xz/>. XZ related components in the kernel - - The xz_dec module provides XZ decompressor with single-call (buffer - to buffer) and multi-call (stateful) APIs. The usage of the xz_dec - module is documented in include/linux/xz.h. - - The xz_dec_test module is for testing xz_dec. xz_dec_test is not - useful unless you are hacking the XZ decompressor. xz_dec_test - allocates a char device major dynamically to which one can write - .xz files from userspace. The decompressed output is thrown away. - Keep an eye on dmesg to see diagnostics printed by xz_dec_test. - See the xz_dec_test source code for the details. - - For decompressing the kernel image, initramfs, and initrd, there - is a wrapper function in lib/decompress_unxz.c. Its API is the - same as in other decompress_*.c files, which is defined in - include/linux/decompress/generic.h. - - scripts/xz_wrap.sh is a wrapper for the xz command line tool found - from XZ Utils. The wrapper sets compression options to values suitable - for compressing the kernel image. - - For kernel makefiles, two commands are provided for use with - $(call if_needed). The kernel image should be compressed with - $(call if_needed,xzkern) which will use a BCJ filter and a big LZMA2 - dictionary. It will also append a four-byte trailer containing the - uncompressed size of the file, which is needed by the boot code. - Other things should be compressed with $(call if_needed,xzmisc) - which will use no BCJ filter and 1 MiB LZMA2 dictionary. +=================================== + +The xz_dec module provides XZ decompressor with single-call (buffer +to buffer) and multi-call (stateful) APIs. The usage of the xz_dec +module is documented in include/linux/xz.h. + +The xz_dec_test module is for testing xz_dec. xz_dec_test is not +useful unless you are hacking the XZ decompressor. xz_dec_test +allocates a char device major dynamically to which one can write +.xz files from userspace. The decompressed output is thrown away. +Keep an eye on dmesg to see diagnostics printed by xz_dec_test. +See the xz_dec_test source code for the details. + +For decompressing the kernel image, initramfs, and initrd, there +is a wrapper function in lib/decompress_unxz.c. Its API is the +same as in other decompress_*.c files, which is defined in +include/linux/decompress/generic.h. + +scripts/xz_wrap.sh is a wrapper for the xz command line tool found +from XZ Utils. The wrapper sets compression options to values suitable +for compressing the kernel image. + +For kernel makefiles, two commands are provided for use with +$(call if_needed). The kernel image should be compressed with +$(call if_needed,xzkern) which will use a BCJ filter and a big LZMA2 +dictionary. It will also append a four-byte trailer containing the +uncompressed size of the file, which is needed by the boot code. +Other things should be compressed with $(call if_needed,xzmisc) +which will use no BCJ filter and 1 MiB LZMA2 dictionary. Notes on compression options +============================ - Since the XZ Embedded supports only streams with no integrity check or - CRC32, make sure that you don't use some other integrity check type - when encoding files that are supposed to be decoded by the kernel. With - liblzma, you need to use either LZMA_CHECK_NONE or LZMA_CHECK_CRC32 - when encoding. With the xz command line tool, use --check=none or - --check=crc32. - - Using CRC32 is strongly recommended unless there is some other layer - which will verify the integrity of the uncompressed data anyway. - Double checking the integrity would probably be waste of CPU cycles. - Note that the headers will always have a CRC32 which will be validated - by the decoder; you can only change the integrity check type (or - disable it) for the actual uncompressed data. - - In userspace, LZMA2 is typically used with dictionary sizes of several - megabytes. The decoder needs to have the dictionary in RAM, thus big - dictionaries cannot be used for files that are intended to be decoded - by the kernel. 1 MiB is probably the maximum reasonable dictionary - size for in-kernel use (maybe more is OK for initramfs). The presets - in XZ Utils may not be optimal when creating files for the kernel, - so don't hesitate to use custom settings. Example: - - xz --check=crc32 --lzma2=dict=512KiB inputfile - - An exception to above dictionary size limitation is when the decoder - is used in single-call mode. Decompressing the kernel itself is an - example of this situation. In single-call mode, the memory usage - doesn't depend on the dictionary size, and it is perfectly fine to - use a big dictionary: for maximum compression, the dictionary should - be at least as big as the uncompressed data itself. +Since the XZ Embedded supports only streams with no integrity check or +CRC32, make sure that you don't use some other integrity check type +when encoding files that are supposed to be decoded by the kernel. With +liblzma, you need to use either LZMA_CHECK_NONE or LZMA_CHECK_CRC32 +when encoding. With the xz command line tool, use --check=none or +--check=crc32. + +Using CRC32 is strongly recommended unless there is some other layer +which will verify the integrity of the uncompressed data anyway. +Double checking the integrity would probably be waste of CPU cycles. +Note that the headers will always have a CRC32 which will be validated +by the decoder; you can only change the integrity check type (or +disable it) for the actual uncompressed data. + +In userspace, LZMA2 is typically used with dictionary sizes of several +megabytes. The decoder needs to have the dictionary in RAM, thus big +dictionaries cannot be used for files that are intended to be decoded +by the kernel. 1 MiB is probably the maximum reasonable dictionary +size for in-kernel use (maybe more is OK for initramfs). The presets +in XZ Utils may not be optimal when creating files for the kernel, +so don't hesitate to use custom settings. Example:: + + xz --check=crc32 --lzma2=dict=512KiB inputfile + +An exception to above dictionary size limitation is when the decoder +is used in single-call mode. Decompressing the kernel itself is an +example of this situation. In single-call mode, the memory usage +doesn't depend on the dictionary size, and it is perfectly fine to +use a big dictionary: for maximum compression, the dictionary should +be at least as big as the uncompressed data itself. Future plans +============ - Creating a limited XZ encoder may be considered if people think it is - useful. LZMA2 is slower to compress than e.g. Deflate or LZO even at - the fastest settings, so it isn't clear if LZMA2 encoder is wanted - into the kernel. +Creating a limited XZ encoder may be considered if people think it is +useful. LZMA2 is slower to compress than e.g. Deflate or LZO even at +the fastest settings, so it isn't clear if LZMA2 encoder is wanted +into the kernel. - Support for limited random-access reading is planned for the - decompression code. I don't know if it could have any use in the - kernel, but I know that it would be useful in some embedded projects - outside the Linux kernel. +Support for limited random-access reading is planned for the +decompression code. I don't know if it could have any use in the +kernel, but I know that it would be useful in some embedded projects +outside the Linux kernel. Conformance to the .xz file format specification +================================================ - There are a couple of corner cases where things have been simplified - at expense of detecting errors as early as possible. These should not - matter in practice all, since they don't cause security issues. But - it is good to know this if testing the code e.g. with the test files - from XZ Utils. +There are a couple of corner cases where things have been simplified +at expense of detecting errors as early as possible. These should not +matter in practice all, since they don't cause security issues. But +it is good to know this if testing the code e.g. with the test files +from XZ Utils. Reporting bugs +============== - Before reporting a bug, please check that it's not fixed already - at upstream. See <http://tukaani.org/xz/embedded.html> to get the - latest code. +Before reporting a bug, please check that it's not fixed already +at upstream. See <http://tukaani.org/xz/embedded.html> to get the +latest code. - Report bugs to <lasse.collin@tukaani.org> or visit #tukaani on - Freenode and talk to Larhzu. I don't actively read LKML or other - kernel-related mailing lists, so if there's something I should know, - you should email to me personally or use IRC. +Report bugs to <lasse.collin@tukaani.org> or visit #tukaani on +Freenode and talk to Larhzu. I don't actively read LKML or other +kernel-related mailing lists, so if there's something I should know, +you should email to me personally or use IRC. - Don't bother Igor Pavlov with questions about the XZ implementation - in the kernel or about XZ Utils. While these two implementations - include essential code that is directly based on Igor Pavlov's code, - these implementations aren't maintained nor supported by him. +Don't bother Igor Pavlov with questions about the XZ implementation +in the kernel or about XZ Utils. While these two implementations +include essential code that is directly based on Igor Pavlov's code, +these implementations aren't maintained nor supported by him. diff --git a/Documentation/zorro.txt b/Documentation/zorro.txt index d530971beb00..664072b017e3 100644 --- a/Documentation/zorro.txt +++ b/Documentation/zorro.txt @@ -1,12 +1,13 @@ - Writing Device Drivers for Zorro Devices - ---------------------------------------- +======================================== +Writing Device Drivers for Zorro Devices +======================================== -Written by Geert Uytterhoeven <geert@linux-m68k.org> -Last revised: September 5, 2003 +:Author: Written by Geert Uytterhoeven <geert@linux-m68k.org> +:Last revised: September 5, 2003 -1. Introduction ---------------- +Introduction +------------ The Zorro bus is the bus used in the Amiga family of computers. Thanks to AutoConfig(tm), it's 100% Plug-and-Play. @@ -20,12 +21,12 @@ There are two types of Zorro buses, Zorro II and Zorro III: with Zorro II. The Zorro III address space lies outside the first 16 MB. -2. Probing for Zorro Devices ----------------------------- +Probing for Zorro Devices +------------------------- -Zorro devices are found by calling `zorro_find_device()', which returns a -pointer to the `next' Zorro device with the specified Zorro ID. A probe loop -for the board with Zorro ID `ZORRO_PROD_xxx' looks like: +Zorro devices are found by calling ``zorro_find_device()``, which returns a +pointer to the ``next`` Zorro device with the specified Zorro ID. A probe loop +for the board with Zorro ID ``ZORRO_PROD_xxx`` looks like:: struct zorro_dev *z = NULL; @@ -35,8 +36,8 @@ for the board with Zorro ID `ZORRO_PROD_xxx' looks like: ... } -`ZORRO_WILDCARD' acts as a wildcard and finds any Zorro device. If your driver -supports different types of boards, you can use a construct like: +``ZORRO_WILDCARD`` acts as a wildcard and finds any Zorro device. If your driver +supports different types of boards, you can use a construct like:: struct zorro_dev *z = NULL; @@ -49,24 +50,24 @@ supports different types of boards, you can use a construct like: } -3. Zorro Resources ------------------- +Zorro Resources +--------------- Before you can access a Zorro device's registers, you have to make sure it's not yet in use. This is done using the I/O memory space resource management -functions: +functions:: request_mem_region() release_mem_region() -Shortcuts to claim the whole device's address space are provided as well: +Shortcuts to claim the whole device's address space are provided as well:: zorro_request_device zorro_release_device -4. Accessing the Zorro Address Space ------------------------------------- +Accessing the Zorro Address Space +--------------------------------- The address regions in the Zorro device resources are Zorro bus address regions. Due to the identity bus-physical address mapping on the Zorro bus, @@ -78,26 +79,26 @@ The treatment of these regions depends on the type of Zorro space: explicitly using z_ioremap(). Conversion from bus/physical Zorro II addresses to kernel virtual addresses - and vice versa is done using: + and vice versa is done using:: virt_addr = ZTWO_VADDR(bus_addr); bus_addr = ZTWO_PADDR(virt_addr); - Zorro III address space must be mapped explicitly using z_ioremap() first - before it can be accessed: + before it can be accessed:: virt_addr = z_ioremap(bus_addr, size); ... z_iounmap(virt_addr); -5. References -------------- +References +---------- -linux/include/linux/zorro.h -linux/include/uapi/linux/zorro.h -linux/include/uapi/linux/zorro_ids.h -linux/arch/m68k/include/asm/zorro.h -linux/drivers/zorro -/proc/bus/zorro +#. linux/include/linux/zorro.h +#. linux/include/uapi/linux/zorro.h +#. linux/include/uapi/linux/zorro_ids.h +#. linux/arch/m68k/include/asm/zorro.h +#. linux/drivers/zorro +#. /proc/bus/zorro |