From 9f63df26beea78aae0500d1225a99bbd905dd157 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Sun, 6 Jan 2019 19:19:29 -0800 Subject: Documentation/filesystems: fix title underline lengths in path-lookup.rst Fix Sphinx warnings in path-lookup.rst: Documentation/filesystems/path-lookup.rst:347: WARNING: Title underline too short. Documentation/filesystems/path-lookup.rst:358: WARNING: Title underline too short. [...] Signed-off-by: Randy Dunlap Acked-by: NeilBrown Signed-off-by: Jonathan Corbet --- Documentation/filesystems/path-lookup.rst | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 9d6b68853f5b..80e22eda4132 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -344,7 +344,7 @@ In particular it is held while scanning chains in the dcache hash table, and the mount point hash table. Bringing it together with ``struct nameidata`` --------------------------------------------- +---------------------------------------------- .. _First edition Unix: http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s @@ -355,7 +355,7 @@ converts a "name" to an "inode". ``struct nameidata`` contains (among other fields): ``struct path path`` -~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~ A ``path`` contains a ``struct vfsmount`` (which is embedded in a ``struct mount``) and a ``struct dentry``. Together these @@ -366,13 +366,13 @@ step. A reference through ``d_lockref`` and ``mnt_count`` is always held. ``struct qstr last`` -~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~ This is a string together with a length (i.e. _not_ ``nul`` terminated) that is the "next" component in the pathname. ``int last_type`` -~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~ This is one of ``LAST_NORM``, ``LAST_ROOT``, ``LAST_DOT``, ``LAST_DOTDOT``, or ``LAST_BIND``. The ``last`` field is only valid if the type is @@ -381,7 +381,7 @@ components of the symlink have been processed yet. Others should be fairly self-explanatory. ``struct path root`` -~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~ This is used to hold a reference to the effective root of the filesystem. Often that reference won't be needed, so this field is @@ -510,7 +510,7 @@ potentially interesting things about these dentries corresponding to three different flags that might be set in ``dentry->d_flags``: ``DCACHE_MANAGE_TRANSIT`` -~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~ If this flag has been set, then the filesystem has requested that the ``d_manage()`` dentry operation be called before handling any possible @@ -529,7 +529,7 @@ filesystem, which will then give it a special pass through ``d_manage()`` by returning ``-EISDIR``. ``DCACHE_MOUNTED`` -~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~ This flag is set on every dentry that is mounted on. As Linux supports multiple filesystem namespaces, it is possible that the @@ -542,7 +542,7 @@ If this flag is set, and ``d_manage()`` didn't return ``-EISDIR``, and a new ``dentry`` (both with counted references). ``DCACHE_NEED_AUTOMOUNT`` -~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~ If ``d_manage()`` allowed us to get this far, and ``lookup_mnt()`` didn't find a mount point, then this flag causes the ``d_automount()`` dentry @@ -698,7 +698,7 @@ With that little refresher on seqlocks out of the way we can look at the bigger picture of how RCU-walk uses seqlocks. ``mount_lock`` and ``nd->m_seq`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We already met the ``mount_lock`` seqlock when REF-walk used it to ensure that crossing a mount point is performed safely. RCU-walk uses @@ -727,7 +727,7 @@ results would have been the same. This ensures the invariant holds, at least for vfsmount structures. ``dentry->d_seq`` and ``nd->seq`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In place of taking a count or lock on ``d_reflock``, RCU-walk samples the per-dentry ``d_seq`` seqlock, and stores the sequence number in the @@ -774,7 +774,7 @@ getting a counted reference to the new dentry before dropping that for the old dentry which we saw in REF-walk. No ``inode->i_rwsem`` or even ``rename_lock`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A semaphore is a fairly heavyweight lock that can only be taken when it is permissible to sleep. As ``rcu_read_lock()`` forbids sleeping, @@ -796,7 +796,7 @@ locking. This neatly handles all cases, so adding extra checks on rename_lock would bring no significant value. ``unlazy walk()`` and ``complete_walk()`` -------------------------------------- +----------------------------------------- That "dropping down to REF-walk" typically involves a call to ``unlazy_walk()``, so named because "RCU-walk" is also sometimes -- cgit v1.2.3 From 35283f56626cbc6d5b2dcb42f6d2189ce34fee3d Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Fri, 11 Jan 2019 14:40:59 +0100 Subject: Documentation/filesystems: add binderfs This documents the Android binderfs filesystem used to dynamically add and remove binder devices that are private to each instance. Signed-off-by: Christian Brauner [jc: tweaked markup and added to filesystems/index.rst] Signed-off-by: Jonathan Corbet --- Documentation/filesystems/binderfs.rst | 68 ++++++++++++++++++++++++++++++++++ Documentation/filesystems/index.rst | 7 ++++ 2 files changed, 75 insertions(+) create mode 100644 Documentation/filesystems/binderfs.rst (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/binderfs.rst b/Documentation/filesystems/binderfs.rst new file mode 100644 index 000000000000..c009671f8434 --- /dev/null +++ b/Documentation/filesystems/binderfs.rst @@ -0,0 +1,68 @@ +.. SPDX-License-Identifier: GPL-2.0 + +The Android binderfs Filesystem +=============================== + +Android binderfs is a filesystem for the Android binder IPC mechanism. It +allows to dynamically add and remove binder devices at runtime. Binder devices +located in a new binderfs instance are independent of binder devices located in +other binderfs instances. Mounting a new binderfs instance makes it possible +to get a set of private binder devices. + +Mounting binderfs +----------------- + +Android binderfs can be mounted with:: + + mkdir /dev/binderfs + mount -t binder binder /dev/binderfs + +at which point a new instance of binderfs will show up at ``/dev/binderfs``. +In a fresh instance of binderfs no binder devices will be present. There will +only be a ``binder-control`` device which serves as the request handler for +binderfs. Mounting another binderfs instance at a different location will +create a new and separate instance from all other binderfs mounts. This is +identical to the behavior of e.g. ``devpts`` and ``tmpfs``. The Android +binderfs filesystem can be mounted in user namespaces. + +Options +------- +max + binderfs instances can be mounted with a limit on the number of binder + devices that can be allocated. The ``max=`` mount option serves as + a per-instance limit. If ``max=`` is set then only ```` number + of binder devices can be allocated in this binderfs instance. + +Allocating binder Devices +------------------------- + +.. _ioctl: http://man7.org/linux/man-pages/man2/ioctl.2.html + +To allocate a new binder device in a binderfs instance a request needs to be +sent through the ``binder-control`` device node. A request is sent in the form +of an `ioctl() `_. + +What a program needs to do is to open the ``binder-control`` device node and +send a ``BINDER_CTL_ADD`` request to the kernel. Users of binderfs need to +tell the kernel which name the new binder device should get. By default a name +can only contain up to ``BINDERFS_MAX_NAME`` chars including the terminating +zero byte. + +Once the request is made via an `ioctl() `_ passing a ``struct +binder_device`` with the name to the kernel it will allocate a new binder +device and return the major and minor number of the new device in the struct +(This is necessary because binderfs allocates a major device number +dynamically.). After the `ioctl() `_ returns there will be a new +binder device located under /dev/binderfs with the chosen name. + +Deleting binder Devices +----------------------- + +.. _unlink: http://man7.org/linux/man-pages/man2/unlink.2.html +.. _rm: http://man7.org/linux/man-pages/man1/rm.1.html + +Binderfs binder devices can be deleted via `unlink() `_. This means +that the `rm() `_ tool can be used to delete them. Note that the +``binder-control`` device cannot be deleted since this would make the binderfs +instance unuseable. The ``binder-control`` device will be deleted when the +binderfs instance is unmounted and all references to it have been dropped. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 605befab300b..61d2441b25d5 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -380,3 +380,10 @@ including: :maxdepth: 2 path-lookup.rst + +binderfs +======== + +.. toctree:: + + binderfs.rst -- cgit v1.2.3 From 44a47f0e3ec213a670c0170bb8ee588551ff1549 Mon Sep 17 00:00:00 2001 From: Nicholas Mc Guire Date: Fri, 15 Feb 2019 08:29:48 +0100 Subject: sysfs.txt: add note on available attribute macros The common cases of attributes wrappers should probably be using the __ATTR_XXX macros to make code more concise and readable but the current sysfs.txt does not point developers to those convenience macros. Further there is no note in sysfs.txt currently explaining why trying to set a sysfs file to mode 0666 will fail respectively revert to 0664. Signed-off-by: Nicholas Mc Guire Signed-off-by: Jonathan Corbet --- Documentation/filesystems/sysfs.txt | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt index a1426cabcef1..d70c8f112662 100644 --- a/Documentation/filesystems/sysfs.txt +++ b/Documentation/filesystems/sysfs.txt @@ -116,6 +116,27 @@ static struct device_attribute dev_attr_foo = { .store = store_foo, }; +Note as stated in include/linux/kernel.h "OTHER_WRITABLE? Generally +considered a bad idea." so trying to set a sysfs file writable for +everyone will fail reverting to RO mode for "Others". + +For the common cases sysfs.h provides convenience macros to make +defining attributes easier as well as making code more concise and +readable. The above case could be shortened to: + +static struct device_attribute dev_attr_foo = __ATTR_RW(foo); + +the list of helpers available to define your wrapper function is: +__ATTR_RO(name): assumes default name_show and mode 0444 +__ATTR_WO(name): assumes a name_store only and is restricted to mode + 0200 that is root write access only. +__ATTR_RO_MODE(name, mode): fore more restrictive RO access currently + only use case is the EFI System Resource Table + (see drivers/firmware/efi/esrt.c) +__ATTR_RW(name): assumes default name_show, name_store and setting + mode to 0644. +__ATTR_NULL: which sets the name to NULL and is used as end of list + indicator (see: kernel/workqueue.c) Subsystem-Specific Callbacks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -- cgit v1.2.3 From 4064174becc09a5a2385a27c8a6fd40888b0e13c Mon Sep 17 00:00:00 2001 From: Jonathan Corbet Date: Wed, 20 Feb 2019 15:29:36 -0700 Subject: docs: Bring some order to filesystem documentation Documentation/filesystems is, like much of the rest of the kernel's documentation, a jumble of unorganized information. Split the documentation into categories and try to bring some order to the top-level index.rst files. No text changes other than a few section-introductory blurbs; this is all just moving stuff around. Cc: linux-fsdevel@vger.kernel.org Cc: Al Viro Signed-off-by: Jonathan Corbet --- Documentation/filesystems/api-summary.rst | 150 ++++++++++++ Documentation/filesystems/index.rst | 394 ++---------------------------- Documentation/filesystems/journalling.rst | 184 ++++++++++++++ Documentation/filesystems/path-lookup.rst | 15 ++ Documentation/filesystems/splice.rst | 22 ++ 5 files changed, 395 insertions(+), 370 deletions(-) create mode 100644 Documentation/filesystems/api-summary.rst create mode 100644 Documentation/filesystems/journalling.rst create mode 100644 Documentation/filesystems/splice.rst (limited to 'Documentation/filesystems') diff --git a/Documentation/filesystems/api-summary.rst b/Documentation/filesystems/api-summary.rst new file mode 100644 index 000000000000..aa51ffcfa029 --- /dev/null +++ b/Documentation/filesystems/api-summary.rst @@ -0,0 +1,150 @@ +============================= +Linux Filesystems API summary +============================= + +This section contains API-level documentation, mostly taken from the source +code itself. + +The Linux VFS +============= + +The Filesystem types +-------------------- + +.. kernel-doc:: include/linux/fs.h + :internal: + +The Directory Cache +------------------- + +.. kernel-doc:: fs/dcache.c + :export: + +.. kernel-doc:: include/linux/dcache.h + :internal: + +Inode Handling +-------------- + +.. kernel-doc:: fs/inode.c + :export: + +.. kernel-doc:: fs/bad_inode.c + :export: + +Registration and Superblocks +---------------------------- + +.. kernel-doc:: fs/super.c + :export: + +File Locks +---------- + +.. kernel-doc:: fs/locks.c + :export: + +.. kernel-doc:: fs/locks.c + :internal: + +Other Functions +--------------- + +.. kernel-doc:: fs/mpage.c + :export: + +.. kernel-doc:: fs/namei.c + :export: + +.. kernel-doc:: fs/buffer.c + :export: + +.. kernel-doc:: block/bio.c + :export: + +.. kernel-doc:: fs/seq_file.c + :export: + +.. kernel-doc:: fs/filesystems.c + :export: + +.. kernel-doc:: fs/fs-writeback.c + :export: + +.. kernel-doc:: fs/block_dev.c + :export: + +.. kernel-doc:: fs/anon_inodes.c + :export: + +.. kernel-doc:: fs/attr.c + :export: + +.. kernel-doc:: fs/d_path.c + :export: + +.. kernel-doc:: fs/dax.c + :export: + +.. kernel-doc:: fs/direct-io.c + :export: + +.. kernel-doc:: fs/file_table.c + :export: + +.. kernel-doc:: fs/libfs.c + :export: + +.. kernel-doc:: fs/posix_acl.c + :export: + +.. kernel-doc:: fs/stat.c + :export: + +.. kernel-doc:: fs/sync.c + :export: + +.. kernel-doc:: fs/xattr.c + :export: + +The proc filesystem +=================== + +sysctl interface +---------------- + +.. kernel-doc:: kernel/sysctl.c + :export: + +proc filesystem interface +------------------------- + +.. kernel-doc:: fs/proc/base.c + :internal: + +Events based on file descriptors +================================ + +.. kernel-doc:: fs/eventfd.c + :export: + +The Filesystem for Exporting Kernel Objects +=========================================== + +.. kernel-doc:: fs/sysfs/file.c + :export: + +.. kernel-doc:: fs/sysfs/symlink.c + :export: + +The debugfs filesystem +====================== + +debugfs interface +----------------- + +.. kernel-doc:: fs/debugfs/inode.c + :export: + +.. kernel-doc:: fs/debugfs/file.c + :export: diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 61d2441b25d5..1131c34d77f6 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -1,389 +1,43 @@ -===================== -Linux Filesystems API -===================== +=============================== +Filesystems in the Linux kernel +=============================== -The Linux VFS -============= +This under-development manual will, some glorious day, provide +comprehensive information on how the Linux virtual filesystem (VFS) layer +works, along with the filesystems that sit below it. For now, what we have +can be found below. -The Filesystem types --------------------- - -.. kernel-doc:: include/linux/fs.h - :internal: - -The Directory Cache -------------------- - -.. kernel-doc:: fs/dcache.c - :export: - -.. kernel-doc:: include/linux/dcache.h - :internal: - -Inode Handling --------------- - -.. kernel-doc:: fs/inode.c - :export: - -.. kernel-doc:: fs/bad_inode.c - :export: - -Registration and Superblocks ----------------------------- - -.. kernel-doc:: fs/super.c - :export: - -File Locks ----------- - -.. kernel-doc:: fs/locks.c - :export: - -.. kernel-doc:: fs/locks.c - :internal: - -Other Functions ---------------- - -.. kernel-doc:: fs/mpage.c - :export: - -.. kernel-doc:: fs/namei.c - :export: - -.. kernel-doc:: fs/buffer.c - :export: - -.. kernel-doc:: block/bio.c - :export: - -.. kernel-doc:: fs/seq_file.c - :export: - -.. kernel-doc:: fs/filesystems.c - :export: - -.. kernel-doc:: fs/fs-writeback.c - :export: - -.. kernel-doc:: fs/block_dev.c - :export: - -.. kernel-doc:: fs/anon_inodes.c - :export: - -.. kernel-doc:: fs/attr.c - :export: - -.. kernel-doc:: fs/d_path.c - :export: - -.. kernel-doc:: fs/dax.c - :export: - -.. kernel-doc:: fs/direct-io.c - :export: - -.. kernel-doc:: fs/file_table.c - :export: - -.. kernel-doc:: fs/libfs.c - :export: - -.. kernel-doc:: fs/posix_acl.c - :export: - -.. kernel-doc:: fs/stat.c - :export: - -.. kernel-doc:: fs/sync.c - :export: - -.. kernel-doc:: fs/xattr.c - :export: - -The proc filesystem -=================== - -sysctl interface ----------------- - -.. kernel-doc:: kernel/sysctl.c - :export: - -proc filesystem interface -------------------------- - -.. kernel-doc:: fs/proc/base.c - :internal: - -Events based on file descriptors -================================ - -.. kernel-doc:: fs/eventfd.c - :export: - -The Filesystem for Exporting Kernel Objects -=========================================== - -.. kernel-doc:: fs/sysfs/file.c - :export: - -.. kernel-doc:: fs/sysfs/symlink.c - :export: - -The debugfs filesystem +Core VFS documentation ====================== -debugfs interface ------------------ +See these manuals for documentation about the VFS layer itself and how its +algorithms work. -.. kernel-doc:: fs/debugfs/inode.c - :export: +.. toctree:: + :maxdepth: 2 -.. kernel-doc:: fs/debugfs/file.c - :export: + path-lookup.rst + api-summary + splice -The Linux Journalling API +Filesystem support layers ========================= -Overview --------- - -Details -~~~~~~~ - -The journalling layer is easy to use. You need to first of all create a -journal_t data structure. There are two calls to do this dependent on -how you decide to allocate the physical media on which the journal -resides. The :c:func:`jbd2_journal_init_inode` call is for journals stored in -filesystem inodes, or the :c:func:`jbd2_journal_init_dev` call can be used -for journal stored on a raw device (in a continuous range of blocks). A -journal_t is a typedef for a struct pointer, so when you are finally -finished make sure you call :c:func:`jbd2_journal_destroy` on it to free up -any used kernel memory. - -Once you have got your journal_t object you need to 'mount' or load the -journal file. The journalling layer expects the space for the journal -was already allocated and initialized properly by the userspace tools. -When loading the journal you must call :c:func:`jbd2_journal_load` to process -journal contents. If the client file system detects the journal contents -does not need to be processed (or even need not have valid contents), it -may call :c:func:`jbd2_journal_wipe` to clear the journal contents before -calling :c:func:`jbd2_journal_load`. - -Note that jbd2_journal_wipe(..,0) calls -:c:func:`jbd2_journal_skip_recovery` for you if it detects any outstanding -transactions in the journal and similarly :c:func:`jbd2_journal_load` will -call :c:func:`jbd2_journal_recover` if necessary. I would advise reading -:c:func:`ext4_load_journal` in fs/ext4/super.c for examples on this stage. - -Now you can go ahead and start modifying the underlying filesystem. -Almost. - -You still need to actually journal your filesystem changes, this is done -by wrapping them into transactions. Additionally you also need to wrap -the modification of each of the buffers with calls to the journal layer, -so it knows what the modifications you are actually making are. To do -this use :c:func:`jbd2_journal_start` which returns a transaction handle. - -:c:func:`jbd2_journal_start` and its counterpart :c:func:`jbd2_journal_stop`, -which indicates the end of a transaction are nestable calls, so you can -reenter a transaction if necessary, but remember you must call -:c:func:`jbd2_journal_stop` the same number of times as -:c:func:`jbd2_journal_start` before the transaction is completed (or more -accurately leaves the update phase). Ext4/VFS makes use of this feature to -simplify handling of inode dirtying, quota support, etc. - -Inside each transaction you need to wrap the modifications to the -individual buffers (blocks). Before you start to modify a buffer you -need to call :c:func:`jbd2_journal_get_create_access()` / -:c:func:`jbd2_journal_get_write_access()` / -:c:func:`jbd2_journal_get_undo_access()` as appropriate, this allows the -journalling layer to copy the unmodified -data if it needs to. After all the buffer may be part of a previously -uncommitted transaction. At this point you are at last ready to modify a -buffer, and once you are have done so you need to call -:c:func:`jbd2_journal_dirty_metadata`. Or if you've asked for access to a -buffer you now know is now longer required to be pushed back on the -device you can call :c:func:`jbd2_journal_forget` in much the same way as you -might have used :c:func:`bforget` in the past. - -A :c:func:`jbd2_journal_flush` may be called at any time to commit and -checkpoint all your transactions. - -Then at umount time , in your :c:func:`put_super` you can then call -:c:func:`jbd2_journal_destroy` to clean up your in-core journal object. - -Unfortunately there a couple of ways the journal layer can cause a -deadlock. The first thing to note is that each task can only have a -single outstanding transaction at any one time, remember nothing commits -until the outermost :c:func:`jbd2_journal_stop`. This means you must complete -the transaction at the end of each file/inode/address etc. operation you -perform, so that the journalling system isn't re-entered on another -journal. Since transactions can't be nested/batched across differing -journals, and another filesystem other than yours (say ext4) may be -modified in a later syscall. - -The second case to bear in mind is that :c:func:`jbd2_journal_start` can block -if there isn't enough space in the journal for your transaction (based -on the passed nblocks param) - when it blocks it merely(!) needs to wait -for transactions to complete and be committed from other tasks, so -essentially we are waiting for :c:func:`jbd2_journal_stop`. So to avoid -deadlocks you must treat :c:func:`jbd2_journal_start` / -:c:func:`jbd2_journal_stop` as if they were semaphores and include them in -your semaphore ordering rules to prevent -deadlocks. Note that :c:func:`jbd2_journal_extend` has similar blocking -behaviour to :c:func:`jbd2_journal_start` so you can deadlock here just as -easily as on :c:func:`jbd2_journal_start`. - -Try to reserve the right number of blocks the first time. ;-). This will -be the maximum number of blocks you are going to touch in this -transaction. I advise having a look at at least ext4_jbd.h to see the -basis on which ext4 uses to make these decisions. - -Another wriggle to watch out for is your on-disk block allocation -strategy. Why? Because, if you do a delete, you need to ensure you -haven't reused any of the freed blocks until the transaction freeing -these blocks commits. If you reused these blocks and crash happens, -there is no way to restore the contents of the reallocated blocks at the -end of the last fully committed transaction. One simple way of doing -this is to mark blocks as free in internal in-memory block allocation -structures only after the transaction freeing them commits. Ext4 uses -journal commit callback for this purpose. - -With journal commit callbacks you can ask the journalling layer to call -a callback function when the transaction is finally committed to disk, -so that you can do some of your own management. You ask the journalling -layer for calling the callback by simply setting -``journal->j_commit_callback`` function pointer and that function is -called after each transaction commit. You can also use -``transaction->t_private_list`` for attaching entries to a transaction -that need processing when the transaction commits. - -JBD2 also provides a way to block all transaction updates via -:c:func:`jbd2_journal_lock_updates()` / -:c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a -window with a clean and stable fs for a moment. E.g. - -:: - - - jbd2_journal_lock_updates() //stop new stuff happening.. - jbd2_journal_flush() // checkpoint everything. - ..do stuff on stable fs - jbd2_journal_unlock_updates() // carry on with filesystem use. - -The opportunities for abuse and DOS attacks with this should be obvious, -if you allow unprivileged userspace to trigger codepaths containing -these calls. - -Summary -~~~~~~~ - -Using the journal is a matter of wrapping the different context changes, -being each mount, each modification (transaction) and each changed -buffer to tell the journalling layer about them. - -Data Types ----------- - -The journalling layer uses typedefs to 'hide' the concrete definitions -of the structures used. As a client of the JBD2 layer you can just rely -on the using the pointer as a magic cookie of some sort. Obviously the -hiding is not enforced as this is 'C'. - -Structures -~~~~~~~~~~ - -.. kernel-doc:: include/linux/jbd2.h - :internal: - -Functions ---------- - -The functions here are split into two groups those that affect a journal -as a whole, and those which are used to manage transactions - -Journal Level -~~~~~~~~~~~~~ - -.. kernel-doc:: fs/jbd2/journal.c - :export: - -.. kernel-doc:: fs/jbd2/recovery.c - :internal: - -Transasction Level -~~~~~~~~~~~~~~~~~~ - -.. kernel-doc:: fs/jbd2/transaction.c - -See also --------- - -`Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen -Tweedie `__ - -`Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen -Tweedie `__ - -splice API -========== - -splice is a method for moving blocks of data around inside the kernel, -without continually transferring them between the kernel and user space. - -.. kernel-doc:: fs/splice.c - -pipes API -========= - -Pipe interfaces are all for in-kernel (builtin image) use. They are not -exported for use by modules. - -.. kernel-doc:: include/linux/pipe_fs_i.h - :internal: - -.. kernel-doc:: fs/pipe.c - -Encryption API -============== - -A library which filesystems can hook into to support transparent -encryption of files and directories. +Documentation for the support code within the filesystem layer for use in +filesystem implementations. .. toctree:: - :maxdepth: 2 - - fscrypt - -Pathname lookup -=============== - - -This write-up is based on three articles published at lwn.net: + :maxdepth: 2 -- Pathname lookup in Linux -- RCU-walk: faster pathname lookup in Linux -- A walk among the symlinks + journalling + fscrypt -Written by Neil Brown with help from Al Viro and Jon Corbet. -It has subsequently been updated to reflect changes in the kernel -including: +Filesystem-specific documentation +================================= -- per-directory parallel name lookup. +Documentation for individual filesystem types can be found here. .. toctree:: :maxdepth: 2 - path-lookup.rst - -binderfs -======== - -.. toctree:: - binderfs.rst diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst new file mode 100644 index 000000000000..58ce6b395206 --- /dev/null +++ b/Documentation/filesystems/journalling.rst @@ -0,0 +1,184 @@ +The Linux Journalling API +========================= + +Overview +-------- + +Details +~~~~~~~ + +The journalling layer is easy to use. You need to first of all create a +journal_t data structure. There are two calls to do this dependent on +how you decide to allocate the physical media on which the journal +resides. The :c:func:`jbd2_journal_init_inode` call is for journals stored in +filesystem inodes, or the :c:func:`jbd2_journal_init_dev` call can be used +for journal stored on a raw device (in a continuous range of blocks). A +journal_t is a typedef for a struct pointer, so when you are finally +finished make sure you call :c:func:`jbd2_journal_destroy` on it to free up +any used kernel memory. + +Once you have got your journal_t object you need to 'mount' or load the +journal file. The journalling layer expects the space for the journal +was already allocated and initialized properly by the userspace tools. +When loading the journal you must call :c:func:`jbd2_journal_load` to process +journal contents. If the client file system detects the journal contents +does not need to be processed (or even need not have valid contents), it +may call :c:func:`jbd2_journal_wipe` to clear the journal contents before +calling :c:func:`jbd2_journal_load`. + +Note that jbd2_journal_wipe(..,0) calls +:c:func:`jbd2_journal_skip_recovery` for you if it detects any outstanding +transactions in the journal and similarly :c:func:`jbd2_journal_load` will +call :c:func:`jbd2_journal_recover` if necessary. I would advise reading +:c:func:`ext4_load_journal` in fs/ext4/super.c for examples on this stage. + +Now you can go ahead and start modifying the underlying filesystem. +Almost. + +You still need to actually journal your filesystem changes, this is done +by wrapping them into transactions. Additionally you also need to wrap +the modification of each of the buffers with calls to the journal layer, +so it knows what the modifications you are actually making are. To do +this use :c:func:`jbd2_journal_start` which returns a transaction handle. + +:c:func:`jbd2_journal_start` and its counterpart :c:func:`jbd2_journal_stop`, +which indicates the end of a transaction are nestable calls, so you can +reenter a transaction if necessary, but remember you must call +:c:func:`jbd2_journal_stop` the same number of times as +:c:func:`jbd2_journal_start` before the transaction is completed (or more +accurately leaves the update phase). Ext4/VFS makes use of this feature to +simplify handling of inode dirtying, quota support, etc. + +Inside each transaction you need to wrap the modifications to the +individual buffers (blocks). Before you start to modify a buffer you +need to call :c:func:`jbd2_journal_get_create_access()` / +:c:func:`jbd2_journal_get_write_access()` / +:c:func:`jbd2_journal_get_undo_access()` as appropriate, this allows the +journalling layer to copy the unmodified +data if it needs to. After all the buffer may be part of a previously +uncommitted transaction. At this point you are at last ready to modify a +buffer, and once you are have done so you need to call +:c:func:`jbd2_journal_dirty_metadata`. Or if you've asked for access to a +buffer you now know is now longer required to be pushed back on the +device you can call :c:func:`jbd2_journal_forget` in much the same way as you +might have used :c:func:`bforget` in the past. + +A :c:func:`jbd2_journal_flush` may be called at any time to commit and +checkpoint all your transactions. + +Then at umount time , in your :c:func:`put_super` you can then call +:c:func:`jbd2_journal_destroy` to clean up your in-core journal object. + +Unfortunately there a couple of ways the journal layer can cause a +deadlock. The first thing to note is that each task can only have a +single outstanding transaction at any one time, remember nothing commits +until the outermost :c:func:`jbd2_journal_stop`. This means you must complete +the transaction at the end of each file/inode/address etc. operation you +perform, so that the journalling system isn't re-entered on another +journal. Since transactions can't be nested/batched across differing +journals, and another filesystem other than yours (say ext4) may be +modified in a later syscall. + +The second case to bear in mind is that :c:func:`jbd2_journal_start` can block +if there isn't enough space in the journal for your transaction (based +on the passed nblocks param) - when it blocks it merely(!) needs to wait +for transactions to complete and be committed from other tasks, so +essentially we are waiting for :c:func:`jbd2_journal_stop`. So to avoid +deadlocks you must treat :c:func:`jbd2_journal_start` / +:c:func:`jbd2_journal_stop` as if they were semaphores and include them in +your semaphore ordering rules to prevent +deadlocks. Note that :c:func:`jbd2_journal_extend` has similar blocking +behaviour to :c:func:`jbd2_journal_start` so you can deadlock here just as +easily as on :c:func:`jbd2_journal_start`. + +Try to reserve the right number of blocks the first time. ;-). This will +be the maximum number of blocks you are going to touch in this +transaction. I advise having a look at at least ext4_jbd.h to see the +basis on which ext4 uses to make these decisions. + +Another wriggle to watch out for is your on-disk block allocation +strategy. Why? Because, if you do a delete, you need to ensure you +haven't reused any of the freed blocks until the transaction freeing +these blocks commits. If you reused these blocks and crash happens, +there is no way to restore the contents of the reallocated blocks at the +end of the last fully committed transaction. One simple way of doing +this is to mark blocks as free in internal in-memory block allocation +structures only after the transaction freeing them commits. Ext4 uses +journal commit callback for this purpose. + +With journal commit callbacks you can ask the journalling layer to call +a callback function when the transaction is finally committed to disk, +so that you can do some of your own management. You ask the journalling +layer for calling the callback by simply setting +``journal->j_commit_callback`` function pointer and that function is +called after each transaction commit. You can also use +``transaction->t_private_list`` for attaching entries to a transaction +that need processing when the transaction commits. + +JBD2 also provides a way to block all transaction updates via +:c:func:`jbd2_journal_lock_updates()` / +:c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a +window with a clean and stable fs for a moment. E.g. + +:: + + + jbd2_journal_lock_updates() //stop new stuff happening.. + jbd2_journal_flush() // checkpoint everything. + ..do stuff on stable fs + jbd2_journal_unlock_updates() // carry on with filesystem use. + +The opportunities for abuse and DOS attacks with this should be obvious, +if you allow unprivileged userspace to trigger codepaths containing +these calls. + +Summary +~~~~~~~ + +Using the journal is a matter of wrapping the different context changes, +being each mount, each modification (transaction) and each changed +buffer to tell the journalling layer about them. + +Data Types +---------- + +The journalling layer uses typedefs to 'hide' the concrete definitions +of the structures used. As a client of the JBD2 layer you can just rely +on the using the pointer as a magic cookie of some sort. Obviously the +hiding is not enforced as this is 'C'. + +Structures +~~~~~~~~~~ + +.. kernel-doc:: include/linux/jbd2.h + :internal: + +Functions +--------- + +The functions here are split into two groups those that affect a journal +as a whole, and those which are used to manage transactions + +Journal Level +~~~~~~~~~~~~~ + +.. kernel-doc:: fs/jbd2/journal.c + :export: + +.. kernel-doc:: fs/jbd2/recovery.c + :internal: + +Transasction Level +~~~~~~~~~~~~~~~~~~ + +.. kernel-doc:: fs/jbd2/transaction.c + +See also +-------- + +`Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen +Tweedie `__ + +`Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen +Tweedie `__ + diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst index 80e22eda4132..434a07b0002b 100644 --- a/Documentation/filesystems/path-lookup.rst +++ b/Documentation/filesystems/path-lookup.rst @@ -1,3 +1,18 @@ +=============== +Pathname lookup +=============== + +This write-up is based on three articles published at lwn.net: + +- Pathname lookup in Linux +- RCU-walk: faster pathname lookup in Linux +- A walk among the symlinks + +Written by Neil Brown with help from Al Viro and Jon Corbet. +It has subsequently been updated to reflect changes in the kernel +including: + +- per-directory parallel name lookup. Introduction to pathname lookup =============================== diff --git a/Documentation/filesystems/splice.rst b/Documentation/filesystems/splice.rst new file mode 100644 index 000000000000..edd874808472 --- /dev/null +++ b/Documentation/filesystems/splice.rst @@ -0,0 +1,22 @@ +================ +splice and pipes +================ + +splice API +========== + +splice is a method for moving blocks of data around inside the kernel, +without continually transferring them between the kernel and user space. + +.. kernel-doc:: fs/splice.c + +pipes API +========= + +Pipe interfaces are all for in-kernel (builtin image) use. They are not +exported for use by modules. + +.. kernel-doc:: include/linux/pipe_fs_i.h + :internal: + +.. kernel-doc:: fs/pipe.c -- cgit v1.2.3