Diffstat (limited to 'Documentation/device-mapper/cache.txt')
-rw-r--r-- | Documentation/device-mapper/cache.txt | 311 |
1 files changed, 0 insertions, 311 deletions
diff --git a/Documentation/device-mapper/cache.txt b/Documentation/device-mapper/cache.txt
deleted file mode 100644
index 8ae1cf8e94da..000000000000
--- a/Documentation/device-mapper/cache.txt
+++ /dev/null
@@ -1,311 +0,0 @@
-Introduction
-============
-
-dm-cache is a device mapper target written by Joe Thornber, Heinz
-Mauelshagen, and Mike Snitzer.
-
-It aims to improve performance of a block device (eg, a spindle) by
-dynamically migrating some of its data to a faster, smaller device
-(eg, an SSD).
-
-This device-mapper solution allows us to insert this caching at
-different levels of the dm stack, for instance above the data device for
-a thin-provisioning pool. Caching solutions that are integrated more
-closely with the virtual memory system should give better performance.
-
-The target reuses the metadata library used in the thin-provisioning
-library.
-
-The decision as to what data to migrate and when is left to a plug-in
-policy module. Several of these have been written as we experiment,
-and we hope other people will contribute others for specific io
-scenarios (eg. a vm image server).
-
-Glossary
-========
-
-  Migration -  Movement of the primary copy of a logical block from one
-               device to the other.
-  Promotion -  Migration from slow device to fast device.
-  Demotion  -  Migration from fast device to slow device.
-
-The origin device always contains a copy of the logical block, which
-may be out of date or kept in sync with the copy on the cache device
-(depending on policy).
-
-Design
-======
-
-Sub-devices
------------
-
-The target is constructed by passing three devices to it (along with
-other parameters detailed later):
-
-1. An origin device - the big, slow one.
-
-2. A cache device - the small, fast one.
-
-3. A small metadata device - records which blocks are in the cache,
-   which are dirty, and extra hints for use by the policy object.
-   This information could be put on the cache device, but having it
-   separate allows the volume manager to configure it differently,
-   e.g. as a mirror for extra robustness. This metadata device may only
-   be used by a single cache device.
-
-Fixed block size
-----------------
-
-The origin is divided up into blocks of a fixed size. This block size
-is configurable when you first create the cache. Typically we've been
-using block sizes of 256KB - 1024KB. The block size must be between 64
-sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB).
-
-Having a fixed block size simplifies the target a lot. But it is
-something of a compromise. For instance, a small part of a block may be
-getting hit a lot, yet the whole block will be promoted to the cache.
-So large block sizes are bad because they waste cache space. And small
-block sizes are bad because they increase the amount of metadata (both
-in core and on disk).
-
-Cache operating modes
----------------------
-
-The cache has three operating modes: writeback, writethrough and
-passthrough.
-
-If writeback, the default, is selected then a write to a block that is
-cached will go only to the cache and the block will be marked dirty in
-the metadata.
-
-If writethrough is selected then a write to a cached block will not
-complete until it has hit both the origin and cache devices. Clean
-blocks should remain clean.
-
-If passthrough is selected, useful when the cache contents are not known
-to be coherent with the origin device, then all reads are served from
-the origin device (all reads miss the cache) and all writes are
-forwarded to the origin device; additionally, write hits cause cache
-block invalidates. To enable passthrough mode the cache must be clean.
-Passthrough mode allows a cache device to be activated without having to
-worry about coherency. Coherency that exists is maintained, although
-the cache will gradually cool as writes take place. If the coherency of
-the cache can later be verified, or established through use of the
-"invalidate_cblocks" message, the cache device can be transitioned to
-writethrough or writeback mode while still warm. Otherwise, the cache
-contents can be discarded prior to transitioning to the desired
-operating mode.
-
-A simple cleaner policy is provided, which will clean (write back) all
-dirty blocks in a cache. Useful for decommissioning a cache or when
-shrinking a cache. Shrinking the cache's fast device requires all cache
-blocks, in the area of the cache being removed, to be clean. If the
-area being removed from the cache still contains dirty blocks the resize
-will fail. Care must be taken to never reduce the volume used for the
-cache's fast device until the cache is clean. This is of particular
-importance if writeback mode is used. Writethrough and passthrough
-modes already maintain a clean cache. Future support to partially clean
-the cache, above a specified threshold, will allow for keeping the cache
-warm and in writeback mode during resize.
-
-Migration throttling
---------------------
-
-Migrating data between the origin and cache device uses bandwidth.
-The user can set a throttle to prevent more than a certain amount of
-migration occurring at any one time. Currently we're not taking any
-account of normal io traffic going to the devices. More work needs
-doing here to avoid migrating during those peak io moments.
-
-For the time being, a message "migration_threshold <#sectors>"
-can be used to set the maximum number of sectors being migrated,
-the default being 2048 sectors (1MB).
-
-Updating on-disk metadata
--------------------------
-
-On-disk metadata is committed every time a FLUSH or FUA bio is written.
-If no such requests are made then commits will occur every second. This
-means the cache behaves like a physical disk that has a volatile write
-cache. If power is lost you may lose some recent writes. The metadata
-should always be consistent in spite of any crash.
-
-The 'dirty' state for a cache block changes far too frequently for us
-to keep updating it on the fly. So we treat it as a hint. In normal
-operation it will be written when the dm device is suspended. If the
-system crashes all cache blocks will be assumed dirty when restarted.
-
-Per-block policy hints
-----------------------
-
-Policy plug-ins can store a chunk of data per cache block. It's up to
-the policy how big this chunk is, but it should be kept small. Like the
-dirty flags this data is lost if there's a crash so a safe fallback
-value should always be possible.
-
-Policy hints affect performance, not correctness.
-
-Policy messaging
-----------------
-
-Policies will have different tunables, specific to each one, so we
-need a generic way of getting and setting these. Device-mapper
-messages are used. Refer to cache-policies.txt.
-
-Discard bitset resolution
--------------------------
-
-We can avoid copying data during migration if we know the block has
-been discarded. A prime example of this is when mkfs discards the
-whole block device. We store a bitset tracking the discard state of
-blocks. However, we allow this bitset to have a different block size
-from the cache blocks. This is because we need to track the discard
-state for all of the origin device (compare with the dirty bitset
-which is just for the smaller cache device).
-
-Target interface
-================
-
-Constructor
------------
-
- cache <metadata dev> <cache dev> <origin dev> <block size>
-       <#feature args> [<feature arg>]*
-       <policy> <#policy args> [policy args]*
-
- metadata dev    : fast device holding the persistent metadata
- cache dev       : fast device holding cached data blocks
- origin dev      : slow device holding original data blocks
- block size      : cache unit size in sectors
-
- #feature args   : number of feature arguments passed
- feature args    : writethrough or passthrough (The default is writeback.)
-
- policy          : the replacement policy to use
- #policy args    : an even number of arguments corresponding to
-                   key/value pairs passed to the policy
- policy args     : key/value pairs passed to the policy
-                   E.g. 'sequential_threshold 1024'
-                   See cache-policies.txt for details.
-
-Optional feature arguments are:
-   writethrough  : write through caching that prohibits cache block
-                   content from being different from origin block content.
-                   Without this argument, the default behaviour is to write
-                   back cache block contents later for performance reasons,
-                   so they may differ from the corresponding origin blocks.
-
-   passthrough   : a degraded mode useful for various cache coherency
-                   situations (e.g., rolling back snapshots of
-                   underlying storage). Reads and writes always go to
-                   the origin. If a write goes to a cached origin
-                   block, then the cache block is invalidated.
-                   To enable passthrough mode the cache must be clean.
-
-   metadata2     : use version 2 of the metadata. This stores the dirty bits
-                   in a separate btree, which improves speed of shutting
-                   down the cache.
-
-   no_discard_passdown : disable passing down discards from the cache
-                         to the origin's data device.
-
-A policy called 'default' is always registered. This is an alias for
-the policy we currently think is giving best all round performance.
-
-As the default policy could vary between kernels, if you are relying on
-the characteristics of a specific policy, always request it by name.
-
-Status
-------
-
-<metadata block size> <#used metadata blocks>/<#total metadata blocks>
-<cache block size> <#used cache blocks>/<#total cache blocks>
-<#read hits> <#read misses> <#write hits> <#write misses>
-<#demotions> <#promotions> <#dirty> <#features> <features>*
-<#core args> <core args>* <policy name> <#policy args> <policy args>*
-<cache metadata mode>
-
-metadata block size      : Fixed block size for each metadata block in
-                           sectors
-#used metadata blocks    : Number of metadata blocks used
-#total metadata blocks   : Total number of metadata blocks
-cache block size         : Configurable block size for the cache device
-                           in sectors
-#used cache blocks       : Number of blocks resident in the cache
-#total cache blocks      : Total number of cache blocks
-#read hits               : Number of times a READ bio has been mapped
-                           to the cache
-#read misses             : Number of times a READ bio has been mapped
-                           to the origin
-#write hits              : Number of times a WRITE bio has been mapped
-                           to the cache
-#write misses            : Number of times a WRITE bio has been
-                           mapped to the origin
-#demotions               : Number of times a block has been removed
-                           from the cache
-#promotions              : Number of times a block has been moved to
-                           the cache
-#dirty                   : Number of blocks in the cache that differ
-                           from the origin
-#feature args            : Number of feature args to follow
-feature args             : 'writethrough' (optional)
-#core args               : Number of core arguments (must be even)
-core args                : Key/value pairs for tuning the core
-                           e.g. migration_threshold
-policy name              : Name of the policy
-#policy args             : Number of policy arguments to follow (must be even)
-policy args              : Key/value pairs e.g. sequential_threshold
-cache metadata mode      : ro if read-only, rw if read-write
-    In serious cases where even a read-only mode is deemed unsafe
-    no further I/O will be permitted and the status will just
-    contain the string 'Fail'. The userspace recovery tools
-    should then be used.
-needs_check              : 'needs_check' if set, '-' if not set
-    A metadata operation has failed, resulting in the needs_check
-    flag being set in the metadata's superblock. The metadata
-    device must be deactivated and checked/repaired before the
-    cache can be made fully operational again. '-' indicates
-    needs_check is not set.
-
-Messages
---------
-
-Policies will have different tunables, specific to each one, so we
-need a generic way of getting and setting these. Device-mapper
-messages are used. (A sysfs interface would also be possible.)
-
-The message format is:
-
-   <key> <value>
-
-E.g.
-   dmsetup message my_cache 0 sequential_threshold 1024
-
-
-Invalidation is removing an entry from the cache without writing it
-back. Cache blocks can be invalidated via the invalidate_cblocks
-message, which takes an arbitrary number of cblock ranges. Each cblock
-range's end value is "one past the end", meaning 5-10 expresses a range
-of values from 5 to 9. Each cblock must be expressed as a decimal
-value, in the future a variant message that takes cblock ranges
-expressed in hexadecimal may be needed to better support efficient
-invalidation of larger caches. The cache must be in passthrough mode
-when invalidate_cblocks is used.
-
-   invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*
-
-E.g.
-   dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789
-
-Examples
-========
-
-The test suite can be found here:
-
-https://github.com/jthornber/device-mapper-test-suite
-
-dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
-	/dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0'
-dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \
-	/dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \
-	mq 4 sequential_threshold 1024 random_threshold 8'
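
Reader's note (not part of the diff): two rules in the removed text are easy to get wrong by one, namely the block size constraint (64 to 2097152 sectors, in multiples of 64) and the "one past the end" convention for invalidate_cblocks ranges. The following is a minimal Python sketch of both, assuming 512-byte sectors. The helper names `is_valid_block_size` and `expand_cblock_ranges` are hypothetical and do not correspond to any kernel or dmsetup interface; the sketch does no validation beyond what the text states.

```python
SECTOR_BYTES = 512            # assumption: 512-byte device-mapper sectors
MIN_BLOCK_SECTORS = 64        # 32KB, lower bound stated in the text
MAX_BLOCK_SECTORS = 2097152   # 1GB, upper bound stated in the text


def is_valid_block_size(sectors: int) -> bool:
    """Return True if the cache block size satisfies the stated rule:
    between 64 and 2097152 sectors and a multiple of 64 sectors."""
    return (MIN_BLOCK_SECTORS <= sectors <= MAX_BLOCK_SECTORS
            and sectors % MIN_BLOCK_SECTORS == 0)


def expand_cblock_ranges(args):
    """Expand invalidate_cblocks arguments into the cblocks they name.

    Each argument is either a single decimal cblock ("2345") or a range
    "begin-end" whose end value is one past the last cblock covered.
    """
    cblocks = []
    for arg in args:
        if "-" in arg:
            begin, end = (int(part) for part in arg.split("-", 1))
            cblocks.extend(range(begin, end))  # half-open: end is excluded
        else:
            cblocks.append(int(arg))
    return cblocks


if __name__ == "__main__":
    # 512 sectors * 512 bytes = 256KB, one of the typical sizes mentioned.
    print(is_valid_block_size(512), 512 * SECTOR_BYTES // 1024)  # True 256
    print(is_valid_block_size(100))                              # False
    print(expand_cblock_ranges(["5-10"]))                        # [5, 6, 7, 8, 9]
```

Treating the range as half-open maps directly onto Python's `range(begin, end)`, which is why "5-10" yields cblocks 5 through 9, matching the example in the Messages section.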