summaryrefslogtreecommitdiffstats
path: root/Documentation
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2017-09-05 12:19:08 -0700
committerLinus Torvalds <torvalds@linux-foundation.org>2017-09-05 12:19:08 -0700
commit439644096c1a6afb9bd9953130f4444a856f76c5 (patch)
treecdf21533aa0ec95d084988f234186f8a5071d5df /Documentation
parentb42a362e6d10c342004b183defcb9940331b6737 (diff)
parentd97561f461e4cb5b2e170bc33b466cffbddc8719 (diff)
downloadlinux-stable-439644096c1a6afb9bd9953130f4444a856f76c5.tar.gz
linux-stable-439644096c1a6afb9bd9953130f4444a856f76c5.tar.bz2
linux-stable-439644096c1a6afb9bd9953130f4444a856f76c5.zip
Merge tag 'pm-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki: "This time (again) cpufreq gets the majority of changes which mostly are driver updates (including a major consolidation of intel_pstate), some schedutil governor modifications and core cleanups. There also are some changes in the system suspend area, mostly related to diagnostics and debug messages plus some renames of things related to suspend-to-idle. One major change here is that suspend-to-idle is now going to be preferred over S3 on systems where the ACPI tables indicate to do so and provide requsite support (the Low Power Idle S0 _DSM in particular). The system sleep documentation and the tools related to it are updated too. The rest is a few cpuidle changes (nothing major), devfreq updates, generic power domains (genpd) framework updates and a few assorted modifications elsewhere. Specifics: - Drop the P-state selection algorithm based on a PID controller from intel_pstate and make it use the same P-state selection method (based on the CPU load) for all types of systems in the active mode (Rafael Wysocki, Srinivas Pandruvada). - Rework the cpufreq core and governors to make it possible to take cross-CPU utilization updates into account and modify the schedutil governor to actually do so (Viresh Kumar). - Clean up the handling of transition latency information in the cpufreq core and untangle it from the information on which drivers cannot do dynamic frequency switching (Viresh Kumar). - Add support for new SoCs (MT2701/MT7623 and MT7622) to the mediatek cpufreq driver and update its DT bindings (Sean Wang). - Modify the cpufreq dt-platdev driver to autimatically create cpufreq devices for the new (v2) Operating Performance Points (OPP) DT bindings and update its whitelist of supported systems (Viresh Kumar, Shubhrajyoti Datta, Marc Gonzalez, Khiem Nguyen, Finley Xiao). - Add support for Ux500 to the cpufreq-dt driver and drop the obsolete dbx500 cpufreq driver (Linus Walleij, Arnd Bergmann). - Add new SoC (R8A7795) support to the cpufreq rcar driver (Khiem Nguyen). - Fix and clean up assorted issues in the cpufreq drivers and core (Arvind Yadav, Christophe Jaillet, Colin Ian King, Gustavo Silva, Julia Lawall, Leonard Crestez, Rob Herring, Sudeep Holla). - Update the IO-wait boost handling in the schedutil governor to make it less aggressive (Joel Fernandes). - Rework system suspend diagnostics to make it print fewer messages to the kernel log by default, add a sysfs knob to allow more suspend-related messages to be printed and add Low Power S0 Idle constraints checks to the ACPI suspend-to-idle code (Rafael Wysocki, Srinivas Pandruvada). - Prefer suspend-to-idle over S3 on ACPI-based systems with the ACPI_FADT_LOW_POWER_S0 flag set and the Low Power Idle S0 _DSM interface present in the ACPI tables (Rafael Wysocki). - Update documentation related to system sleep and rename a number of items in the code to make it cleare that they are related to suspend-to-idle (Rafael Wysocki). - Export a variable allowing device drivers to check the target system sleep state from the core system suspend code (Florian Fainelli). - Clean up the cpuidle subsystem to handle the polling state on x86 in a more straightforward way and to use %pOF instead of full_name (Rafael Wysocki, Rob Herring). - Update the devfreq framework to fix and clean up a few minor issues (Chanwoo Choi, Rob Herring). - Extend diagnostics in the generic power domains (genpd) framework and clean it up slightly (Thara Gopinath, Rob Herring). - Fix and clean up a couple of issues in the operating performance points (OPP) framework (Viresh Kumar, Waldemar Rymarkiewicz). - Add support for RV1108 to the rockchip-io Adaptive Voltage Scaling (AVS) driver (David Wu). - Fix the usage of notifiers in CPU power management on some platforms (Alex Shi). - Update the pm-graph system suspend/hibernation and boot profiling utility (Todd Brandt). - Make it possible to run the cpupower utility without CPU0 (Prarit Bhargava)" * tag 'pm-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (87 commits) cpuidle: Make drivers initialize polling state cpuidle: Move polling state initialization code to separate file cpuidle: Eliminate the CPUIDLE_DRIVER_STATE_START symbol cpufreq: imx6q: Fix imx6sx low frequency support cpufreq: speedstep-lib: make several arrays static, makes code smaller PM: docs: Delete the obsolete states.txt document PM: docs: Describe high-level PM strategies and sleep states PM / devfreq: Fix memory leak when fail to register device PM / devfreq: Add dependency on PM_OPP PM / devfreq: Move private devfreq_update_stats() into devfreq PM / devfreq: Convert to using %pOF instead of full_name PM / AVS: rockchip-io: add io selectors and supplies for RV1108 cpufreq: ti: Fix 'of_node_put' being called twice in error handling path cpufreq: dt-platdev: Drop few entries from whitelist cpufreq: dt-platdev: Automatically create cpufreq device with OPP v2 ARM: ux500: don't select CPUFREQ_DT cpuidle: Convert to using %pOF instead of full_name cpufreq: Convert to using %pOF instead of full_name PM / Domains: Convert to using %pOF instead of full_name cpufreq: Cap the default transition delay value to 10 ms ...
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/ABI/testing/sysfs-power12
-rw-r--r--Documentation/admin-guide/pm/cpufreq.rst8
-rw-r--r--Documentation/admin-guide/pm/index.rst12
-rw-r--r--Documentation/admin-guide/pm/intel_pstate.rst61
-rw-r--r--Documentation/admin-guide/pm/sleep-states.rst245
-rw-r--r--Documentation/admin-guide/pm/strategies.rst52
-rw-r--r--Documentation/admin-guide/pm/system-wide.rst8
-rw-r--r--Documentation/admin-guide/pm/working-state.rst9
-rw-r--r--Documentation/devicetree/bindings/clock/mt8173-cpu-dvfs.txt83
-rw-r--r--Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek.txt247
-rw-r--r--Documentation/devicetree/bindings/power/rockchip-io-domain.txt2
-rw-r--r--Documentation/power/states.txt125
12 files changed, 586 insertions, 278 deletions
diff --git a/Documentation/ABI/testing/sysfs-power b/Documentation/ABI/testing/sysfs-power
index f523e5a3ac33..713cab1d5f12 100644
--- a/Documentation/ABI/testing/sysfs-power
+++ b/Documentation/ABI/testing/sysfs-power
@@ -273,3 +273,15 @@ Description:
This output is useful for system wakeup diagnostics of spurious
wakeup interrupts.
+
+What: /sys/power/pm_debug_messages
+Date: July 2017
+Contact: Rafael J. Wysocki <rjw@rjwysocki.net>
+Description:
+ The /sys/power/pm_debug_messages file controls the printing
+ of debug messages from the system suspend/hiberbation
+ infrastructure to the kernel log.
+
+ Writing a "1" to this file enables the debug messages and
+ writing a "0" (default) to it disables them. Reads from
+ this file return the current value.
diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
index 7af83a92d2d6..47153e64dfb5 100644
--- a/Documentation/admin-guide/pm/cpufreq.rst
+++ b/Documentation/admin-guide/pm/cpufreq.rst
@@ -479,14 +479,6 @@ This governor exposes the following tunables:
# echo `$(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate
-
-``min_sampling_rate``
- The minimum value of ``sampling_rate``.
-
- Equal to 10000 (10 ms) if :c:macro:`CONFIG_NO_HZ_COMMON` and
- :c:data:`tick_nohz_active` are both set or to 20 times the value of
- :c:data:`jiffies` in microseconds otherwise.
-
``up_threshold``
If the estimated CPU load is above this value (in percent), the governor
will set the frequency to the maximum value allowed for the policy.
diff --git a/Documentation/admin-guide/pm/index.rst b/Documentation/admin-guide/pm/index.rst
index 7f148f76f432..49237ac73442 100644
--- a/Documentation/admin-guide/pm/index.rst
+++ b/Documentation/admin-guide/pm/index.rst
@@ -5,12 +5,6 @@ Power Management
.. toctree::
:maxdepth: 2
- cpufreq
- intel_pstate
-
-.. only:: subproject and html
-
- Indices
- =======
-
- * :ref:`genindex`
+ strategies
+ system-wide
+ working-state
diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst
index 1d6249825efc..d2b6fda3d67b 100644
--- a/Documentation/admin-guide/pm/intel_pstate.rst
+++ b/Documentation/admin-guide/pm/intel_pstate.rst
@@ -167,35 +167,17 @@ is set.
``powersave``
.............
-Without HWP, this P-state selection algorithm generally depends on the
-processor model and/or the system profile setting in the ACPI tables and there
-are two variants of it.
-
-One of them is used with processors from the Atom line and (regardless of the
-processor model) on platforms with the system profile in the ACPI tables set to
-"mobile" (laptops mostly), "tablet", "appliance PC", "desktop", or
-"workstation". It is also used with processors supporting the HWP feature if
-that feature has not been enabled (that is, with the ``intel_pstate=no_hwp``
-argument in the kernel command line). It is similar to the algorithm
+Without HWP, this P-state selection algorithm is similar to the algorithm
implemented by the generic ``schedutil`` scaling governor except that the
utilization metric used by it is based on numbers coming from feedback
registers of the CPU. It generally selects P-states proportional to the
-current CPU utilization, so it is referred to as the "proportional" algorithm.
-
-The second variant of the ``powersave`` P-state selection algorithm, used in all
-of the other cases (generally, on processors from the Core line, so it is
-referred to as the "Core" algorithm), is based on the values read from the APERF
-and MPERF feedback registers and the previously requested target P-state.
-It does not really take CPU utilization into account explicitly, but as a rule
-it causes the CPU P-state to ramp up very quickly in response to increased
-utilization which is generally desirable in server environments.
-
-Regardless of the variant, this algorithm is run by the driver's utilization
-update callback for the given CPU when it is invoked by the CPU scheduler, but
-not more often than every 10 ms (that can be tweaked via ``debugfs`` in `this
-particular case <Tuning Interface in debugfs_>`_). Like in the ``performance``
-case, the hardware configuration is not touched if the new P-state turns out to
-be the same as the current one.
+current CPU utilization.
+
+This algorithm is run by the driver's utilization update callback for the
+given CPU when it is invoked by the CPU scheduler, but not more often than
+every 10 ms. Like in the ``performance`` case, the hardware configuration
+is not touched if the new P-state turns out to be the same as the current
+one.
This is the default P-state selection algorithm if the
:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option
@@ -720,34 +702,7 @@ P-state is called, the ``ftrace`` filter can be set to to
gnome-shell-3409 [001] ..s. 2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func
<idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func
-Tuning Interface in ``debugfs``
--------------------------------
-
-The ``powersave`` algorithm provided by ``intel_pstate`` for `the Core line of
-processors in the active mode <powersave_>`_ is based on a `PID controller`_
-whose parameters were chosen to address a number of different use cases at the
-same time. However, it still is possible to fine-tune it to a specific workload
-and the ``debugfs`` interface under ``/sys/kernel/debug/pstate_snb/`` is
-provided for this purpose. [Note that the ``pstate_snb`` directory will be
-present only if the specific P-state selection algorithm matching the interface
-in it actually is in use.]
-
-The following files present in that directory can be used to modify the PID
-controller parameters at run time:
-
-| ``deadband``
-| ``d_gain_pct``
-| ``i_gain_pct``
-| ``p_gain_pct``
-| ``sample_rate_ms``
-| ``setpoint``
-
-Note, however, that achieving desirable results this way generally requires
-expert-level understanding of the power vs performance tradeoff, so extra care
-is recommended when attempting to do that.
-
.. _LCEU2015: http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf
.. _SDM: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html
.. _ACPI specification: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
-.. _PID controller: https://en.wikipedia.org/wiki/PID_controller
diff --git a/Documentation/admin-guide/pm/sleep-states.rst b/Documentation/admin-guide/pm/sleep-states.rst
new file mode 100644
index 000000000000..1e5c0f00cb2f
--- /dev/null
+++ b/Documentation/admin-guide/pm/sleep-states.rst
@@ -0,0 +1,245 @@
+===================
+System Sleep States
+===================
+
+::
+
+ Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
+Sleep states are global low-power states of the entire system in which user
+space code cannot be executed and the overall system activity is significantly
+reduced.
+
+
+Sleep States That Can Be Supported
+==================================
+
+Depending on its configuration and the capabilities of the platform it runs on,
+the Linux kernel can support up to four system sleep states, includig
+hibernation and up to three variants of system suspend. The sleep states that
+can be supported by the kernel are listed below.
+
+.. _s2idle:
+
+Suspend-to-Idle
+---------------
+
+This is a generic, pure software, light-weight variant of system suspend (also
+referred to as S2I or S2Idle). It allows more energy to be saved relative to
+runtime idle by freezing user space, suspending the timekeeping and putting all
+I/O devices into low-power states (possibly lower-power than available in the
+working state), such that the processors can spend time in their deepest idle
+states while the system is suspended.
+
+The system is woken up from this state by in-band interrupts, so theoretically
+any devices that can cause interrupts to be generated in the working state can
+also be set up as wakeup devices for S2Idle.
+
+This state can be used on platforms without support for :ref:`standby <standby>`
+or :ref:`suspend-to-RAM <s2ram>`, or it can be used in addition to any of the
+deeper system suspend variants to provide reduced resume latency. It is always
+supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration option is set.
+
+.. _standby:
+
+Standby
+-------
+
+This state, if supported, offers moderate, but real, energy savings, while
+providing a relatively straightforward transition back to the working state. No
+operating state is lost (the system core logic retains power), so the system can
+go back to where it left off easily enough.
+
+In addition to freezing user space, suspending the timekeeping and putting all
+I/O devices into low-power states, which is done for :ref:`suspend-to-idle
+<s2idle>` too, nonboot CPUs are taken offline and all low-level system functions
+are suspended during transitions into this state. For this reason, it should
+allow more energy to be saved relative to :ref:`suspend-to-idle <s2idle>`, but
+the resume latency will generally be greater than for that state.
+
+The set of devices that can wake up the system from this state usually is
+reduced relative to :ref:`suspend-to-idle <s2idle>` and it may be necessary to
+rely on the platform for setting up the wakeup functionality as appropriate.
+
+This state is supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration
+option is set and the support for it is registered by the platform with the
+core system suspend subsystem. On ACPI-based systems this state is mapped to
+the S1 system state defined by ACPI.
+
+.. _s2ram:
+
+Suspend-to-RAM
+--------------
+
+This state (also referred to as STR or S2RAM), if supported, offers significant
+energy savings as everything in the system is put into a low-power state, except
+for memory, which should be placed into the self-refresh mode to retain its
+contents. All of the steps carried out when entering :ref:`standby <standby>`
+are also carried out during transitions to S2RAM. Additional operations may
+take place depending on the platform capabilities. In particular, on ACPI-based
+systems the kernel passes control to the platform firmware (BIOS) as the last
+step during S2RAM transitions and that usually results in powering down some
+more low-level components that are not directly controlled by the kernel.
+
+The state of devices and CPUs is saved and held in memory. All devices are
+suspended and put into low-power states. In many cases, all peripheral buses
+lose power when entering S2RAM, so devices must be able to handle the transition
+back to the "on" state.
+
+On ACPI-based systems S2RAM requires some minimal boot-strapping code in the
+platform firmware to resume the system from it. This may be the case on other
+platforms too.
+
+The set of devices that can wake up the system from S2RAM usually is reduced
+relative to :ref:`suspend-to-idle <s2idle>` and :ref:`standby <standby>` and it
+may be necessary to rely on the platform for setting up the wakeup functionality
+as appropriate.
+
+S2RAM is supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration option
+is set and the support for it is registered by the platform with the core system
+suspend subsystem. On ACPI-based systems it is mapped to the S3 system state
+defined by ACPI.
+
+.. _hibernation:
+
+Hibernation
+-----------
+
+This state (also referred to as Suspend-to-Disk or STD) offers the greatest
+energy savings and can be used even in the absence of low-level platform support
+for system suspend. However, it requires some low-level code for resuming the
+system to be present for the underlying CPU architecture.
+
+Hibernation is significantly different from any of the system suspend variants.
+It takes three system state changes to put it into hibernation and two system
+state changes to resume it.
+
+First, when hibernation is triggered, the kernel stops all system activity and
+creates a snapshot image of memory to be written into persistent storage. Next,
+the system goes into a state in which the snapshot image can be saved, the image
+is written out and finally the system goes into the target low-power state in
+which power is cut from almost all of its hardware components, including memory,
+except for a limited set of wakeup devices.
+
+Once the snapshot image has been written out, the system may either enter a
+special low-power state (like ACPI S4), or it may simply power down itself.
+Powering down means minimum power draw and it allows this mechanism to work on
+any system. However, entering a special low-power state may allow additional
+means of system wakeup to be used (e.g. pressing a key on the keyboard or
+opening a laptop lid).
+
+After wakeup, control goes to the platform firmware that runs a boot loader
+which boots a fresh instance of the kernel (control may also go directly to
+the boot loader, depending on the system configuration, but anyway it causes
+a fresh instance of the kernel to be booted). That new instance of the kernel
+(referred to as the ``restore kernel``) looks for a hibernation image in
+persistent storage and if one is found, it is loaded into memory. Next, all
+activity in the system is stopped and the restore kernel overwrites itself with
+the image contents and jumps into a special trampoline area in the original
+kernel stored in the image (referred to as the ``image kernel``), which is where
+the special architecture-specific low-level code is needed. Finally, the
+image kernel restores the system to the pre-hibernation state and allows user
+space to run again.
+
+Hibernation is supported if the :c:macro:`CONFIG_HIBERNATION` kernel
+configuration option is set. However, this option can only be set if support
+for the given CPU architecture includes the low-level code for system resume.
+
+
+Basic ``sysfs`` Interfaces for System Suspend and Hibernation
+=============================================================
+
+The following files located in the :file:`/sys/power/` directory can be used by
+user space for sleep states control.
+
+``state``
+ This file contains a list of strings representing sleep states supported
+ by the kernel. Writing one of these strings into it causes the kernel
+ to start a transition of the system into the sleep state represented by
+ that string.
+
+ In particular, the strings "disk", "freeze" and "standby" represent the
+ :ref:`hibernation <hibernation>`, :ref:`suspend-to-idle <s2idle>` and
+ :ref:`standby <standby>` sleep states, respectively. The string "mem"
+ is interpreted in accordance with the contents of the ``mem_sleep`` file
+ described below.
+
+ If the kernel does not support any system sleep states, this file is
+ not present.
+
+``mem_sleep``
+ This file contains a list of strings representing supported system
+ suspend variants and allows user space to select the variant to be
+ associated with the "mem" string in the ``state`` file described above.
+
+ The strings that may be present in this file are "s2idle", "shallow"
+ and "deep". The string "s2idle" always represents :ref:`suspend-to-idle
+ <s2idle>` and, by convention, "shallow" and "deep" represent
+ :ref:`standby <standby>` and :ref:`suspend-to-RAM <s2ram>`,
+ respectively.
+
+ Writing one of the listed strings into this file causes the system
+ suspend variant represented by it to be associated with the "mem" string
+ in the ``state`` file. The string representing the suspend variant
+ currently associated with the "mem" string in the ``state`` file
+ is listed in square brackets.
+
+ If the kernel does not support system suspend, this file is not present.
+
+``disk``
+ This file contains a list of strings representing different operations
+ that can be carried out after the hibernation image has been saved. The
+ possible options are as follows:
+
+ ``platform``
+ Put the system into a special low-power state (e.g. ACPI S4) to
+ make additional wakeup options available and possibly allow the
+ platform firmware to take a simplified initialization path after
+ wakeup.
+
+ ``shutdown``
+ Power off the system.
+
+ ``reboot``
+ Reboot the system (useful for diagnostics mostly).
+
+ ``suspend``
+ Hybrid system suspend. Put the system into the suspend sleep
+ state selected through the ``mem_sleep`` file described above.
+ If the system is successfully woken up from that state, discard
+ the hibernation image and continue. Otherwise, use the image
+ to restore the previous state of the system.
+
+ ``test_resume``
+ Diagnostic operation. Load the image as though the system had
+ just woken up from hibernation and the currently running kernel
+ instance was a restore kernel and follow up with full system
+ resume.
+
+ Writing one of the listed strings into this file causes the option
+ represented by it to be selected.
+
+ The currently selected option is shown in square brackets which means
+ that the operation represented by it will be carried out after creating
+ and saving the image next time hibernation is triggered by writing
+ ``disk`` to :file:`/sys/power/state`.
+
+ If the kernel does not support hibernation, this file is not present.
+
+According to the above, there are two ways to make the system go into the
+:ref:`suspend-to-idle <s2idle>` state. The first one is to write "freeze"
+directly to :file:`/sys/power/state`. The second one is to write "s2idle" to
+:file:`/sys/power/mem_sleep` and then to write "mem" to
+:file:`/sys/power/state`. Likewise, there are two ways to make the system go
+into the :ref:`standby <standby>` state (the strings to write to the control
+files in that case are "standby" or "shallow" and "mem", respectively) if that
+state is supported by the platform. However, there is only one way to make the
+system go into the :ref:`suspend-to-RAM <s2ram>` state (write "deep" into
+:file:`/sys/power/mem_sleep` and "mem" into :file:`/sys/power/state`).
+
+The default suspend variant (ie. the one to be used without writing anything
+into :file:`/sys/power/mem_sleep`) is either "deep" (on the majority of systems
+supporting :ref:`suspend-to-RAM <s2ram>`) or "s2idle", but it can be overridden
+by the value of the "mem_sleep_default" parameter in the kernel command line.
+On some ACPI-based systems, depending on the information in the ACPI tables, the
+default may be "s2idle" even if :ref:`suspend-to-RAM <s2ram>` is supported.
diff --git a/Documentation/admin-guide/pm/strategies.rst b/Documentation/admin-guide/pm/strategies.rst
new file mode 100644
index 000000000000..afe4d3f831fe
--- /dev/null
+++ b/Documentation/admin-guide/pm/strategies.rst
@@ -0,0 +1,52 @@
+===========================
+Power Management Strategies
+===========================
+
+::
+
+ Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
+The Linux kernel supports two major high-level power management strategies.
+
+One of them is based on using global low-power states of the whole system in
+which user space code cannot be executed and the overall system activity is
+significantly reduced, referred to as :doc:`sleep states <sleep-states>`. The
+kernel puts the system into one of these states when requested by user space
+and the system stays in it until a special signal is received from one of
+designated devices, triggering a transition to the ``working state`` in which
+user space code can run. Because sleep states are global and the whole system
+is affected by the state changes, this strategy is referred to as the
+:doc:`system-wide power management <system-wide>`.
+
+The other strategy, referred to as the :doc:`working-state power management
+<working-state>`, is based on adjusting the power states of individual hardware
+components of the system, as needed, in the working state. In consequence, if
+this strategy is in use, the working state of the system usually does not
+correspond to any particular physical configuration of it, but can be treated as
+a metastate covering a range of different power states of the system in which
+the individual components of it can be either ``active`` (in use) or
+``inactive`` (idle). If they are active, they have to be in power states
+allowing them to process data and to be accessed by software. In turn, if they
+are inactive, ideally, they should be in low-power states in which they may not
+be accessible.
+
+If all of the system components are active, the system as a whole is regarded as
+"runtime active" and that situation typically corresponds to the maximum power
+draw (or maximum energy usage) of it. If all of them are inactive, the system
+as a whole is regarded as "runtime idle" which may be very close to a sleep
+state from the physical system configuration and power draw perspective, but
+then it takes much less time and effort to start executing user space code than
+for the same system in a sleep state. However, transitions from sleep states
+back to the working state can only be started by a limited set of devices, so
+typically the system can spend much more time in a sleep state than it can be
+runtime idle in one go. For this reason, systems usually use less energy in
+sleep states than when they are runtime idle most of the time.
+
+Moreover, the two power management strategies address different usage scenarios.
+Namely, if the user indicates that the system will not be in use going forward,
+for example by closing its lid (if the system is a laptop), it probably should
+go into a sleep state at that point. On the other hand, if the user simply goes
+away from the laptop keyboard, it probably should stay in the working state and
+use the working-state power management in case it becomes idle, because the user
+may come back to it at any time and then may want the system to be immediately
+accessible.
diff --git a/Documentation/admin-guide/pm/system-wide.rst b/Documentation/admin-guide/pm/system-wide.rst
new file mode 100644
index 000000000000..0c81e4c5de39
--- /dev/null
+++ b/Documentation/admin-guide/pm/system-wide.rst
@@ -0,0 +1,8 @@
+============================
+System-Wide Power Management
+============================
+
+.. toctree::
+ :maxdepth: 2
+
+ sleep-states
diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst
new file mode 100644
index 000000000000..fa01bf083dfe
--- /dev/null
+++ b/Documentation/admin-guide/pm/working-state.rst
@@ -0,0 +1,9 @@
+==============================
+Working-State Power Management
+==============================
+
+.. toctree::
+ :maxdepth: 2
+
+ cpufreq
+ intel_pstate
diff --git a/Documentation/devicetree/bindings/clock/mt8173-cpu-dvfs.txt b/Documentation/devicetree/bindings/clock/mt8173-cpu-dvfs.txt
deleted file mode 100644
index 52b457c23eed..000000000000
--- a/Documentation/devicetree/bindings/clock/mt8173-cpu-dvfs.txt
+++ /dev/null
@@ -1,83 +0,0 @@
-Device Tree Clock bindins for CPU DVFS of Mediatek MT8173 SoC
-
-Required properties:
-- clocks: A list of phandle + clock-specifier pairs for the clocks listed in clock names.
-- clock-names: Should contain the following:
- "cpu" - The multiplexer for clock input of CPU cluster.
- "intermediate" - A parent of "cpu" clock which is used as "intermediate" clock
- source (usually MAINPLL) when the original CPU PLL is under
- transition and not stable yet.
- Please refer to Documentation/devicetree/bindings/clk/clock-bindings.txt for
- generic clock consumer properties.
-- proc-supply: Regulator for Vproc of CPU cluster.
-
-Optional properties:
-- sram-supply: Regulator for Vsram of CPU cluster. When present, the cpufreq driver
- needs to do "voltage tracking" to step by step scale up/down Vproc and
- Vsram to fit SoC specific needs. When absent, the voltage scaling
- flow is handled by hardware, hence no software "voltage tracking" is
- needed.
-
-Example:
---------
- cpu0: cpu@0 {
- device_type = "cpu";
- compatible = "arm,cortex-a53";
- reg = <0x000>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_SLEEP_0>;
- clocks = <&infracfg CLK_INFRA_CA53SEL>,
- <&apmixedsys CLK_APMIXED_MAINPLL>;
- clock-names = "cpu", "intermediate";
- };
-
- cpu1: cpu@1 {
- device_type = "cpu";
- compatible = "arm,cortex-a53";
- reg = <0x001>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_SLEEP_0>;
- clocks = <&infracfg CLK_INFRA_CA53SEL>,
- <&apmixedsys CLK_APMIXED_MAINPLL>;
- clock-names = "cpu", "intermediate";
- };
-
- cpu2: cpu@100 {
- device_type = "cpu";
- compatible = "arm,cortex-a57";
- reg = <0x100>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_SLEEP_0>;
- clocks = <&infracfg CLK_INFRA_CA57SEL>,
- <&apmixedsys CLK_APMIXED_MAINPLL>;
- clock-names = "cpu", "intermediate";
- };
-
- cpu3: cpu@101 {
- device_type = "cpu";
- compatible = "arm,cortex-a57";
- reg = <0x101>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_SLEEP_0>;
- clocks = <&infracfg CLK_INFRA_CA57SEL>,
- <&apmixedsys CLK_APMIXED_MAINPLL>;
- clock-names = "cpu", "intermediate";
- };
-
- &cpu0 {
- proc-supply = <&mt6397_vpca15_reg>;
- };
-
- &cpu1 {
- proc-supply = <&mt6397_vpca15_reg>;
- };
-
- &cpu2 {
- proc-supply = <&da9211_vcpu_reg>;
- sram-supply = <&mt6397_vsramca7_reg>;
- };
-
- &cpu3 {
- proc-supply = <&da9211_vcpu_reg>;
- sram-supply = <&mt6397_vsramca7_reg>;
- };
diff --git a/Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek.txt b/Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek.txt
new file mode 100644
index 000000000000..f6403089edcf
--- /dev/null
+++ b/Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek.txt
@@ -0,0 +1,247 @@
+Binding for MediaTek's CPUFreq driver
+=====================================
+
+Required properties:
+- clocks: A list of phandle + clock-specifier pairs for the clocks listed in clock names.
+- clock-names: Should contain the following:
+ "cpu" - The multiplexer for clock input of CPU cluster.
+ "intermediate" - A parent of "cpu" clock which is used as "intermediate" clock
+ source (usually MAINPLL) when the original CPU PLL is under
+ transition and not stable yet.
+ Please refer to Documentation/devicetree/bindings/clk/clock-bindings.txt for
+ generic clock consumer properties.
+- operating-points-v2: Please refer to Documentation/devicetree/bindings/opp/opp.txt
+ for detail.
+- proc-supply: Regulator for Vproc of CPU cluster.
+
+Optional properties:
+- sram-supply: Regulator for Vsram of CPU cluster. When present, the cpufreq driver
+ needs to do "voltage tracking" to step by step scale up/down Vproc and
+ Vsram to fit SoC specific needs. When absent, the voltage scaling
+ flow is handled by hardware, hence no software "voltage tracking" is
+ needed.
+- #cooling-cells:
+- cooling-min-level:
+- cooling-max-level:
+ Please refer to Documentation/devicetree/bindings/thermal/thermal.txt
+ for detail.
+
+Example 1 (MT7623 SoC):
+
+ cpu_opp_table: opp_table {
+ compatible = "operating-points-v2";
+ opp-shared;
+
+ opp-598000000 {
+ opp-hz = /bits/ 64 <598000000>;
+ opp-microvolt = <1050000>;
+ };
+
+ opp-747500000 {
+ opp-hz = /bits/ 64 <747500000>;
+ opp-microvolt = <1050000>;
+ };
+
+ opp-1040000000 {
+ opp-hz = /bits/ 64 <1040000000>;
+ opp-microvolt = <1150000>;
+ };
+
+ opp-1196000000 {
+ opp-hz = /bits/ 64 <1196000000>;
+ opp-microvolt = <1200000>;
+ };
+
+ opp-1300000000 {
+ opp-hz = /bits/ 64 <1300000000>;
+ opp-microvolt = <1300000>;
+ };
+ };
+
+ cpu0: cpu@0 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a7";
+ reg = <0x0>;
+ clocks = <&infracfg CLK_INFRA_CPUSEL>,
+ <&apmixedsys CLK_APMIXED_MAINPLL>;
+ clock-names = "cpu", "intermediate";
+ operating-points-v2 = <&cpu_opp_table>;
+ #cooling-cells = <2>;
+ cooling-min-level = <0>;
+ cooling-max-level = <7>;
+ };
+ cpu@1 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a7";
+ reg = <0x1>;
+ operating-points-v2 = <&cpu_opp_table>;
+ };
+ cpu@2 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a7";
+ reg = <0x2>;
+ operating-points-v2 = <&cpu_opp_table>;
+ };
+ cpu@3 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a7";
+ reg = <0x3>;
+ operating-points-v2 = <&cpu_opp_table>;
+ };
+
+Example 2 (MT8173 SoC):
+ cpu_opp_table_a: opp_table_a {
+ compatible = "operating-points-v2";
+ opp-shared;
+
+ opp-507000000 {
+ opp-hz = /bits/ 64 <507000000>;
+ opp-microvolt = <859000>;
+ };
+
+ opp-702000000 {
+ opp-hz = /bits/ 64 <702000000>;
+ opp-microvolt = <908000>;
+ };
+
+ opp-1001000000 {
+ opp-hz = /bits/ 64 <1001000000>;
+ opp-microvolt = <983000>;
+ };
+
+ opp-1105000000 {
+ opp-hz = /bits/ 64 <1105000000>;
+ opp-microvolt = <1009000>;
+ };
+
+ opp-1183000000 {
+ opp-hz = /bits/ 64 <1183000000>;
+ opp-microvolt = <1028000>;
+ };
+
+ opp-1404000000 {
+ opp-hz = /bits/ 64 <1404000000>;
+ opp-microvolt = <1083000>;
+ };
+
+ opp-1508000000 {
+ opp-hz = /bits/ 64 <1508000000>;
+ opp-microvolt = <1109000>;
+ };
+
+ opp-1573000000 {
+ opp-hz = /bits/ 64 <1573000000>;
+ opp-microvolt = <1125000>;
+ };
+ };
+
+ cpu_opp_table_b: opp_table_b {
+ compatible = "operating-points-v2";
+ opp-shared;
+
+ opp-507000000 {
+ opp-hz = /bits/ 64 <507000000>;
+ opp-microvolt = <828000>;
+ };
+
+ opp-702000000 {
+ opp-hz = /bits/ 64 <702000000>;
+ opp-microvolt = <867000>;
+ };
+
+ opp-1001000000 {
+ opp-hz = /bits/ 64 <1001000000>;
+ opp-microvolt = <927000>;
+ };
+
+ opp-1209000000 {
+ opp-hz = /bits/ 64 <1209000000>;
+ opp-microvolt = <968000>;
+ };
+
+ opp-1404000000 {
+ opp-hz = /bits/ 64 <1007000000>;
+ opp-microvolt = <1028000>;
+ };
+
+ opp-1612000000 {
+ opp-hz = /bits/ 64 <1612000000>;
+ opp-microvolt = <1049000>;
+ };
+
+ opp-1807000000 {
+ opp-hz = /bits/ 64 <1807000000>;
+ opp-microvolt = <1089000>;
+ };
+
+ opp-1989000000 {
+ opp-hz = /bits/ 64 <1989000000>;
+ opp-microvolt = <1125000>;
+ };
+ };
+
+ cpu0: cpu@0 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a53";
+ reg = <0x000>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_SLEEP_0>;
+ clocks = <&infracfg CLK_INFRA_CA53SEL>,
+ <&apmixedsys CLK_APMIXED_MAINPLL>;
+ clock-names = "cpu", "intermediate";
+ operating-points-v2 = <&cpu_opp_table_a>;
+ };
+
+ cpu1: cpu@1 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a53";
+ reg = <0x001>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_SLEEP_0>;
+ clocks = <&infracfg CLK_INFRA_CA53SEL>,
+ <&apmixedsys CLK_APMIXED_MAINPLL>;
+ clock-names = "cpu", "intermediate";
+ operating-points-v2 = <&cpu_opp_table_a>;
+ };
+
+ cpu2: cpu@100 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a57";
+ reg = <0x100>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_SLEEP_0>;
+ clocks = <&infracfg CLK_INFRA_CA57SEL>,
+ <&apmixedsys CLK_APMIXED_MAINPLL>;
+ clock-names = "cpu", "intermediate";
+ operating-points-v2 = <&cpu_opp_table_b>;
+ };
+
+ cpu3: cpu@101 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a57";
+ reg = <0x101>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_SLEEP_0>;
+ clocks = <&infracfg CLK_INFRA_CA57SEL>,
+ <&apmixedsys CLK_APMIXED_MAINPLL>;
+ clock-names = "cpu", "intermediate";
+ operating-points-v2 = <&cpu_opp_table_b>;
+ };
+
+ &cpu0 {
+ proc-supply = <&mt6397_vpca15_reg>;
+ };
+
+ &cpu1 {
+ proc-supply = <&mt6397_vpca15_reg>;
+ };
+
+ &cpu2 {
+ proc-supply = <&da9211_vcpu_reg>;
+ sram-supply = <&mt6397_vsramca7_reg>;
+ };
+
+ &cpu3 {
+ proc-supply = <&da9211_vcpu_reg>;
+ sram-supply = <&mt6397_vsramca7_reg>;
+ };
diff --git a/Documentation/devicetree/bindings/power/rockchip-io-domain.txt b/Documentation/devicetree/bindings/power/rockchip-io-domain.txt
index 43c21fb04564..4a4766e9c254 100644
--- a/Documentation/devicetree/bindings/power/rockchip-io-domain.txt
+++ b/Documentation/devicetree/bindings/power/rockchip-io-domain.txt
@@ -39,6 +39,8 @@ Required properties:
- "rockchip,rk3368-pmu-io-voltage-domain" for rk3368 pmu-domains
- "rockchip,rk3399-io-voltage-domain" for rk3399
- "rockchip,rk3399-pmu-io-voltage-domain" for rk3399 pmu-domains
+ - "rockchip,rv1108-io-voltage-domain" for rv1108
+ - "rockchip,rv1108-pmu-io-voltage-domain" for rv1108 pmu-domains
Deprecated properties:
- rockchip,grf: phandle to the syscon managing the "general register files"
diff --git a/Documentation/power/states.txt b/Documentation/power/states.txt
deleted file mode 100644
index bc4548245a24..000000000000
--- a/Documentation/power/states.txt
+++ /dev/null
@@ -1,125 +0,0 @@
-System Power Management Sleep States
-
-(C) 2014 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
-
-The kernel supports up to four system sleep states generically, although three
-of them depend on the platform support code to implement the low-level details
-for each state.
-
-The states are represented by strings that can be read or written to the
-/sys/power/state file. Those strings may be "mem", "standby", "freeze" and
-"disk", where the last three always represent Power-On Suspend (if supported),
-Suspend-To-Idle and hibernation (Suspend-To-Disk), respectively.
-
-The meaning of the "mem" string is controlled by the /sys/power/mem_sleep file.
-It contains strings representing the available modes of system suspend that may
-be triggered by writing "mem" to /sys/power/state. These modes are "s2idle"
-(Suspend-To-Idle), "shallow" (Power-On Suspend) and "deep" (Suspend-To-RAM).
-The "s2idle" mode is always available, while the other ones are only available
-if supported by the platform (if not supported, the strings representing them
-are not present in /sys/power/mem_sleep). The string representing the suspend
-mode to be used subsequently is enclosed in square brackets. Writing one of
-the other strings present in /sys/power/mem_sleep to it causes the suspend mode
-to be used subsequently to change to the one represented by that string.
-
-Consequently, there are two ways to cause the system to go into the
-Suspend-To-Idle sleep state. The first one is to write "freeze" directly to
-/sys/power/state. The second one is to write "s2idle" to /sys/power/mem_sleep
-and then to write "mem" to /sys/power/state. Similarly, there are two ways
-to cause the system to go into the Power-On Suspend sleep state (the strings to
-write to the control files in that case are "standby" or "shallow" and "mem",
-respectively) if that state is supported by the platform. In turn, there is
-only one way to cause the system to go into the Suspend-To-RAM state (write
-"deep" into /sys/power/mem_sleep and "mem" into /sys/power/state).
-
-The default suspend mode (ie. the one to be used without writing anything into
-/sys/power/mem_sleep) is either "deep" (if Suspend-To-RAM is supported) or
-"s2idle", but it can be overridden by the value of the "mem_sleep_default"
-parameter in the kernel command line.
-
-The properties of all of the sleep states are described below.
-
-
-State: Suspend-To-Idle
-ACPI state: S0
-Label: "s2idle" ("freeze")
-
-This state is a generic, pure software, light-weight, system sleep state.
-It allows more energy to be saved relative to runtime idle by freezing user
-space and putting all I/O devices into low-power states (possibly
-lower-power than available at run time), such that the processors can
-spend more time in their idle states.
-
-This state can be used for platforms without Power-On Suspend/Suspend-to-RAM
-support, or it can be used in addition to Suspend-to-RAM to provide reduced
-resume latency. It is always supported.
-
-
-State: Standby / Power-On Suspend
-ACPI State: S1
-Label: "shallow" ("standby")
-
-This state, if supported, offers moderate, though real, power savings, while
-providing a relatively low-latency transition back to a working system. No
-operating state is lost (the CPU retains power), so the system easily starts up
-again where it left off.
-
-In addition to freezing user space and putting all I/O devices into low-power
-states, which is done for Suspend-To-Idle too, nonboot CPUs are taken offline
-and all low-level system functions are suspended during transitions into this
-state. For this reason, it should allow more energy to be saved relative to
-Suspend-To-Idle, but the resume latency will generally be greater than for that
-state.
-
-
-State: Suspend-to-RAM
-ACPI State: S3
-Label: "deep"
-
-This state, if supported, offers significant power savings as everything in the
-system is put into a low-power state, except for memory, which should be placed
-into the self-refresh mode to retain its contents. All of the steps carried out
-when entering Power-On Suspend are also carried out during transitions to STR.
-Additional operations may take place depending on the platform capabilities. In
-particular, on ACPI systems the kernel passes control to the BIOS (platform
-firmware) as the last step during STR transitions and that usually results in
-powering down some more low-level components that aren't directly controlled by
-the kernel.
-
-System and device state is saved and kept in memory. All devices are suspended
-and put into low-power states. In many cases, all peripheral buses lose power
-when entering STR, so devices must be able to handle the transition back to the
-"on" state.
-
-For at least ACPI, STR requires some minimal boot-strapping code to resume the
-system from it. This may be the case on other platforms too.
-
-
-State: Suspend-to-disk
-ACPI State: S4
-Label: "disk"
-
-This state offers the greatest power savings, and can be used even in
-the absence of low-level platform support for power management. This
-state operates similarly to Suspend-to-RAM, but includes a final step
-of writing memory contents to disk. On resume, this is read and memory
-is restored to its pre-suspend state.
-
-STD can be handled by the firmware or the kernel. If it is handled by
-the firmware, it usually requires a dedicated partition that must be
-setup via another operating system for it to use. Despite the
-inconvenience, this method requires minimal work by the kernel, since
-the firmware will also handle restoring memory contents on resume.
-
-For suspend-to-disk, a mechanism called 'swsusp' (Swap Suspend) is used
-to write memory contents to free swap space. swsusp has some restrictive
-requirements, but should work in most cases. Some, albeit outdated,
-documentation can be found in Documentation/power/swsusp.txt.
-Alternatively, userspace can do most of the actual suspend to disk work,
-see userland-swsusp.txt.
-
-Once memory state is written to disk, the system may either enter a
-low-power state (like ACPI S4), or it may simply power down. Powering
-down offers greater savings, and allows this mechanism to work on any
-system. However, entering a real low-power state allows the user to
-trigger wake up events (e.g. pressing a key or opening a laptop lid).