Diffstat (limited to 'Documentation/admin-guide')
-rw-r--r-- Documentation/admin-guide/LSM/SELinux.rst 2
-rw-r--r-- Documentation/admin-guide/LSM/Smack.rst 4
-rw-r--r-- Documentation/admin-guide/cgroup-v2.rst 190
-rw-r--r-- Documentation/admin-guide/devices.rst 1
-rw-r--r-- Documentation/admin-guide/dynamic-debug-howto.rst 8
-rw-r--r-- Documentation/admin-guide/index.rst 1
-rw-r--r-- Documentation/admin-guide/kernel-parameters.txt 73
-rw-r--r-- Documentation/admin-guide/l1tf.rst 6
-rw-r--r-- Documentation/admin-guide/mm/concepts.rst 51
-rw-r--r-- Documentation/admin-guide/perf-security.rst 97
-rw-r--r-- Documentation/admin-guide/pm/cpuidle.rst 631
-rw-r--r-- Documentation/admin-guide/pm/intel_pstate.rst 10
-rw-r--r-- Documentation/admin-guide/pm/working-state.rst 1
-rw-r--r-- Documentation/admin-guide/ras.rst 2
-rw-r--r-- Documentation/admin-guide/security-bugs.rst 2
-rw-r--r-- Documentation/admin-guide/thunderbolt.rst 20
16 files changed, 1029 insertions, 70 deletions
diff --git a/Documentation/admin-guide/LSM/SELinux.rst b/Documentation/admin-guide/LSM/SELinux.rst
index f722c9b4173a..520a1c2c6fd2 100644
--- a/Documentation/admin-guide/LSM/SELinux.rst
+++ b/Documentation/admin-guide/LSM/SELinux.rst
@@ -6,7 +6,7 @@ If you want to use SELinux, chances are you will want
to use the distro-provided policies, or install the
latest reference policy release from
- http://oss.tresys.com/projects/refpolicy
+ https://github.com/SELinuxProject/refpolicy
However, if you want to install a dummy policy for
testing, you can do using ``mdp`` provided under
diff --git a/Documentation/admin-guide/LSM/Smack.rst b/Documentation/admin-guide/LSM/Smack.rst
index 6a5826a13aea..6d44f4fdbf59 100644
--- a/Documentation/admin-guide/LSM/Smack.rst
+++ b/Documentation/admin-guide/LSM/Smack.rst
@@ -818,6 +818,10 @@ Smack supports some mount options:
specifies a label to which all labels set on the
filesystem must have read access. Not yet enforced.
+ smackfstransmute=label:
+ behaves exactly like smackfsroot except that it also
+ sets the transmute flag on the root of the mount
+
These mount options apply to all file system types.
Smack auditing
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 476722b7b636..7bf3f129c68b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -56,11 +56,13 @@ v1 is available under Documentation/cgroup-v1/.
5-3-3-2. IO Latency Interface Files
5-4. PID
5-4-1. PID Interface Files
- 5-5. Device
- 5-6. RDMA
- 5-6-1. RDMA Interface Files
- 5-7. Misc
- 5-7-1. perf_event
+ 5-5. Cpuset
+ 5-5-1. Cpuset Interface Files
+ 5-6. Device
+ 5-7. RDMA
+ 5-7-1. RDMA Interface Files
+ 5-8. Misc
+ 5-8-1. perf_event
5-N. Non-normative information
5-N-1. CPU controller root cgroup process behaviour
5-N-2. IO controller root cgroup process behaviour
@@ -1610,6 +1612,176 @@ through fork() or clone(). These will return -EAGAIN if the creation
of a new process would cause a cgroup policy to be violated.
+Cpuset
+------
+
+The "cpuset" controller provides a mechanism for constraining
+the CPU and memory node placement of tasks to only the resources
+specified in the cpuset interface files in a task's current cgroup.
+This is especially valuable on large NUMA systems where placing jobs
+on properly sized subsets of the systems with careful processor and
+memory placement to reduce cross-node memory access and contention
+can improve overall system performance.
+
+The "cpuset" controller is hierarchical. That means the controller
+cannot use CPUs or memory nodes not allowed in its parent.
+
+
+Cpuset Interface Files
+~~~~~~~~~~~~~~~~~~~~~~
+
+ cpuset.cpus
+ A read-write multiple values file which exists on non-root
+ cpuset-enabled cgroups.
+
+ It lists the requested CPUs to be used by tasks within this
+ cgroup. The actual list of CPUs to be granted, however, is
+ subjected to constraints imposed by its parent and can differ
+ from the requested CPUs.
+
+ The CPU numbers are comma-separated numbers or ranges.
+ For example:
+
+ # cat cpuset.cpus
+ 0-4,6,8-10
+
+ An empty value indicates that the cgroup is using the same
+ setting as the nearest cgroup ancestor with a non-empty
+ "cpuset.cpus" or all the available CPUs if none is found.
+
+ The value of "cpuset.cpus" stays constant until the next update
+ and won't be affected by any CPU hotplug events.
+
+ cpuset.cpus.effective
+ A read-only multiple values file which exists on all
+ cpuset-enabled cgroups.
+
+ It lists the onlined CPUs that are actually granted to this
+ cgroup by its parent. These CPUs are allowed to be used by
+ tasks within the current cgroup.
+
+ If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows
+ all the CPUs from the parent cgroup that can be available to
+ be used by this cgroup. Otherwise, it should be a subset of
+ "cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus"
+ can be granted. In this case, it will be treated just like an
+ empty "cpuset.cpus".
+
+ Its value will be affected by CPU hotplug events.
+
+ cpuset.mems
+ A read-write multiple values file which exists on non-root
+ cpuset-enabled cgroups.
+
+ It lists the requested memory nodes to be used by tasks within
+ this cgroup. The actual list of memory nodes granted, however,
+ is subjected to constraints imposed by its parent and can differ
+ from the requested memory nodes.
+
+ The memory node numbers are comma-separated numbers or ranges.
+ For example:
+
+ # cat cpuset.mems
+ 0-1,3
+
+ An empty value indicates that the cgroup is using the same
+ setting as the nearest cgroup ancestor with a non-empty
+ "cpuset.mems" or all the available memory nodes if none
+ is found.
+
+ The value of "cpuset.mems" stays constant until the next update
+ and won't be affected by any memory nodes hotplug events.
+
+ cpuset.mems.effective
+ A read-only multiple values file which exists on all
+ cpuset-enabled cgroups.
+
+ It lists the onlined memory nodes that are actually granted to
+ this cgroup by its parent. These memory nodes are allowed to
+ be used by tasks within the current cgroup.
+
+ If "cpuset.mems" is empty, it shows all the memory nodes from the
+ parent cgroup that will be available to be used by this cgroup.
+ Otherwise, it should be a subset of "cpuset.mems" unless none of
+ the memory nodes listed in "cpuset.mems" can be granted. In this
+ case, it will be treated just like an empty "cpuset.mems".
+
+ Its value will be affected by memory nodes hotplug events.
+
+ cpuset.cpus.partition
+ A read-write single value file which exists on non-root
+ cpuset-enabled cgroups. This flag is owned by the parent cgroup
+ and is not delegatable.
+
+ It accepts only the following input values when written to.
+
+ "root" - a partition root
+ "member" - a non-root member of a partition
+
+ When set to be a partition root, the current cgroup is the
+ root of a new partition or scheduling domain that comprises
+ itself and all its descendants except those that are separate
+ partition roots themselves and their descendants. The root
+ cgroup is always a partition root.
+
+ There are constraints on where a partition root can be set.
+ It can only be set in a cgroup if all the following conditions
+ are true.
+
+ 1) The "cpuset.cpus" is not empty and the list of CPUs are
+ exclusive, i.e. they are not shared by any of its siblings.
+ 2) The parent cgroup is a partition root.
+ 3) The "cpuset.cpus" is also a proper subset of the parent's
+ "cpuset.cpus.effective".
+ 4) There are no child cgroups with cpuset enabled. This is for
+ eliminating corner cases that have to be handled if such a
+ condition is allowed.
+
+ Setting it to partition root will take the CPUs away from the
+ effective CPUs of the parent cgroup. Once it is set, this
+ file cannot be reverted back to "member" if there are any child
+ cgroups with cpuset enabled.
+
+ A parent partition cannot distribute all its CPUs to its
+ child partitions. There must be at least one cpu left in the
+ parent partition.
+
+ Once becoming a partition root, changes to "cpuset.cpus" are
+ generally allowed as long as the first condition above is true,
+ the change will not take away all the CPUs from the parent
+ partition and the new "cpuset.cpus" value is a superset of its
+ children's "cpuset.cpus" values.
+
+ Sometimes, external factors like changes to ancestors'
+ "cpuset.cpus" or cpu hotplug can cause the state of the partition
+ root to change. On read, the "cpuset.cpus.partition" file
+ can show the following values.
+
+ "member" Non-root member of a partition
+ "root" Partition root
+ "root invalid" Invalid partition root
+
+ It is a partition root if the first 2 partition root conditions
+ above are true and at least one CPU from "cpuset.cpus" is
+ granted by the parent cgroup.
+
+ A partition root can become invalid if none of CPUs requested
+ in "cpuset.cpus" can be granted by the parent cgroup or the
+ parent cgroup is no longer a partition root itself. In this
+ case, it is not a real partition even though the restriction
+ of the first partition root condition above will still apply.
+ The cpu affinity of all the tasks in the cgroup will then be
+ associated with CPUs in the nearest ancestor partition.
+
+ An invalid partition root can be transitioned back to a
+ real partition root if at least one of the requested CPUs
+ can now be granted by its parent. In this case, the cpu
+ affinity of all the tasks in the formerly invalid partition
+ will be associated to the CPUs of the newly formed partition.
+ Changing the partition state of an invalid partition root to
+ "member" is always allowed even if child cpusets are present.
+
+
Device controller
-----------------
@@ -1879,8 +2051,10 @@ following two functions.
wbc_init_bio(@wbc, @bio)
Should be called for each bio carrying writeback data and
- associates the bio with the inode's owner cgroup. Can be
- called anytime between bio allocation and submission.
+ associates the bio with the inode's owner cgroup and the
+ corresponding request queue. This must be called after
+ a queue (device) has been associated with the bio and
+ before submission.
wbc_account_io(@wbc, @page, @bytes)
Should be called for each data segment being written out.
@@ -1899,7 +2073,7 @@ the configuration, the bio may be executed at a lower priority and if
the writeback session is holding shared resources, e.g. a journal
entry, may lead to priority inversion. There is no one easy solution
for the problem. Filesystems can try to work around specific problem
-cases by skipping wbc_init_bio() or using bio_associate_blkcg()
+cases by skipping wbc_init_bio() and using bio_associate_blkg()
directly.
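Note on the cpuset hunk above: "cpuset.cpus" and "cpuset.mems" hold comma-separated numbers or ranges, such as 0-4,6,8-10. The following is a minimal Python sketch of expanding that format, assuming well-formed input. The function name parse_cpu_list is hypothetical and is not part of any kernel or library interface.

```python
def parse_cpu_list(text):
    """Expand a list such as "0-4,6,8-10" into a sorted list of integers.

    An empty string yields an empty list; per the documentation above, an
    empty "cpuset.cpus" means the setting is inherited from an ancestor.
    """
    text = text.strip()
    if not text:
        return []
    ids = set()
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            # A range "a-b" is inclusive on both ends.
            low, high = part.split("-", 1)
            ids.update(range(int(low), int(high) + 1))
        else:
            ids.add(int(part))
    return sorted(ids)
```

The same helper applies to the memory node lists, since both files share the format.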
diff --git a/Documentation/admin-guide/devices.rst b/Documentation/admin-guide/devices.rst
index 7fadc05330dd..d41671aeaef0 100644
--- a/Documentation/admin-guide/devices.rst
+++ b/Documentation/admin-guide/devices.rst
@@ -1,3 +1,4 @@
+.. _admin_devices:
Linux allocated devices (4.x+ version)
======================================
diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst
index fdf72429f801..252e5ef324e5 100644
--- a/Documentation/admin-guide/dynamic-debug-howto.rst
+++ b/Documentation/admin-guide/dynamic-debug-howto.rst
@@ -110,8 +110,8 @@ If your query set is big, you can batch them too::
~# cat query-batch-file > <debugfs>/dynamic_debug/control
-A another way is to use wildcard. The match rule support ``*`` (matches
-zero or more characters) and ``?`` (matches exactly one character).For
+Another way is to use wildcards. The match rule supports ``*`` (matches
+zero or more characters) and ``?`` (matches exactly one character). For
example, you can match all usb drivers::
~# echo "file drivers/usb/* +p" > <debugfs>/dynamic_debug/control
@@ -258,7 +258,7 @@ this boot parameter for debugging purposes.
If ``foo`` module is not built-in, ``foo.dyndbg`` will still be processed at
boot time, without effect, but will be reprocessed when module is
-loaded later. ``dyndbg_query=`` and bare ``dyndbg=`` are only processed at
+loaded later. ``ddebug_query=`` and bare ``dyndbg=`` are only processed at
boot.
@@ -301,7 +301,7 @@ The ``dyndbg`` option is a "fake" module parameter, which means:
For ``CONFIG_DYNAMIC_DEBUG`` kernels, any settings given at boot-time (or
enabled by ``-DDEBUG`` flag during compilation) can be disabled later via
-the sysfs interface if the debug messages are no longer needed::
+the debugfs interface if the debug messages are no longer needed::
echo "module module_name -p" > <debugfs>/dynamic_debug/control
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 965745d5fb9a..0a491676685e 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -76,6 +76,7 @@ configure specific aspects of kernel behavior to your liking.
thunderbolt
LSM/index
mm/index
+ perf-security
.. only:: subproject and html
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 98ee9eaa52be..b799bcf67d7b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -331,7 +331,7 @@
APC and your system crashes randomly.
apic= [APIC,X86] Advanced Programmable Interrupt Controller
- Change the output verbosity whilst booting
+ Change the output verbosity while booting
Format: { quiet (default) | verbose | debug }
Change the amount of debugging information output
when initialising the APIC and IO-APIC components.
@@ -486,10 +486,14 @@
cut the overhead, others just disable the usage. So
only cgroup_disable=memory is actually worthy}
- cgroup_no_v1= [KNL] Disable one, multiple, all cgroup controllers in v1
- Format: { controller[,controller...] | "all" }
+ cgroup_no_v1= [KNL] Disable cgroup controllers and named hierarchies in v1
+ Format: { { controller | "all" | "named" }
+ [,{ controller | "all" | "named" }...] }
Like cgroup_disable, but only applies to cgroup v1;
the blacklisted controllers remain available in cgroup2.
+ "all" blacklists all controllers and "named" disables
+ named mounts. Specifying both "all" and "named" disables
+ all v1 hierarchies.
cgroup.memory= [KNL] Pass options to the cgroup memory controller.
Format: <string>
@@ -674,6 +678,9 @@
cpuidle.off=1 [CPU_IDLE]
disable the cpuidle sub-system
+ cpuidle.governor=
+ [CPU_IDLE] Name of the cpuidle governor to use.
+
cpufreq.off=1 [CPU_FREQ]
disable the cpufreq sub-system
@@ -1689,12 +1696,12 @@
By default, super page will be supported if Intel IOMMU
has the capability. With this option, super page will
not be supported.
- ecs_off [Default Off]
- By default, extended context tables will be supported if
- the hardware advertises that it has support both for the
- extended tables themselves, and also PASID support. With
- this option set, extended tables will not be used even
- on hardware which claims to support them.
+ sm_off [Default Off]
+ By default, scalable mode will be supported if the
+ hardware advertises that it has support for the scalable
+ mode translation. With this option set, scalable mode
+ will not be used even on hardware which claims to support
+ it.
tboot_noforce [Default Off]
Do not force the Intel IOMMU enabled under tboot.
By default, tboot will force Intel IOMMU on, which
@@ -2102,6 +2109,9 @@
off
Disables hypervisor mitigations and doesn't
emit any warnings.
+ It also drops the swap size and available
+ RAM limit restriction on both hypervisor and
+ bare metal.
Default is 'flush'.
@@ -2833,7 +2843,7 @@
check bypass). With this option data leaks are possible
in the system.
- nospectre_v2 [X86] Disable all mitigations for the Spectre variant 2
+ nospectre_v2 [X86,PPC_FSL_BOOK3E] Disable all mitigations for the Spectre variant 2
(indirect branch prediction) vulnerability. System may
allow data leaks with this option, which is equivalent
to spectre_v2=off.
@@ -3088,6 +3098,14 @@
timeout < 0: reboot immediately
Format: <timeout>
+ panic_print= Bitmask for printing system info when panic happens.
+ User can choose a combination of the following bits:
+ bit 0: print all tasks info
+ bit 1: print system memory info
+ bit 2: print timer info
+ bit 3: print locks info if CONFIG_LOCKDEP is on
+ bit 4: print ftrace buffer
+
panic_on_warn panic() instead of WARN(). Useful to cause kdump
on a WARN().
@@ -3754,24 +3772,6 @@
in microseconds. The default of zero says
no holdoff.
- rcutorture.cbflood_inter_holdoff= [KNL]
- Set holdoff time (jiffies) between successive
- callback-flood tests.
-
- rcutorture.cbflood_intra_holdoff= [KNL]
- Set holdoff time (jiffies) between successive
- bursts of callbacks within a given callback-flood
- test.
-
- rcutorture.cbflood_n_burst= [KNL]
- Set the number of bursts making up a given
- callback-flood test. Set this to zero to
- disable callback-flood testing.
-
- rcutorture.cbflood_n_per_burst= [KNL]
- Set the number of callbacks to be registered
- in a given burst of a callback-flood test.
-
rcutorture.fqs_duration= [KNL]
Set duration of force_quiescent_state bursts
in microseconds.
@@ -3784,6 +3784,23 @@
Set wait time between force_quiescent_state bursts
in seconds.
+ rcutorture.fwd_progress= [KNL]
+ Enable RCU grace-period forward-progress testing
+ for the types of RCU supporting this notion.
+
+ rcutorture.fwd_progress_div= [KNL]
+ Specify the fraction of a CPU-stall-warning
+ period to do tight-loop forward-progress testing.
+
+ rcutorture.fwd_progress_holdoff= [KNL]
+ Number of seconds to wait between successive
+ forward-progress tests.
+
+ rcutorture.fwd_progress_need_resched= [KNL]
+ Enclose cond_resched() calls within checks for
+ need_resched() during tight-loop forward-progress
+ testing.
+
rcutorture.gp_cond= [KNL]
Use conditional/asynchronous update-side
primitives, if available.
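Note on the panic_print= hunk above: the parameter is a bitmask, so the value to pass is the OR of the chosen bits. The following is a minimal Python sketch of composing and decoding such a value, using only the five bits listed in this patch. The names PANIC_PRINT_BITS, panic_print_mask and describe_panic_print are hypothetical helpers for illustration.

```python
# Bit meanings as listed in the panic_print= entry of this patch.
PANIC_PRINT_BITS = {
    0: "all tasks info",
    1: "system memory info",
    2: "timer info",
    3: "locks info (if CONFIG_LOCKDEP is on)",
    4: "ftrace buffer",
}


def panic_print_mask(*bits):
    """Return the integer bitmask with the given bit numbers set."""
    mask = 0
    for bit in bits:
        if bit not in PANIC_PRINT_BITS:
            raise ValueError("unknown panic_print bit: %r" % (bit,))
        mask |= 1 << bit
    return mask


def describe_panic_print(mask):
    """List the descriptions of the bits that are set in mask."""
    return [desc for bit, desc in sorted(PANIC_PRINT_BITS.items())
            if mask & (1 << bit)]
```

For instance, selecting bits 0 and 1 gives the value 3, which would be passed as panic_print=3.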
diff --git a/Documentation/admin-guide/l1tf.rst b/Documentation/admin-guide/l1tf.rst
index b85dd80510b0..9af977384168 100644
--- a/Documentation/admin-guide/l1tf.rst
+++ b/Documentation/admin-guide/l1tf.rst
@@ -405,6 +405,9 @@ time with the option "l1tf=". The valid arguments for this option are:
off Disables hypervisor mitigations and doesn't emit any
warnings.
+ It also drops the swap size and available RAM limit restrictions
+ on both hypervisor and bare metal.
+
============ =============================================================
The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
@@ -576,7 +579,8 @@ Default mitigations
The kernel default mitigations for vulnerable processors are:
- PTE inversion to protect against malicious user space. This is done
- unconditionally and cannot be controlled.
+ unconditionally and cannot be controlled. The swap storage is limited
+ to ~16TB.
- L1D conditional flushing on VMENTER when EPT is enabled for
a guest.
diff --git a/Documentation/admin-guide/mm/concepts.rst b/Documentation/admin-guide/mm/concepts.rst
index 291699c810d4..c2531b14bf46 100644
--- a/Documentation/admin-guide/mm/concepts.rst
+++ b/Documentation/admin-guide/mm/concepts.rst
@@ -4,13 +4,13 @@
Concepts overview
=================
-The memory management in Linux is complex system that evolved over the
-years and included more and more functionality to support variety of
+The memory management in Linux is a complex system that evolved over the
+years and included more and more functionality to support a variety of
systems from MMU-less microcontrollers to supercomputers. The memory
-management for systems without MMU is called ``nommu`` and it
+management for systems without an MMU is called ``nommu`` and it
definitely deserves a dedicated document, which hopefully will be
eventually written. Yet, although some of the concepts are the same,
-here we assume that MMU is available and CPU can translate a virtual
+here we assume that an MMU is available and a CPU can translate a virtual
address to a physical address.
.. contents:: :local:
@@ -21,10 +21,10 @@ Virtual Memory Primer
The physical memory in a computer system is a limited resource and
even for systems that support memory hotplug there is a hard limit on
the amount of memory that can be installed. The physical memory is not
-necessary contiguous, it might be accessible as a set of distinct
+necessarily contiguous; it might be accessible as a set of distinct
address ranges. Besides, different CPU architectures, and even
-different implementations of the same architecture have different view
-how these address ranges defined.
+different implementations of the same architecture have different views
+of how these address ranges are defined.
All this makes dealing directly with physical memory quite complex and
to avoid this complexity a concept of virtual memory was developed.
@@ -48,8 +48,8 @@ appropriate kernel configuration option.
Each physical memory page can be mapped as one or more virtual
pages. These mappings are described by page tables that allow
-translation from virtual address used by programs to real address in
-the physical memory. The page tables organized hierarchically.
+translation from a virtual address used by programs to the physical
+memory address. The page tables are organized hierarchically.
The tables at the lowest level of the hierarchy contain physical
addresses of actual pages used by the software. The tables at higher
@@ -121,8 +121,8 @@ Nodes
Many multi-processor machines are NUMA - Non-Uniform Memory Access -
systems. In such systems the memory is arranged into banks that have
different access latency depending on the "distance" from the
-processor. Each bank is referred as `node` and for each node Linux
-constructs an independent memory management subsystem. A node has it's
+processor. Each bank is referred to as a `node` and for each node Linux
+constructs an independent memory management subsystem. A node has its
own set of zones, lists of free and used pages and various statistics
counters. You can find more details about NUMA in
:ref:`Documentation/vm/numa.rst <numa>` and in
@@ -149,9 +149,9 @@ for program's stack and heap or by explicit calls to mmap(2) system
call. Usually, the anonymous mappings only define virtual memory areas
that the program is allowed to access. The read accesses will result
in creation of a page table entry that references a special physical
-page filled with zeroes. When the program performs a write, regular
+page filled with zeroes. When the program performs a write, a regular
physical page will be allocated to hold the written data. The page
-will be marked dirty and if the kernel will decide to repurpose it,
+will be marked dirty and if the kernel decides to repurpose it,
the dirty page will be swapped out.
Reclaim
@@ -181,8 +181,8 @@ pressure.
The process of freeing the reclaimable physical memory pages and
repurposing them is called (surprise!) `reclaim`. Linux can reclaim
pages either asynchronously or synchronously, depending on the state
-of the system. When system is not loaded, most of the memory is free
-and allocation request will be satisfied immediately from the free
+of the system. When the system is not loaded, most of the memory is free
+and allocation requests will be satisfied immediately from the free
pages supply. As the load increases, the amount of the free pages goes
down and when it reaches a certain threshold (high watermark), an
allocation request will awaken the ``kswapd`` daemon. It will
@@ -190,7 +190,7 @@ asynchronously scan memory pages and either just free them if the data
they contain is available elsewhere, or evict to the backing storage
device (remember those dirty pages?). As memory usage increases even
more and reaches another threshold - min watermark - an allocation
-will trigger the `direct reclaim`. In this case allocation is stalled
+will trigger `direct reclaim`. In this case allocation is stalled
until enough memory pages are reclaimed to satisfy the request.
Compaction
@@ -200,7 +200,7 @@ As the system runs, tasks allocate and free the memory and it becomes
fragmented. Although with virtual memory it is possible to present
scattered physical pages as virtually contiguous range, sometimes it is
necessary to allocate large physically contiguous memory areas. Such
-need may arise, for instance, when a device driver requires large
+need may arise, for instance, when a device driver requires a large
buffer for DMA, or when THP allocates a huge page. Memory `compaction`
addresses the fragmentation issue. This mechanism moves occupied pages
from the lower part of a memory zone to free pages in the upper part
@@ -208,15 +208,16 @@ of the zone. When a compaction scan is finished free pages are grouped
together at the beginning of the zone and allocations of large
physically contiguous areas become possible.
-Like reclaim, the compaction may happen asynchronously in ``kcompactd``
-daemon or synchronously as a result of memory allocation request.
+Like reclaim, the compaction may happen asynchronously in the ``kcompactd``
+daemon or synchronously as a result of a memory allocation request.
OOM killer
==========
-It may happen, that on a loaded machine memory will be exhausted. When
-the kernel detects that the system runs out of memory (OOM) it invokes
-`OOM killer`. Its mission is simple: all it has to do is to select a
-task to sacrifice for the sake of the overall system health. The
-selected task is killed in a hope that after it exits enough memory
-will be freed to continue normal operation.
+It is possible that on a loaded machine memory will be exhausted and the
+kernel will be unable to reclaim enough memory to continue to operate. In
+order to save the rest of the system, it invokes the `OOM killer`.
+
+The `OOM killer` selects a task to sacrifice for the sake of the overall
+system health. The selected task is killed in a hope that after it exits
+enough memory will be freed to continue normal operation.
diff --git a/Documentation/admin-guide/perf-security.rst b/Documentation/admin-guide/perf-security.rst
new file mode 100644
index 000000000000..f73ebfe9bfe2
--- /dev/null
+++ b/Documentation/admin-guide/perf-security.rst
@@ -0,0 +1,97 @@
+.. _perf_security:
+
+Perf Events and tool security
+=============================
+
+Overview
+--------
+
+Usage of Performance Counters for Linux (perf_events) [1]_ , [2]_ , [3]_ can
+impose a considerable risk of leaking sensitive data accessed by monitored
+processes. The data leakage is possible both in scenarios of direct usage of
+perf_events system call API [2]_ and over data files generated by Perf tool user
+mode utility (Perf) [3]_ , [4]_ . The risk depends on the nature of data that
+perf_events performance monitoring units (PMU) [2]_ collect and expose for
+performance analysis. That said, perf_events/Perf performance monitoring
+is subject to security access control management [5]_ .
+
+perf_events/Perf access control
+-------------------------------
+
+To perform security checks, the Linux implementation splits processes into two
+categories [6]_ : a) privileged processes (whose effective user ID is 0, referred
+to as superuser or root), and b) unprivileged processes (whose effective UID is
+nonzero). Privileged processes bypass all kernel security permission checks so
+perf_events performance monitoring is fully available to privileged processes
+without access, scope and resource restrictions.
+
+Unprivileged processes are subject to a full security permission check based on
+the process's credentials [5]_ (usually: effective UID, effective GID, and
+supplementary group list).
+
+Linux divides the privileges traditionally associated with superuser into
+distinct units, known as capabilities [6]_ , which can be independently enabled
+and disabled on per-thread basis for processes and files of unprivileged users.
+
+Unprivileged processes with enabled CAP_SYS_ADMIN capability are treated as
+privileged processes with respect to perf_events performance monitoring and
+bypass *scope* permissions checks in the kernel.
+
+Unprivileged processes using the perf_events system call API are also subject to
+PTRACE_MODE_READ_REALCREDS ptrace access mode check [7]_ , whose outcome
+determines whether monitoring is permitted. So unprivileged processes provided
+with CAP_SYS_PTRACE capability are effectively permitted to pass the check.
+
+Other capabilities being granted to unprivileged processes can effectively
+enable capturing of additional data required for later performance analysis of
+monitored processes or a system. For example, CAP_SYSLOG capability permits
+reading kernel space memory addresses from /proc/kallsyms file.
+
+perf_events/Perf unprivileged users
+-----------------------------------
+
+perf_events/Perf *scope* and *access* control for unprivileged processes is
+governed by perf_event_paranoid [2]_ setting:
+
+-1:
+ Impose no *scope* and *access* restrictions on using perf_events performance
+ monitoring. Per-user per-cpu perf_event_mlock_kb [2]_ locking limit is
+ ignored when allocating memory buffers for storing performance data.
+ This is the least secure mode since allowed monitored *scope* is
+ maximized and no perf_events specific limits are imposed on *resources*
+ allocated for performance monitoring.
+
+>=0:
+ *scope* includes per-process and system wide performance monitoring
+ but excludes raw tracepoints and ftrace function tracepoints monitoring.
+ CPU and system events happened when executing either in user or
+ in kernel space can be monitored and captured for later analysis.
+ Per-user per-cpu perf_event_mlock_kb locking limit is imposed but
+ ignored for unprivileged processes with CAP_IPC_LOCK [6]_ capability.
+
+>=1:
+ *scope* includes per-process performance monitoring only and excludes
+ system wide performance monitoring. CPU and system events happened when
+ executing either in user or in kernel space can be monitored and
+ captured for later analysis. Per-user per-cpu perf_event_mlock_kb
+ locking limit is imposed but ignored for unprivileged processes with
+ CAP_IPC_LOCK capability.
+
+>=2:
+ *scope* includes per-process performance monitoring only. CPU and system
+ events happened when executing in user space only can be monitored and
+ captured for later analysis. Per-user per-cpu perf_event_mlock_kb
+ locking limit is imposed but ignored for unprivileged processes with
+ CAP_IPC_LOCK capability.
+
+Bibliography
+------------
+
+.. [1] `<https://lwn.net/Articles/337493/>`_
+.. [2] `<http://man7.org/linux/man-pages/man2/perf_event_open.2.html>`_
+.. [3] `<http://web.eece.maine.edu/~vweaver/projects/perf_events/>`_
+.. [4] `<https://perf.wiki.kernel.org/index.php/Main_Page>`_
+.. [5] `<https://www.kernel.org/doc/html/latest/security/credentials.html>`_
+.. [6] `<http://man7.org/linux/man-pages/man7/capabilities.7.html>`_
+.. [7] `<http://man7.org/linux/man-pages/man2/ptrace.2.html>`_
+
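Note on the perf_event_paranoid levels in the file added above: the levels are cumulative, so each higher value removes one more part of the monitoring scope. The following is a minimal Python sketch that restates the four levels as a lookup, assuming an unprivileged process without extra capabilities. The function name and dictionary keys are hypothetical and only summarize the text of this patch.

```python
def unprivileged_perf_scope(level):
    """Summarize what an unprivileged process may monitor at a given
    perf_event_paranoid level, following the descriptions in the patch."""
    return {
        # Raw and ftrace function tracepoints are excluded from level 0 up.
        "raw_tracepoints": level < 0,
        # System wide monitoring is excluded from level 1 up.
        "system_wide": level < 1,
        # Events from kernel space execution are excluded from level 2 up.
        "kernel_space_events": level < 2,
        # The perf_event_mlock_kb limit is ignored only at level -1.
        "mlock_limit_imposed": level >= 0,
    }
```

The sketch deliberately ignores capabilities such as CAP_SYS_ADMIN and CAP_IPC_LOCK, which the text describes as relaxing these restrictions.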
diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst
new file mode 100644
index 000000000000..106379e2619f
--- /dev/null
+++ b/Documentation/admin-guide/pm/cpuidle.rst
@@ -0,0 +1,631 @@
+.. |struct cpuidle_state| replace:: :c:type:`struct cpuidle_state <cpuidle_state>`
+.. |cpufreq| replace:: :doc:`CPU Performance Scaling <cpufreq>`
+
+========================
+CPU Idle Time Management
+========================
+
+::
+
+ Copyright (c) 2018 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
+Concepts
+========
+
+Modern processors are generally able to enter states in which the execution of
+a program is suspended and instructions belonging to it are not fetched from
+memory or executed. Those states are the *idle* states of the processor.
+
+Since part of the processor hardware is not used in idle states, entering them
+generally allows power drawn by the processor to be reduced and, in consequence,
+it is an opportunity to save energy.
+
+CPU idle time management is an energy-efficiency feature concerned with using
+the idle states of processors for this purpose.
+
+Logical CPUs
+------------
+
+CPU idle time management operates on CPUs as seen by the *CPU scheduler* (that
+is the part of the kernel responsible for the distribution of computational
+work in the system). In its view, CPUs are *logical* units. That is, they need
+not be separate physical entities and may just be interfaces appearing to
+software as individual single-core processors. In other words, a CPU is an
+entity which appears to be fetching instructions that belong to one sequence
+(program) from memory and executing them, but it need not work this way
+physically. Generally, three different cases can be considered here.
+
+First, if the whole processor can only follow one sequence of instructions (one
+program) at a time, it is a CPU. In that case, if the hardware is asked to
+enter an idle state, that applies to the processor as a whole.
+
+Second, if the processor is multi-core, each core in it is able to follow at
+least one program at a time. The cores need not be entirely independent of each
+other (for example, they may share caches), but still most of the time they
+work physically in parallel with each other, so if each of them executes only
+one program, those programs run mostly independently of each other at the same
+time. The entire cores are CPUs in that case and if the hardware is asked to
+enter an idle state, that applies to the core that asked for it in the first
+place, but it also may apply to a larger unit (say a "package" or a "cluster")
+that the core belongs to (in fact, it may apply to an entire hierarchy of larger
+units containing the core). Namely, if all of the cores in the larger unit
+except for one have been put into idle states at the "core level" and the
+remaining core asks the processor to enter an idle state, that may trigger it
+to put the whole larger unit into an idle state which also will affect the
+other cores in that unit.
+
+Finally, each core in a multi-core processor may be able to follow more than one
+program in the same time frame (that is, each core may be able to fetch