From b851ee7921fabdd7dfc96ffc4e9609f5062bd12b Mon Sep 17 00:00:00 2001
From: Li Zefan
Date: Wed, 18 Feb 2009 14:48:14 -0800
Subject: cgroups: update documentation about css_set hash table

The css_set hash table was introduced in 2.6.26, so update the
documentation accordingly.

Signed-off-by: Li Zefan
Acked-by: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/cgroups/cgroups.txt | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

(limited to 'Documentation/cgroups')

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index d9e5d6f41b92..93feb8444489 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -252,10 +252,8 @@ cgroup file system directories.
 When a task is moved from one cgroup to another, it gets a new
 css_set pointer - if there's an already existing css_set with the
 desired collection of cgroups then that group is reused, else a new
-css_set is allocated. Note that the current implementation uses a
-linear search to locate an appropriate existing css_set, so isn't
-very efficient. A future version will use a hash table for better
-performance.
+css_set is allocated. The appropriate existing css_set is located by
+looking into a hash table.

 To allow access from a cgroup to the css_sets (and hence tasks) that
 comprise it, a set of cg_cgroup_link objects form a lattice;
-- cgit v1.2.3

From 3fd076dd955a34c35dc456f4ef676e03cdced044 Mon Sep 17 00:00:00 2001
From: Li Zefan
Date: Fri, 20 Feb 2009 15:38:48 -0800
Subject: cpuset: various documentation fixes and updates

I noticed the old commit 8f5aa26c75b7722e80c0c5c5bb833d41865d7019
("cpusets: update_cpumask documentation fix") is not a complete fix,
resulting in inconsistent paragraphs.
This patch fixes it and does other fixes and updates:
- s/migrate_all_tasks()/migrate_live_tasks()/
- describe more cpuset control files
- s/cpumask_t/struct cpumask/
- document cpu hotplug and change of 'sched_relax_domain_level' may cause
  domain rebuild
- document various ways to query and modify cpusets
- the equivalent of "mount -t cpuset" is "mount -t cgroup -o cpuset,noprefix"

Signed-off-by: Li Zefan
Acked-by: Randy Dunlap
Cc: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/cgroups/cpusets.txt | 65 ++++++++++++++++++++++-----------------
 1 file changed, 37 insertions(+), 28 deletions(-)

(limited to 'Documentation/cgroups')

diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 5c86c258c791..0611e9528c7c 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -142,7 +142,7 @@ into the rest of the kernel, none in performance critical paths:
  - in fork and exit, to attach and detach a task from its cpuset.
  - in sched_setaffinity, to mask the requested CPUs by what's
    allowed in that tasks cpuset.
- - in sched.c migrate_all_tasks(), to keep migrating tasks within
+ - in sched.c migrate_live_tasks(), to keep migrating tasks within
    the CPUs allowed by their cpuset, if possible.
  - in the mbind and set_mempolicy system calls, to mask the requested
    Memory Nodes by what's allowed in that tasks cpuset.
@@ -175,6 +175,10 @@ files describing that cpuset:
  - mem_exclusive flag: is memory placement exclusive?
  - mem_hardwall flag: is memory allocation hardwalled
  - memory_pressure: measure of how much paging pressure in cpuset
+ - memory_spread_page flag: if set, spread page cache evenly on allowed nodes
+ - memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
+ - sched_load_balance flag: if set, load balance within CPUs on that cpuset
+ - sched_relax_domain_level: the searching range when migrating tasks

 In addition, the root cpuset only has the following file:
  - memory_pressure_enabled flag: compute memory_pressure?
@@ -252,7 +256,7 @@ is causing.

 This is useful both on tightly managed systems running a wide mix of
 submitted jobs, which may choose to terminate or re-prioritize jobs that
-are trying to use more memory than allowed on the nodes assigned them,
+are trying to use more memory than allowed on the nodes assigned to them,
 and with tightly coupled, long running, massively parallel scientific
 computing jobs that will dramatically fail to meet required performance
 goals if they start to use more memory than allowed to them.
@@ -378,7 +382,7 @@ as cpusets and sched_setaffinity.
 The algorithmic cost of load balancing and its impact on key shared
 kernel data structures such as the task list increases more than
 linearly with the number of CPUs being balanced. So the scheduler
-has support to partition the systems CPUs into a number of sched
+has support to partition the systems CPUs into a number of sched
 domains such that it only load balances within each sched domain.
 Each sched domain covers some subset of the CPUs in the system;
 no two sched domains overlap; some CPUs might not be in any sched
@@ -485,17 +489,22 @@ of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
 The internal kernel cpuset to scheduler interface passes from the
 cpuset code to the scheduler code a partition of the load balanced
 CPUs in the system.
 This partition is a set of subsets (represented
-as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all
-the CPUs that must be load balanced.
-
-Whenever the 'sched_load_balance' flag changes, or CPUs come or go
-from a cpuset with this flag enabled, or a cpuset with this flag
-enabled is removed, the cpuset code builds a new such partition and
-passes it to the scheduler sched domain setup code, to have the sched
-domains rebuilt as necessary.
+as an array of struct cpumask) of CPUs, pairwise disjoint, that cover
+all the CPUs that must be load balanced.
+
+The cpuset code builds a new such partition and passes it to the
+scheduler sched domain setup code, to have the sched domains rebuilt
+as necessary, whenever:
+ - the 'sched_load_balance' flag of a cpuset with non-empty CPUs changes,
+ - or CPUs come or go from a cpuset with this flag enabled,
+ - or 'sched_relax_domain_level' value of a cpuset with non-empty CPUs
+   and with this flag enabled changes,
+ - or a cpuset with non-empty CPUs and with this flag enabled is removed,
+ - or a cpu is offlined/onlined.

 This partition exactly defines what sched domains the scheduler should
-setup - one sched domain for each element (cpumask_t) in the partition.
+setup - one sched domain for each element (struct cpumask) in the
+partition.

 The scheduler remembers the currently active sched domain partitions.
 When the scheduler routine partition_sched_domains() is invoked from
@@ -559,7 +568,7 @@ domain, the largest value among those is used. Be careful, if one
 requests 0 and others are -1 then 0 is used.

 Note that modifying this file will have both good and bad effects,
-and whether it is acceptable or not will be depend on your situation.
+and whether it is acceptable or not depends on your situation.
 Don't modify this file if you are not sure.

 If your situation is:
@@ -600,19 +609,15 @@ to allocate a page of memory for that task.
 If a cpuset has its 'cpus' modified, then each task in that cpuset
 will have its allowed CPU placement changed immediately. Similarly,
-if a tasks pid is written to a cpusets 'tasks' file, in either its
-current cpuset or another cpuset, then its allowed CPU placement is
-changed immediately. If such a task had been bound to some subset
-of its cpuset using the sched_setaffinity() call, the task will be
-allowed to run on any CPU allowed in its new cpuset, negating the
-affect of the prior sched_setaffinity() call.
+if a tasks pid is written to another cpusets 'tasks' file, then its
+allowed CPU placement is changed immediately. If such a task had been
+bound to some subset of its cpuset using the sched_setaffinity() call,
+the task will be allowed to run on any CPU allowed in its new cpuset,
+negating the effect of the prior sched_setaffinity() call.

 In summary, the memory placement of a task whose cpuset is changed is
 updated by the kernel, on the next allocation of a page for that task,
-but the processor placement is not updated, until that tasks pid is
-rewritten to the 'tasks' file of its cpuset. This is done to avoid
-impacting the scheduler code in the kernel with a check for changes
-in a tasks processor placement.
+and the processor placement is updated immediately.

 Normally, once a page is allocated (given a physical page
 of main memory) then that page stays on whatever node it
@@ -681,10 +686,14 @@ and then start a subshell 'sh' in that cpuset:
   # The next line should display '/Charlie'
   cat /proc/self/cpuset

-In the future, a C library interface to cpusets will likely be
-available. For now, the only way to query or modify cpusets is
-via the cpuset file system, using the various cd, mkdir, echo, cat,
-rmdir commands from the shell, or their equivalent from C.
+There are ways to query or modify cpusets:
+ - via the cpuset file system directly, using the various cd, mkdir, echo,
+   cat, rmdir commands from the shell, or their equivalent from C.
+ - via the C library libcpuset.
+ - via the C library libcgroup.
+   (http://sourceforge.net/proects/libcg/)
+ - via the python application cset.
+   (http://developer.novell.com/wiki/index.php/Cpuset)

 The sched_setaffinity calls can also be done at the shell prompt using
 SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
@@ -756,7 +765,7 @@ mount -t cpuset X /dev/cpuset

 is equivalent to

-mount -t cgroup -ocpuset X /dev/cpuset
+mount -t cgroup -ocpuset,noprefix X /dev/cpuset
 echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent

 2.2 Adding/removing cpus
-- cgit v1.2.3

From caa790ba6cb88dccfab356960d93e2f4e0bd8704 Mon Sep 17 00:00:00 2001
From: Chris Samuel
Date: Sat, 17 Jan 2009 00:01:18 +1100
Subject: trivial: cgroups: documentation typo and spelling corrections

Minor typo and spelling corrections fixed whilst reading to learn about
cgroups capabilities.

Signed-off-by: Chris Samuel
Acked-by: Paul Menage
Signed-off-by: Jiri Kosina
---
 Documentation/cgroups/cgroups.txt | 10 +++++-----
 Documentation/cgroups/cpusets.txt | 12 ++++++------
 Documentation/cgroups/devices.txt | 2 +-
 Documentation/cgroups/memory.txt | 2 +-
 4 files changed, 13 insertions(+), 13 deletions(-)

(limited to 'Documentation/cgroups')

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 93feb8444489..f4f5ee97d4db 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -56,7 +56,7 @@ hierarchy, and a set of subsystems; each subsystem has system-specific
 state attached to each cgroup in the hierarchy. Each hierarchy has
 an instance of the cgroup virtual filesystem associated with it.

-At any one time there may be multiple active hierachies of task
+At any one time there may be multiple active hierarchies of task
 cgroups. Each hierarchy is a partition of all tasks in the system.
 User level code may create and destroy cgroups by name in an
@@ -124,10 +124,10 @@ following lines:
 / \
 Prof (15%) students (5%)

-Browsers like firefox/lynx go into the WWW network class, while (k)nfsd go
+Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd go
 into NFS network class.

-At the same time firefox/lynx will share an appropriate CPU/Memory class
+At the same time Firefox/Lynx will share an appropriate CPU/Memory class
 depending on who launched it (prof/student).

 With the ability to classify tasks differently for different resources
@@ -325,7 +325,7 @@ and then start a subshell 'sh' in that cgroup:
 Creating, modifying, using the cgroups can be done through the cgroup
 virtual filesystem.

-To mount a cgroup hierarchy will all available subsystems, type:
+To mount a cgroup hierarchy with all available subsystems, type:
 # mount -t cgroup xxx /dev/cgroup

 The "xxx" is not interpreted by the cgroup code, but will appear in
@@ -521,7 +521,7 @@ always handled well.
 void post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp)
 (cgroup_mutex held by caller)

-Called at the end of cgroup_clone() to do any paramater
+Called at the end of cgroup_clone() to do any parameter
 initialization which might be required before a task could attach. For
 example in cpusets, no task may attach before 'cpus' and 'mems' are set
 up.
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index 0611e9528c7c..f9ca389dddf4 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -131,7 +131,7 @@ Cpusets extends these two mechanisms as follows:
  - The hierarchy of cpusets can be mounted at /dev/cpuset, for
    browsing and manipulation from user space.
  - A cpuset may be marked exclusive, which ensures that no other
-   cpuset (except direct ancestors and descendents) may contain
+   cpuset (except direct ancestors and descendants) may contain
    any overlapping CPUs or Memory Nodes.
  - You can list all the tasks (by pid) attached to any cpuset.

@@ -226,7 +226,7 @@ nodes with memory--using the cpuset_track_online_nodes() hook.
 --------------------------------

 If a cpuset is cpu or mem exclusive, no other cpuset, other than
-a direct ancestor or descendent, may share any of the same CPUs or
+a direct ancestor or descendant, may share any of the same CPUs or
 Memory Nodes.

 A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled",
@@ -427,7 +427,7 @@ child cpusets have this flag enabled.
 When doing this, you don't usually want to leave any unpinned tasks in
 the top cpuset that might use non-trivial amounts of CPU, as such tasks
 may be artificially constrained to some subset of CPUs, depending on
-the particulars of this flag setting in descendent cpusets. Even if
+the particulars of this flag setting in descendant cpusets. Even if
 such a task could use spare CPU cycles in some other CPUs, the kernel
 scheduler might not consider the possibility of load balancing that
 task to that underused CPU.
@@ -531,9 +531,9 @@ be idle.

 Of course it takes some searching cost to find movable tasks and/or
 idle CPUs, the scheduler might not search all CPUs in the domain
-everytime. In fact, in some architectures, the searching ranges on
+every time. In fact, in some architectures, the searching ranges on
 events are limited in the same socket or node where the CPU locates,
-while the load balance on tick searchs all.
+while the load balance on tick searches all.

 For example, assume CPU Z is relatively far from CPU X. Even if CPU Z
 is idle while CPU X and the siblings are busy, scheduler can't migrate
@@ -601,7 +601,7 @@ its new cpuset, then the task will continue to use whatever subset
 of MPOL_BIND nodes are still allowed in the new cpuset.
 If the task
 was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
 in the new cpuset, then the task will be essentially treated as if it
-was MPOL_BIND bound to the new cpuset (even though its numa placement,
+was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
 as queried by get_mempolicy(), doesn't change). If a task is moved
 from one cpuset to another, then the kernel will adjust the tasks
 memory placement, as above, the next time that the kernel attempts
diff --git a/Documentation/cgroups/devices.txt b/Documentation/cgroups/devices.txt
index 7cc6e6a60672..57ca4c89fe5c 100644
--- a/Documentation/cgroups/devices.txt
+++ b/Documentation/cgroups/devices.txt
@@ -42,7 +42,7 @@ suffice, but we can decide the best way to adequately restrict
 movement as people get some experience with this. We may just want
 to require CAP_SYS_ADMIN, which at least is a separate bit from
 CAP_MKNOD. We may want to just refuse moving to a cgroup which
-isn't a descendent of the current one. Or we may want to use
+isn't a descendant of the current one. Or we may want to use
 CAP_MAC_ADMIN, since we really are trying to lock down root.

 CAP_SYS_ADMIN is needed to modify the whitelist or move another
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index e1501964df1e..a98a7fe7aabb 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -302,7 +302,7 @@ will be charged as a new owner of it.
  unevictable - # of pages cannot be reclaimed.(mlocked etc)

  Below is depend on CONFIG_DEBUG_VM.
- inactive_ratio - VM inernal parameter. (see mm/page_alloc.c)
+ inactive_ratio - VM internal parameter. (see mm/page_alloc.c)
  recent_rotated_anon - VM internal parameter. (see mm/vmscan.c)
  recent_rotated_file - VM internal parameter. (see mm/vmscan.c)
  recent_scanned_anon - VM internal parameter.
 (see mm/vmscan.c)
-- cgit v1.2.3

From 6d5e147dd034d9ceedc89fe39f4284700944f0c8 Mon Sep 17 00:00:00 2001
From: Thadeu Lima de Souza Cascardo
Date: Tue, 3 Feb 2009 11:57:13 +0100
Subject: trivial: Give the right path in Documentation example

While the Documentation example creates /cgroup/test, it removes
/test/cgroup, which is clearly not the intended path. Change that to
/cgroup/test.

Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Thadeu Lima de Souza Cascardo
Signed-off-by: Jiri Kosina
---
 Documentation/cgroups/memcg_test.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'Documentation/cgroups')

diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
index 523a9c16c400..a9263596f8d8 100644
--- a/Documentation/cgroups/memcg_test.txt
+++ b/Documentation/cgroups/memcg_test.txt
@@ -356,7 +356,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
  (Shell-B)
  # move all tasks in /cgroup/test to /cgroup
  # /sbin/swapoff -a
- # rmdir /test/cgroup
+ # rmdir /cgroup/test
  # kill malloc task.

  Of course, tmpfs v.s. swapoff test should be tested, too.
-- cgit v1.2.3

From 21acb9caa2e30b100e9a1943d995bb99d40f4035 Mon Sep 17 00:00:00 2001
From: Thadeu Lima de Souza Cascardo
Date: Wed, 4 Feb 2009 10:12:08 +0100
Subject: trivial: fix where cgroup documentation is not correctly referred to

cgroup documentation was moved to Documentation/cgroups/. There are some
places that still refer to Documentation/controllers/,
Documentation/cgroups.txt and Documentation/cpusets.txt. Fix those.
Signed-off-by: Thadeu Lima de Souza Cascardo
Reviewed-by: Li Zefan
Acked-by: Paul Menage
Signed-off-by: Jiri Kosina
---
 Documentation/cgroups/00-INDEX | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
 create mode 100644 Documentation/cgroups/00-INDEX

(limited to 'Documentation/cgroups')

diff --git a/Documentation/cgroups/00-INDEX b/Documentation/cgroups/00-INDEX
new file mode 100644
index 000000000000..3f58fa3d6d00
--- /dev/null
+++ b/Documentation/cgroups/00-INDEX
@@ -0,0 +1,18 @@
+00-INDEX
+ - this file
+cgroups.txt
+ - Control Groups definition, implementation details, examples and API.
+cpuacct.txt
+ - CPU Accounting Controller; account CPU usage for groups of tasks.
+cpusets.txt
+ - documents the cpusets feature; assign CPUs and Mem to a set of tasks.
+devices.txt
+ - Device Whitelist Controller; description, interface and security.
+freezer-subsystem.txt
+ - checkpointing; rationale to not use signals, interface.
+memcg_test.txt
+ - Memory Resource Controller; implementation details.
+memory.txt
+ - Memory Resource Controller; design, accounting, interface, testing.
+resource_counter.txt
+ - Resource Counter API.
-- cgit v1.2.3

From ec64f51545fffbc4cb968f0cea56341a4b07e85a Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki
Date: Thu, 2 Apr 2009 16:57:26 -0700
Subject: cgroup: fix frequent -EBUSY at rmdir

In the following situation, with the memory subsystem,

 /groupA use_hierarchy==1
   /01 some tasks
   /02 some tasks
   /03 some tasks
   /04 empty

When tasks under 01/02/03 hit the limit on /groupA, hierarchical reclaim
is triggered and the kernel walks the tree under groupA. In this case,
rmdir /groupA/04 frequently fails with -EBUSY because of a temporary
refcnt taken by the kernel.

In general, a cgroup can be rmdir'd if there are no children groups and
no tasks. Frequent failures of rmdir() are not useful to users.
(And the reason for -EBUSY is unknown to users in most cases.)

This patch tries to modify the above behavior, by
 - retrying if css_refcnt is held by someone.
 - adding a "return value" to pre_destroy(), which allows a subsystem to
   say "we're really busy!"

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Paul Menage
Cc: Li Zefan
Cc: Balbir Singh
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/cgroups/cgroups.txt | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

(limited to 'Documentation/cgroups')

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 93feb8444489..cdc46a501b85 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -476,11 +476,13 @@ cgroup->parent is still valid. (Note - can also be called for a
 newly-created cgroup if an error occurs after this subsystem's
 create() method has been called for the new cgroup).

-void pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
+int pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);

 Called before checking the reference count on each subsystem. This may
 be useful for subsystems which have some extra references even if
-there are not tasks in the cgroup.
+there are not tasks in the cgroup. If pre_destroy() returns error code,
+rmdir() will fail with it. From this behavior, pre_destroy() can be
+called multiple times against a cgroup.

 int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
   struct task_struct *task)
-- cgit v1.2.3

From b6719ec1ad54e47e40633b19703f2c1254708842 Mon Sep 17 00:00:00 2001
From: Li Zefan
Date: Thu, 2 Apr 2009 16:57:28 -0700
Subject: cgroups: more documentation for remount and release_agent

This won't remove cpuacct from the mounted hierarchy:
 # mount -t cgroup -o cpu,cpuacct xxx /mnt
 # mount -o remount,cpu /mnt

Because for this usage mount(8) will append the new options to the
original options.

And this will get it right:
 # mount [-t cgroup] -o remount,cpu xxx /mnt

Also document how to specify or change release_agent.
Signed-off-by: Li Zefan
Reviewed-by: KAMEZAWA Hiroyuki
Cc: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/cgroups/cgroups.txt | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

(limited to 'Documentation/cgroups')

diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index cdc46a501b85..4ea852345a47 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -333,12 +333,23 @@ The "xxx" is not interpreted by the cgroup code, but will appear in

 To mount a cgroup hierarchy with just the cpuset and numtasks
 subsystems, type:
-# mount -t cgroup -o cpuset,numtasks hier1 /dev/cgroup
+# mount -t cgroup -o cpuset,memory hier1 /dev/cgroup

 To change the set of subsystems bound to a mounted hierarchy, just
 remount with different options:
+# mount -o remount,cpuset,ns hier1 /dev/cgroup

-# mount -o remount,cpuset,ns /dev/cgroup
+Now memory is removed from the hierarchy and ns is added.
+
+Note this will add ns to the hierarchy but won't remove memory or
+cpuset, because the new options are appended to the old ones:
+# mount -o remount,ns /dev/cgroup
+
+To Specify a hierarchy's release_agent:
+# mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
+ xxx /dev/cgroup
+
+Note that specifying 'release_agent' more than once will return failure.

 Note that changing the set of subsystems is currently only supported
 when the hierarchy consists of a single (root) cgroup. Supporting
@@ -349,6 +360,11 @@ Then under /dev/cgroup you can find a tree that corresponds to the
 tree of the cgroups in the system. For instance, /dev/cgroup
 is the cgroup that holds the whole system.

+If you want to change the value of release_agent:
+# echo "/sbin/new_release_agent" > /dev/cgroup/release_agent
+
+It can also be changed via remount.
+
 If you want to create a new cgroup under /dev/cgroup:
 # cd /dev/cgroup
 # mkdir my_cgroup
-- cgit v1.2.3

From 0b7f569e45bb6be142d87017030669a6a7d327a1 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki
Date: Thu, 2 Apr 2009 16:57:38 -0700
Subject: memcg: fix OOM killer under memcg

This patch tries to fix OOM Killer problems caused by hierarchy. Now,
memcg itself has an OOM KILL function (in oom_kill.c) and tries to kill
a task in memcg.

But, when hierarchy is used, it's broken and the correct task cannot be
killed. For example, in the following cgroup

 /groupA/ hierarchy=1, limit=1G,
   01 nolimit
   02 nolimit

All tasks' memory usage under /groupA, /groupA/01, /groupA/02 is limited
to groupA's 1Gbytes but OOM Killer just kills tasks in groupA.

This patch makes the bad process be selected from all tasks under the
hierarchy. BTW, currently, oom_jiffies is updated against groupA in the
above case. oom_jiffies of the tree should be updated.

To see how oom_jiffies is used, please check mem_cgroup_oom_called()
callers.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: const fix]
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Paul Menage
Cc: Li Zefan
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/cgroups/memcg_test.txt | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

(limited to 'Documentation/cgroups')

diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
index 523a9c16c400..8a11caf417a0 100644
--- a/Documentation/cgroups/memcg_test.txt
+++ b/Documentation/cgroups/memcg_test.txt
@@ -1,5 +1,5 @@
 Memory Resource Controller(Memcg) Implementation Memo.
-Last Updated: 2009/1/19
+Last Updated: 2009/1/20
 Base Kernel Version: based on 2.6.29-rc2.

 Because VM is getting complex (one of reasons is memcg...), memcg's behavior
@@ -360,3 +360,21 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
  # kill malloc task.

  Of course, tmpfs v.s.
 swapoff test should be tested, too.
+
+ 9.8 OOM-Killer
+ Out-of-memory caused by memcg's limit will kill tasks under
+ the memcg. When hierarchy is used, a task under hierarchy
+ will be killed by the kernel.
+ In this case, panic_on_oom shouldn't be invoked and tasks
+ in other groups shouldn't be killed.
+
+ It's not difficult to cause OOM under memcg as following.
+ Case A) when you can swapoff
+ #swapoff -a
+ #echo 50M > /memory.limit_in_bytes
+ run 51M of malloc
+
+ Case B) when you use mem+swap limitation.
+ #echo 50M > memory.limit_in_bytes
+ #echo 50M > memory.memsw.limit_in_bytes
+ run 51M of malloc
-- cgit v1.2.3
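
Editor's note on section 9.8 above: the Case A recipe leaves the cgroup
path and the allocating program implicit. A minimal sketch of how it might
be scripted follows. The group path /cgroup/test (borrowed from the earlier
memcg_test.txt example), the helper names step and memcg_oom_case_a, the
DRY_RUN switch and the ./malloc_51m program are all assumptions made for
illustration; none of them come from the patch itself.

```shell
#!/bin/sh
# Sketch of "Case A" from section 9.8: provoke a memcg OOM with swap off.
# ASSUMPTIONS (not from the patch): the memory controller is mounted so
# that the test group lives at /cgroup/test, and ./malloc_51m is a
# hypothetical program that allocates and touches 51M of memory.
CG=${CG:-/cgroup/test}
HOG=${HOG:-./malloc_51m}

# Print each command instead of running it unless DRY_RUN=0 is set,
# because the real steps need root and disable swap system-wide.
step() {
    if [ "${DRY_RUN:-1}" = 0 ]; then
        sh -c "$1"
    else
        printf '%s\n' "$1"
    fi
}

memcg_oom_case_a() {
    step "swapoff -a" &&
    step "echo 50M > $CG/memory.limit_in_bytes" &&
    # A child shell moves itself into the group, then execs the hog so
    # the allocating process is charged to that group.
    step "sh -c 'echo \$\$ > $CG/tasks && exec $HOG'"
}

memcg_oom_case_a
```

With the defaults the script only prints the three commands; DRY_RUN=0
would run them as root. For Case B, the swapoff step would instead be a
second write of 50M to memory.memsw.limit_in_bytes, as the patch text shows.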