--- zzzz-none-000/linux-3.10.107/Documentation/cgroups/memory.txt 2017-06-27 09:49:32.000000000 +0000 +++ scorpion-7490-727/linux-3.10.107/Documentation/cgroups/memory.txt 2021-02-04 17:41:59.000000000 +0000 @@ -1,5 +1,10 @@ Memory Resource Controller +NOTE: This document is hopelessly outdated and it asks for a complete + rewrite. It still contains a useful information so we are keeping it + here but make sure to check the current code if you need a deeper + understanding. + NOTE: The Memory Resource Controller has generically been referred to as the memory controller in this document. Do not confuse memory controller used here with the memory controller that is used in hardware. @@ -52,9 +57,9 @@ tasks # attach a task(thread) and show list of threads cgroup.procs # show list of processes cgroup.event_control # an interface for event_fd() - memory.usage_in_bytes # show current res_counter usage for memory + memory.usage_in_bytes # show current usage for memory (See 5.5 for details) - memory.memsw.usage_in_bytes # show current res_counter usage for memory+Swap + memory.memsw.usage_in_bytes # show current usage for memory+Swap (See 5.5 for details) memory.limit_in_bytes # set/show limit of memory usage memory.memsw.limit_in_bytes # set/show limit of memory+Swap usage @@ -116,16 +121,16 @@ 2.1. Design -The core of the design is a counter called the res_counter. The res_counter -tracks the current memory usage and limit of the group of processes associated -with the controller. Each cgroup has a memory controller specific data -structure (mem_cgroup) associated with it. +The core of the design is a counter called the page_counter. The +page_counter tracks the current memory usage and limit of the group of +processes associated with the controller. Each cgroup has a memory controller +specific data structure (mem_cgroup) associated with it. 2.2. Accounting +--------------------+ - | mem_cgroup | - | (res_counter) | + | mem_cgroup | + | (page_counter) | +--------------------+ / ^ \ / | \ @@ -304,7 +309,7 @@ memory usage is too high. * slab pages: pages allocated by the SLAB or SLUB allocator are tracked. A copy -of each kmem_cache is created everytime the cache is touched by the first time +of each kmem_cache is created every time the cache is touched by the first time from inside the memcg. The creation is done lazily, so some objects can still be skipped while the cache is being created. All objects in a slab page should belong to the same memcg. This only fails to hold when a task is migrated to a @@ -316,7 +321,7 @@ * tcp memory pressure: sockets memory pressure for the tcp protocol. -2.7.3 Common use cases +2.7.2 Common use cases Because the "kmem" counter is fed to the main user counter, kernel memory can never be limited completely independently of user memory. Say "U" is the user @@ -335,6 +340,9 @@ In this case, the admin could set up K so that the sum of all groups is never greater than the total memory, and freely set U at the cost of his QoS. + WARNING: In the current implementation, memory reclaim will NOT be + triggered for a cgroup when it hits K while staying below U, which makes + this setup impractical. U != 0, K >= U: Since kmem charges will also be fed to the user counter and reclaim will be @@ -344,20 +352,19 @@ 3. User Interface -0. Configuration +3.0. Configuration a. Enable CONFIG_CGROUPS -b. Enable CONFIG_RESOURCE_COUNTERS -c. Enable CONFIG_MEMCG -d. Enable CONFIG_MEMCG_SWAP (to use swap extension) +b. Enable CONFIG_MEMCG +c. Enable CONFIG_MEMCG_SWAP (to use swap extension) d. Enable CONFIG_MEMCG_KMEM (to use kmem extension) -1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?) +3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?) # mount -t tmpfs none /sys/fs/cgroup # mkdir /sys/fs/cgroup/memory # mount -t cgroup none /sys/fs/cgroup/memory -o memory -2. Make the new group and move bash into it +3.2. Make the new group and move bash into it # mkdir /sys/fs/cgroup/memory/0 # echo $$ > /sys/fs/cgroup/memory/0/tasks @@ -453,15 +460,11 @@ 5.1 force_empty memory.force_empty interface is provided to make cgroup's memory usage empty. - You can use this interface only when the cgroup has no tasks. When writing anything to this # echo 0 > memory.force_empty - Almost all pages tracked by this memory cgroup will be unmapped and freed. - Some pages cannot be freed because they are locked or in-use. Such pages are - moved to parent (if use_hierarchy==1) or root (if use_hierarchy==0) and this - cgroup will be empty. + the cgroup will be reclaimed and as many pages reclaimed as possible. The typical use case for this interface is before calling rmdir(). Because rmdir() moves all pages to parent, some out-of-use page caches can be @@ -490,10 +493,13 @@ pgpgout - # of uncharging events to the memory cgroup. The uncharging event happens each time a page is unaccounted from the cgroup. swap - # of bytes of swap usage -inactive_anon - # of bytes of anonymous memory and swap cache memory on +dirty - # of bytes that are waiting to get written back to the disk. +writeback - # of bytes of file/anon cache that are queued for syncing to + disk. +inactive_anon - # of bytes of anonymous and swap cache memory on inactive LRU list. active_anon - # of bytes of anonymous and swap cache memory on active - inactive LRU list. + LRU list. inactive_file - # of bytes of file-backed memory on inactive LRU list. active_file - # of bytes of file-backed memory on active LRU list. unevictable - # of bytes of memory that cannot be reclaimed (mlocked etc). @@ -533,16 +539,13 @@ 5.3 swappiness -Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. -Please note that unlike the global swappiness, memcg knob set to 0 -really prevents from any swapping even if there is a swap storage -available. This might lead to memcg OOM killer if there are no file -pages to reclaim. - -Following cgroups' swappiness can't be changed. -- root cgroup (uses /proc/sys/vm/swappiness). -- a cgroup which uses hierarchy and it has other cgroup(s) below it. -- a cgroup which uses hierarchy and not the root of hierarchy. +Overrides /proc/sys/vm/swappiness for the particular group. The tunable +in the root cgroup corresponds to the global swappiness setting. + +Please note that unlike during the global reclaim, limit reclaim +enforces that 0 swappiness really prevents from any swapping even if +there is a swap storage available. This might lead to memcg OOM killer +if there are no file pages to reclaim. 5.4 failcnt @@ -571,15 +574,19 @@ node. One of the use cases is evaluating application performance by combining this information with the application's CPU allocation. -We export "total", "file", "anon" and "unevictable" pages per-node for -each memcg. The ouput format of memory.numa_stat is: +Each memcg's numa_stat file includes "total", "file", "anon" and "unevictable" +per-node page counts including "hierarchical_" which sums up all +hierarchical children's values in addition to the memcg's own value. + +The output format of memory.numa_stat is: total= N0= N1= ... file= N0= N1= ... anon= N0= N1= ... unevictable= N0= N1= ... +hierarchical_= N0= N1= ... -And we have total = file + anon + unevictable. +The "total" count is sum of file + anon + unevictable. 6. Hierarchy support @@ -664,7 +671,7 @@ 8.1 Interface -This feature is disabled by default. It can be enabledi (and disabled again) by +This feature is disabled by default. It can be enabled (and disabled again) by writing to memory.move_charge_at_immigrate of the destination cgroup. If you want to enable it: @@ -748,7 +755,6 @@ #echo 1 > memory.oom_control -This operation is only allowed to the top cgroup of a sub-hierarchy. If OOM-killer is disabled, tasks under cgroup will hang/sleep in memory cgroup's OOM-waitqueue when they request accountable memory. @@ -834,10 +840,9 @@ 12. TODO -1. Add support for accounting huge pages (as a separate controller) -2. Make per-cgroup scanner reclaim not-shared pages first -3. Teach controller to account for shared-pages -4. Start reclamation in the background when the limit is +1. Make per-cgroup scanner reclaim not-shared pages first +2. Teach controller to account for shared-pages +3. Start reclamation in the background when the limit is not yet hit but the usage is getting closer Summary