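[Repost] Using cgroups
- October 4, 2019
- Notes
Original article: http://tiewei.github.io/devops/howto-use-cgroup/
While introducing Docker, I mentioned that LXC relies on cgroups to enforce resource quotas and control. This article covers how to use cgroups and the commands for operating them. Most of the content comes from: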
[2]https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt
cgroup
The purpose of cgroups is to partition a machine's resources (CPU, memory, network) so that processes do not compete for them in harmful ways.
Terminology
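- cgroup – associates a set of tasks with configuration parameters for a set of subsystems. A task corresponds to a process, and the cgroup is the smallest unit of resource partitioning.
- subsystem – a resource controller; each subsystem manages one kind of resource, such as cpu, cpuset, or memory.
- hierarchy – associates one or more subsystems with a tree of cgroups. Unlike a cgroup, a hierarchy holds the set of manageable subsystems rather than concrete parameter values.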
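As this shows, cgroups manage resources in a tree structure, much like processes.
Similarity – both are layered: a child process/cgroup inherits from its parent process/cgroup.
Difference – processes form a single-rooted tree (rooted at pid=0), whereas cgroups as a whole form a forest of trees (each rooted at a hierarchy).
A typical hierarchy mount directory looks like this: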
/cgroup/
├── blkio                        <--------------- hierarchy/root cgroup
│   ├── blkio.io_merged          <--------------- subsystem parameter
...
│   ├── blkio.weight
│   ├── blkio.weight_device
│   ├── cgroup.event_control
│   ├── cgroup.procs
│   ├── lxc                      <--------------- cgroup
│   │   ├── blkio.io_merged      <--------------- subsystem parameter
│   │   ├── blkio.io_queued
...
│   │   └── tasks                <--------------- task list
│   ├── notify_on_release
│   ├── release_agent
│   └── tasks
...
Subsystem list
RHEL/CentOS supports the following subsystems:
- blkio – block I/O quotas >> this subsystem sets limits on input/output access to and from block devices such as physical drives (disk, solid state, USB, etc.).
- cpu – CPU time allocation limits >> this subsystem uses the scheduler to provide cgroup tasks access to the CPU.
- cpuacct – CPU usage reporting >> this subsystem generates automatic reports on CPU resources used by tasks in a cgroup.
- cpuset – CPU binding >> this subsystem assigns individual CPUs (on a multicore system) and memory nodes to tasks in a cgroup.
- devices – device access restrictions >> this subsystem allows or denies access to devices by tasks in a cgroup.
- freezer – suspend/resume a cgroup >> this subsystem suspends or resumes tasks in a cgroup.
- memory – memory limits >> this subsystem sets limits on memory use by tasks in a cgroup, and generates automatic reports on memory resources used by those tasks.
- net_cls – network limits in combination with tc >> this subsystem tags network packets with a class identifier (classid) that allows the Linux traffic controller (tc) to identify packets originating from a particular cgroup task.
- net_prio – network interface priority >> this subsystem provides a way to dynamically set the priority of network traffic per network interface.
- ns – namespace restrictions >> the namespace subsystem.
cgroup rules and operations
Rules
1. A hierarchy can have multiple subsystems (several subsystems can be attached to a hierarchy at mount time)
A single hierarchy can have one or more subsystems attached to it.
eg.
mount -t cgroup -o cpu,cpuset,memory cpu_and_mem /cgroup/cpu_and_mem

2. A subsystem that is already mounted can only be mounted again on an empty hierarchy (a hierarchy that already has a subsystem attached cannot attach a subsystem that is attached to another hierarchy)
Any single subsystem (such as cpu) cannot be attached to more than one hierarchy if one of those hierarchies has a different subsystem attached to it already.

3. Within a single hierarchy, each task can be in exactly one cgroup (a process's pid cannot appear in the tasks files of more than one cgroup in the same hierarchy)
Each time a new hierarchy is created on the system, all tasks on the system are initially members of the default cgroup of that hierarchy, which is known as the root cgroup. For any single hierarchy you create, each task on the system can be a member of exactly one cgroup in that hierarchy. A single task may be in multiple cgroups, as long as each of those cgroups is in a different hierarchy. As soon as a task becomes a member of a second cgroup in the same hierarchy, it is removed from the first cgroup in that hierarchy. At no time is a task ever in two different cgroups in the same hierarchy.
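Rule 3 can be pictured with a small in-memory model. The sketch below is only an illustration, not how the kernel stores membership: the Hierarchy class and its method names are hypothetical, and the PID and group names are borrowed from the examples in this article.

```python
class Hierarchy:
    """Toy model of one hierarchy: each task belongs to exactly one cgroup."""

    def __init__(self, name):
        self.name = name
        self.membership = {}  # pid -> cgroup path within this hierarchy

    def add_task(self, pid):
        # A task that is new to the hierarchy starts in the root cgroup.
        self.membership.setdefault(pid, "/")

    def move(self, pid, cgroup):
        # Assigning a new cgroup overwrites the old one, so a task can
        # never be in two cgroups of the same hierarchy at once.
        self.membership[pid] = cgroup

    def tasks(self, cgroup):
        return sorted(p for p, c in self.membership.items() if c == cgroup)


cpu = Hierarchy("cpu")
mem = Hierarchy("memory")
for hierarchy in (cpu, mem):
    hierarchy.add_task(1701)

cpu.move(1701, "/group1")
cpu.move(1701, "/group2")  # implicitly leaves /group1
mem.move(1701, "/lab1")    # membership in another hierarchy is independent
```

After these calls the task sits in /group2 of the cpu hierarchy and in /lab1 of the memory hierarchy, and /group1 is empty again.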

4. A child process automatically inherits its parent's cgroup when it is forked, but after the fork it can be moved to other cgroups as needed
Any process (task) on the system which forks itself creates a child task. A child task automatically inherits the cgroup membership of its parent but can be moved to different cgroups as needed. Once forked, the parent and child processes are completely independent.

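5. Other notes
- The only way to limit a task is to add it to a cgroup's tasks file
- Multiple subsystems can be mounted on one hierarchy, and the subsystem parameters of different cgroups then apply different quotas to different tasks
- If a hierarchy has too many subsystems, consider restructuring by moving subsystems to separate hierarchies; conversely, several hierarchies can be merged into one
- Because a hierarchy can mount just a few subsystems, a task can be limited in a single respect; and because a task can join cgroups in several hierarchies, multiple resources can be controlled at once
Operations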
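1. Mounting subsystems
- Using the cgconfig service and its configuration file /etc/cgconfig.conf (mounted automatically when the service starts)
subsystem = /cgroup/hierarchy;
- From the command line
mount -t cgroup -o subsystems name /cgroup/name
To unmount:
umount /cgroup/name
eg. Mount the four subsystems cpuset, cpu, cpuacct, and memory on the /cgroup/cpu_and_mem directory (hierarchy)
mount {
    cpuset = /cgroup/cpu_and_mem;
    cpu = /cgroup/cpu_and_mem;
    cpuacct = /cgroup/cpu_and_mem;
    memory = /cgroup/cpu_and_mem;
}
or
mount -t cgroup -o remount,cpu,cpuset,cpuacct,memory cpu_and_mem /cgroup/cpu_and_mem
(the remount option changes the subsystems of a hierarchy that is already mounted; omit it for a first-time mount)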
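The two ways of mounting describe the same thing: a list of subsystems attached to one mount point. As a rough sketch, the helpers below build the mount command line and the matching cgconfig.conf mount section from such a list. The function names are made up for this example, and nothing is actually mounted.

```python
def mount_command(subsystems, name, mount_point):
    """Return the argv for mounting the given subsystems as one hierarchy."""
    if not subsystems:
        raise ValueError("at least one subsystem is required")
    return ["mount", "-t", "cgroup", "-o", ",".join(subsystems), name, mount_point]


def cgconfig_mount_block(subsystems, mount_point):
    """Render the equivalent mount { ... } section for /etc/cgconfig.conf."""
    lines = ["mount {"]
    lines += ["    {} = {};".format(s, mount_point) for s in subsystems]
    lines.append("}")
    return "\n".join(lines)


subsystems = ["cpuset", "cpu", "cpuacct", "memory"]
print(" ".join(mount_command(subsystems, "cpu_and_mem", "/cgroup/cpu_and_mem")))
print(cgconfig_mount_block(subsystems, "/cgroup/cpu_and_mem"))
```

Returning the command as a list keeps it ready for subprocess.run without shell quoting; running it would require root privileges and a kernel with cgroup v1 support.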
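2. Creating/deleting cgroups
- Using the cgconfig service and its configuration file /etc/cgconfig.conf (applied automatically when the service starts)
group <name> {
    [<permissions>]
    <controller> {
        <param name> = <param value>;
        …
    }
    …
}
- From the command line
- Create (option 1)
cgcreate -t uid:gid -a uid:gid -g subsystems:path
- Create (option 2)
mkdir /cgroup/hierarchy/name/child_name
- Delete (option 1)
cgdelete subsystems:path
(use -r to delete recursively)
- Delete (option 2)
rmdir /cgroup/hierarchy/name/child_name
(when the cgconfig service is not running; use rmdir rather than rm -rf, because the control files inside a cgroup directory cannot be unlinked)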
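Since a mounted hierarchy is just a directory tree, creating and deleting cgroups by hand comes down to directory operations. The sketch below shows this in Python. The helper names are hypothetical, and the demo runs against a temporary directory standing in for /cgroup/hierarchy, so it needs neither root nor a real cgroup mount.

```python
import os
import tempfile


def create_cgroup(hierarchy_root, path):
    """Create a (possibly nested) cgroup directory under a mounted hierarchy."""
    full = os.path.join(hierarchy_root, path)
    os.makedirs(full, exist_ok=True)
    return full


def delete_cgroup(hierarchy_root, path):
    """Remove one cgroup with rmdir semantics; fails if child cgroups remain."""
    os.rmdir(os.path.join(hierarchy_root, path))


# Demo: a temporary directory plays the role of /cgroup/hierarchy.
demo_root = tempfile.mkdtemp()
created = create_cgroup(demo_root, "name/child_name")
delete_cgroup(demo_root, "name/child_name")
```

On a real cgroup filesystem the kernel fills each new directory with control files, and removal also fails while tasks are still attached to the cgroup.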
3. Permission management
- Using the cgconfig service and its configuration file /etc/cgconfig.conf (applied automatically when the service starts)
perm {
    task {
        uid = <task user>;
        gid = <task group>;
    }
    admin {
        uid = <admin name>;
        gid = <admin group>;
    }
}
- From the command line
chown
eg.
group daemons {
    cpuset {
        cpuset.mems = 0;
        cpuset.cpus = 0;
    }
}
group daemons/sql {
    perm {
        task {
            uid = root;
            gid = sqladmin;
        }
        admin {
            uid = root;
            gid = root;
        }
    }
    cpuset {
        cpuset.mems = 0;
        cpuset.cpus = 0;
    }
}
or
~]$ mkdir -p /cgroup/red/daemons/sql
~]$ chown root:root /cgroup/red/daemons/sql/*
~]$ chown root:sqladmin /cgroup/red/daemons/sql/tasks
~]$ echo 0 > /cgroup/red/daemons/cpuset.mems
~]$ echo 0 > /cgroup/red/daemons/cpuset.cpus
~]$ echo 0 > /cgroup/red/daemons/sql/cpuset.mems
~]$ echo 0 > /cgroup/red/daemons/sql/cpuset.cpus
4. Setting cgroup parameters
- Command line (option 1)
cgset -r parameter=value path_to_cgroup
- Command line (option 2)
cgset --copy-from path_to_source_cgroup path_to_target_cgroup
- Via file
echo value > path_to_cgroup/parameter
eg.
cgset -r cpuset.cpus=0-1 group1
cgset --copy-from group1/ group2/
echo 0-1 > /cgroup/cpuset/group1/cpuset.cpus
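The file-based form (echo value > path_to_cgroup/parameter) is easy to reproduce from a program. The following is a minimal sketch with hypothetical helper names. The demo writes to a temporary directory in place of /cgroup/cpuset/group1, so the kernel-side validation that a real cgroup file performs is not exercised.

```python
import os
import tempfile


def set_parameter(cgroup_dir, parameter, value):
    """Equivalent of: echo value > cgroup_dir/parameter"""
    with open(os.path.join(cgroup_dir, parameter), "w") as f:
        f.write("{}\n".format(value))


def get_parameter(cgroup_dir, parameter):
    """Equivalent of: cat cgroup_dir/parameter"""
    with open(os.path.join(cgroup_dir, parameter)) as f:
        return f.read().strip()


# Demo: a temporary directory stands in for /cgroup/cpuset/group1.
demo_dir = tempfile.mkdtemp()
set_parameter(demo_dir, "cpuset.cpus", "0-1")
```

On a real hierarchy an invalid value makes the write fail with an OSError, which is usually the quickest way to notice a bad setting.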
5. Adding tasks
- Add processes from the command line
cgclassify -g subsystems:path_to_cgroup pidlist
- Add a process via file
echo pid > path_to_cgroup/tasks
- Start a process in a cgroup
cgexec -g subsystems:path_to_cgroup command arguments
- Start a service in a cgroup
echo 'CGROUP_DAEMON="subsystem:control_group"' >> /etc/sysconfig/<service>
- Classify processes automatically with the cgrulesengd service, using rules in the configuration file /etc/cgrules.conf of the form
user<:command> subsystems control_group
where:
+ all processes of user are placed in control_group for the listed subsystems
+ <:command> is optional and restricts the rule to a specific command
+ user can be written as @group to match a user group instead of a single user
+ * matches everything
+ % means the same as the corresponding field on the previous line
eg.
cgclassify -g cpu,memory:group1 1701 1138
echo -e "1701\n1138" | tee -a /cgroup/cpu/group1/tasks /cgroup/memory/group1/tasks
cgexec -g cpu:group1 lynx http://www.redhat.com
sh -c "echo \$$ > /cgroup/lab1/group1/tasks && lynx http://www.redhat.com"
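When PIDs are written to a tasks file from a program, the kernel documentation cited at the top of this article notes that only one task can be attached per write. The sketch below follows that rule by issuing a separate write for each PID. The helper name is hypothetical, and the demo appends to a file in a temporary directory instead of a real tasks file.

```python
import os
import tempfile


def add_tasks(cgroup_dir, pids):
    """Append each PID to the tasks file with its own write call."""
    tasks_file = os.path.join(cgroup_dir, "tasks")
    for pid in pids:
        # Reopen the file for every PID so that each write carries a
        # single PID, instead of sending one joined string.
        with open(tasks_file, "a") as f:
            f.write("{}\n".format(pid))


# Demo: a temporary directory stands in for /cgroup/cpu/group1.
demo_dir = tempfile.mkdtemp()
add_tasks(demo_dir, [1701, 1138])
with open(os.path.join(demo_dir, "tasks")) as f:
    written = f.read()
```

Against a real hierarchy a write fails with an OSError if the PID does not exist or the caller lacks permission on the tasks file.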
Restricting specific services through /etc/cgrules.conf
maria devices /usergroup/staff
maria:ftp devices /usergroup/staff/ftp
@student cpu,memory /usergroup/student/
% memory /test2/
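To make the rule syntax concrete, here is a small parser for the three-column format described above, including the @group, user:command, and % forms. It is a simplified sketch and not the parser that cgrulesengd uses: the function name is made up, and treating # as a comment marker is an assumption.

```python
def parse_cgrules(text):
    """Parse cgrules.conf-style lines into dicts, resolving '%' fields."""
    rules = []
    previous = None  # raw fields of the last rule, used to resolve '%'
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        fields = line.split()
        if len(fields) != 3:
            raise ValueError("expected 3 fields, got: {!r}".format(raw))
        if "%" in fields:
            if previous is None:
                raise ValueError("'%' cannot be used on the first rule")
            fields = [p if f == "%" else f for f, p in zip(fields, previous)]
        previous = fields
        who, subsystems, destination = fields
        user, _, command = who.partition(":")
        rules.append({
            "user": user,
            "is_group": user.startswith("@"),
            "command": command or None,
            "subsystems": subsystems.split(","),
            "destination": destination,
        })
    return rules


example = """
maria devices /usergroup/staff
maria:ftp devices /usergroup/staff/ftp
@student cpu,memory /usergroup/student/
% memory /test2/
"""
rules = parse_cgrules(example)
```

In the last rule the % resolves to @student, so members of that group are additionally placed in /test2/ for the memory subsystem.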
6. Other
- cgsnapshot generates the contents of /etc/cgconfig.conf from the current cgroup state
cgsnapshot [-s] [-b FILE] [-w FILE] [-f FILE] [controller]
-b, --blacklist=FILE Set the blacklist configuration file (default /etc/cgsnapshot_blacklist.conf)
-f, --file=FILE Redirect the output to output_file
-s, --silent Ignore all warnings
-t, --strict Don't show the variables which are not on the whitelist
-w, --whitelist=FILE Set the whitelist configuration file (not used by default)
- See which cgroup a process belongs to
ps -O cgroup
or
cat /proc/<PID>/cgroup
- See where subsystems are mounted
cat /proc/cgroups
lssubsys -m <subsystems>
- List cgroups
lscgroup
- Show cgroup parameter values
cgget -r parameter list_of_cgroups
cgget -g <controllers>:<path>
- cgclear deletes a hierarchy and all of its cgroups
- Event notification API – currently supports only memory.oom_control
- More
- man 1 cgclassify — the cgclassify command is used to move running tasks to one or more cgroups.
- man 1 cgclear — the cgclear command is used to delete all cgroups in a hierarchy.
- man 5 cgconfig.conf — cgroups are defined in the cgconfig.conf file.
- man 8 cgconfigparser — the cgconfigparser command parses the cgconfig.conf file and mounts hierarchies.
- man 1 cgcreate — the cgcreate command creates new cgroups in hierarchies.
- man 1 cgdelete — the cgdelete command removes specified cgroups.
- man 1 cgexec — the cgexec command runs tasks in specified cgroups.
- man 1 cgget — the cgget command displays cgroup parameters.
- man 1 cgsnapshot — the cgsnapshot command generates a configuration file from existing subsystems.
- man 5 cgred.conf — cgred.conf is the configuration file for the cgred service.
- man 5 cgrules.conf — cgrules.conf contains the rules used for determining when tasks belong to certain cgroups.
- man 8 cgrulesengd — the cgrulesengd service distributes tasks to cgroups.
- man 1 cgset — the cgset command sets parameters for a cgroup.
- man 1 lscgroup — the lscgroup command lists the cgroups in a hierarchy.
- man 1 lssubsys — the lssubsys command lists the hierarchies containing the specified subsystems.
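The output of cat /proc/<PID>/cgroup listed above is simple to consume programmatically: each line has the form hierarchy-ID:controller-list:cgroup-path. The sketch below turns that text into a per-controller lookup. The function name and the sample text are invented for illustration.

```python
def parse_proc_cgroup(text):
    """Map each controller to its hierarchy ID and cgroup path."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at most twice so a path containing ':' stays intact.
        hierarchy_id, controllers, path = line.split(":", 2)
        for controller in controllers.split(","):
            result[controller] = {"hierarchy": int(hierarchy_id), "path": path}
    return result


sample = "3:cpu,cpuacct:/group1\n2:memory:/lab1\n1:blkio:/\n"
membership = parse_proc_cgroup(sample)
```

To inspect the current process, pass the contents of /proc/self/cgroup instead of the sample string.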
Subsystem configuration
1. blkio – block I/O quotas
- common
- blkio.reset_stats – resets the statistics; write an integer to this file
- blkio.time – reports how long the cgroup had access to each device – device_types:node_numbers milliseconds
- blkio.sectors – reports the number of sectors the cgroup accessed on each device – device_types:node_numbers sector_count
- blkio.avg_queue_size – reports the average I/O queue size (requires CONFIG_DEBUG_BLK_CGROUP=y)
- blkio.group_wait_time – reports the total time the cgroup spent waiting (requires CONFIG_DEBUG_BLK_CGROUP=y; in ns)
- blkio.empty_time – reports the total time the cgroup spent with no pending I/O (requires CONFIG_DEBUG_BLK_CGROUP=y; in ns)
- blkio.idle_time – reports the total time (in nanoseconds, ns) the scheduler spent idling for a cgroup in anticipation of a better request than those requests already in other queues or from other groups.
- blkio.dequeue – number of times the cgroup's I/O requests were dequeued by a device (requires CONFIG_DEBUG_BLK_CGROUP=y) – device_types:node_numbers number
- blkio.io_serviced – number of I/O operations (read, write, sync, or async) the cgroup performed on specific devices, as counted by the CFQ scheduler – device_types:node_numbers operation number
- blkio.io_service_bytes – amount of data the cgroup transferred to or from specific devices per operation (read, write, sync, or async), as counted by the CFQ scheduler – device_types:node_numbers operation bytes
- blkio.io_service_time – time the cgroup's I/O operations (read, write, sync, or async) took on specific devices, as counted by the CFQ scheduler (in ns) – device_types:node_numbers operation time
- blkio.io_wait_time – time the cgroup's operations (read, write, sync, or async) on specific devices spent waiting (in ns) – device_types:node_numbers operation time
- blkio.io_merged – number of BIO requests merged into I/O requests for the cgroup, per operation (read, write, sync, or async) – number operation
- blkio.io_queued – number of I/O requests queued by the cgroup, per operation (read, write, sync, or async) – number operation
- Proportional weight division policy – allocates block I/O resources proportionally
- blkio.weight – relative weight from 100 to 1000; overridden for specific devices by blkio.weight_device
- blkio.weight_device – weight for a specific device – device_types:node_numbers weight
- I/O throttling (upper limit) policy – sets upper limits on I/O operations
- Upper limit on bytes read/written per second
blkio.throttle.read_bps_device – device_types:node_numbers bytes_per_second
blkio.throttle.write_bps_device – device_types:node_numbers bytes_per_second
- Upper limit on read/write operations per second
blkio.throttle.read_iops_device – device_types:node_numbers operations_per_second
blkio.throttle.write_iops_device – device_types:node_numbers operations_per_second
- Per-operation (read, write, sync, or async) statistics as seen by the throttling policy
blkio.throttle.io_serviced – device_types:node_numbers operation number
blkio.throttle.io_service_bytes – device_types:node_numbers operation bytes
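The throttle files take one rule per device, written as the device's major:minor numbers followed by the limit. As a hedged sketch, the helpers below build such a rule string and look up the numbers of a device node. Both function names are hypothetical, and the 8:0 device and the 10 MiB/s rate are example values only.

```python
import os


def throttle_rule(major, minor, rate):
    """Format a rule such as '8:0 10485760' for a blkio.throttle.* file."""
    if rate < 0:
        raise ValueError("rate must not be negative")
    return "{}:{} {}".format(major, minor, int(rate))


def device_numbers(device_path):
    """Return (major, minor) for a device node such as /dev/sda."""
    st = os.stat(device_path)
    return os.major(st.st_rdev), os.minor(st.st_rdev)


# Example values: limit reads on device 8:0 to 10 MiB per second.
rule = throttle_rule(8, 0, 10 * 1024 * 1024)
```

The resulting string is what would be written to blkio.throttle.read_bps_device in the target cgroup.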
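2. cpu – CPU time quotas
- CFS (Completely Fair Scheduler) policy – caps maximum CPU usage
- cpu.cfs_period_us, cpu.cfs_quota_us – required – used together: the former sets the length of a period (in microseconds) and the latter the maximum time (in microseconds) the cgroup may run within each period, which caps the tasks' usage relative to a single CPU (setting cfs_quota_us to twice cfs_period_us allows full use of two cores).
- cpu.stat – CPU statistics: nr_periods (number of cfs_period_us periods that have elapsed), nr_throttled (number of times tasks in the cgroup were throttled), throttled_time (total time, in nanoseconds, that tasks in the cgroup were throttled)
- cpu.shares – optional – relative weight for sharing CPU time
- RT (Real-Time scheduler) policy – CPU time limits for real-time tasks
- cpu.rt_period_us, cpu.rt_runtime_us – used together: in every cpu.rt_period_us (microseconds), real-time tasks in the cgroup may run for at most cpu.rt_runtime_us (microseconds)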
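The relationship between cpu.cfs_period_us and cpu.cfs_quota_us is plain arithmetic: the quota is the period multiplied by the number of CPUs' worth of time the cgroup may consume. The sketch below computes a quota and summarizes a cpu.stat sample. The helper names and the sample numbers are made up, and the 100000 microsecond default period is an assumption about the kernel default.

```python
def cfs_quota_us(cpus, period_us=100000):
    """Quota that lets a cgroup use `cpus` CPUs' worth of time per period."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return int(cpus * period_us)


def parse_cpu_stat(text):
    """Turn 'key value' lines from cpu.stat into a dict of integers."""
    stat = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stat[key] = int(value)
    return stat


def throttled_ratio(stat):
    """Fraction of elapsed periods in which the cgroup was throttled."""
    if stat["nr_periods"] == 0:
        return 0.0
    return stat["nr_throttled"] / stat["nr_periods"]


sample = "nr_periods 200\nnr_throttled 50\nthrottled_time 1500000000\n"
stat = parse_cpu_stat(sample)
```

For example, cfs_quota_us(2) gives 200000, twice the period, which matches the two-core case described above. A throttled ratio that stays high suggests the quota is too tight for the workload.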
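3. cpuacct – CPU usage reporting
- cpuacct.usage – total CPU time (in nanoseconds) used by all tasks in the cgroup
- cpuacct.stat – CPU time used by all tasks in the cgroup, split into user mode and kernel mode
- cpuacct.usage_percpu – CPU time used by all tasks in the cgroup on each CPU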
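Because cpuacct.usage is a cumulative counter in nanoseconds, a usage rate has to be derived from two samples taken some time apart. The following sketch shows the calculation with hypothetical helper names and invented sample values.

```python
def cpu_utilization(usage_start_ns, usage_end_ns, interval_s):
    """Average number of CPUs kept busy between two cpuacct.usage samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (usage_end_ns - usage_start_ns) / (interval_s * 1e9)


def parse_usage_percpu(text):
    """Split the space-separated cpuacct.usage_percpu line into integers."""
    return [int(value) for value in text.split()]


# Invented samples: 3 s of CPU time consumed over a 2 s wall-clock interval.
busy_cpus = cpu_utilization(1000000000, 4000000000, 2.0)
per_cpu = parse_usage_percpu("100 200 0\n")
```

A result of 1.5 means the cgroup kept one and a half CPUs busy on average during the interval.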
4. cpuset – CPU binding
- cpuset.cpus – required – CPUs the cgroup may use; for example, 0-2,16 means the four CPUs 0, 1, 2, and 16
- cpuset.mems – required – memory nodes the cgroup may use
- cpuset.memory_migrate – optional – whether pages are migrated when cpuset.mems changes, default 0
- cpuset.cpu_exclusive – optional – whether the CPUs are exclusive to this cgroup, default 0
- cpuset.mem_exclusive – optional – whether the memory nodes are exclusive to this cgroup, default 0
- cpuset.mem_hardwall – optional – whether the memory of tasks in the cgroup is isolated, default 0
- cpuset.memory_pressure – optional – a read-only file that contains a running average of the memory pressure created by the processes in this cpuset
- cpuset.memory_pressure_enabled – optional – switch for cpuset.memory_pressure, default 0
- cpuset.memory_spread_page – optional – contains a flag (0 or 1) that specifies whether file system buffers should be spread evenly across the memory nodes allocated to this cpuset, default 0
- cpuset.memory_spread_slab – optional – contains a flag (0 or 1) that specifies whether kernel slab caches for file input/output operations should be spread evenly across the cpuset, default 0
- cpuset.sched_load_balance – optional – whether the cgroup's CPU load is balanced across the CPUs in the cpuset, default 1
- cpuset.sched_relax_domain_level – optional – load-balancing strategy used with cpuset.sched_load_balance
- -1 = Use the system default value for load balancing
- 0 = Do not perform immediate load balancing; balance loads only periodically
- 1 = Immediately balance loads across threads on the same core
- 2 = Immediately balance loads across cores in the same package
- 3 = Immediately balance loads across CPUs on the same node or blade
- 4 = Immediately balance loads across several CPUs on architectures with non-uniform memory access (NUMA)
- 5 = Immediately balance loads across all CPUs on architectures with NUMA
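The list format accepted by cpuset.cpus (and cpuset.mems) mixes single numbers and ranges, as in 0-2,16. A short parser makes the format explicit. This is only an illustrative sketch and the function name is hypothetical.

```python
def parse_cpu_list(spec):
    """Expand a cpuset list such as '0-2,16' into a sorted list of integers."""
    cpus = set()
    for part in spec.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


cpus = parse_cpu_list("0-2,16")
```

The example expands to CPUs 0, 1, 2, and 16, matching the description of cpuset.cpus.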
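5. devices – device restrictions for a cgroup
- Device blacklist/whitelist
- devices.allow – allow list
- devices.deny – deny list
- Syntax – type device_types:node_numbers access
type – b (block devices), c (character devices), a (all devices)
access – r (read), w (write), m (create, i.e. mknod)
- devices.list – reports the devices for which access controls are set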
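A device rule has three space-separated parts: the type, the major:minor numbers, and the access flags. The sketch below validates and splits such a rule. The function name is hypothetical, and mapping the * wildcard to None is a choice made for this example.

```python
def parse_device_rule(rule):
    """Split a rule such as 'c 1:3 rwm' into its validated parts."""
    dev_type, numbers, access = rule.split()
    if dev_type not in ("a", "b", "c"):
        raise ValueError("type must be a, b or c: {!r}".format(dev_type))
    if not access or set(access) - set("rwm"):
        raise ValueError("access must use only r, w, m: {!r}".format(access))
    major, minor = numbers.split(":")
    return {
        "type": dev_type,
        # '*' is a wildcard that matches every major or minor number.
        "major": None if major == "*" else int(major),
        "minor": None if minor == "*" else int(minor),
        "access": access,
    }


null_device = parse_device_rule("c 1:3 rwm")
everything = parse_device_rule("a *:* rwm")
```

A rule validated this way could then be written to devices.allow or devices.deny of the target cgroup.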
6. freezer – suspend/resume a cgroup
- Not available in the root cgroup
- freezer.state – FROZEN (suspended), FREEZING (suspension in progress), THAWED (resumed)
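7. memory – memory limits
- memory.usage_in_bytes – reports current memory usage, in bytes
- memory.memsw.usage_in_bytes – reports the memory plus swap currently used by processes in the cgroup
- memory.max_usage_in_bytes – reports the maximum memory used by the cgroup
- memory.memsw.max_usage_in_bytes – reports the maximum memory plus swap used
- memory.limit_in_bytes – maximum memory limit; accepts the suffixes k, m, g; -1 removes the limit
- memory.memsw.limit_in_bytes – maximum memory plus swap limit; accepts the suffixes k, m, g; -1 removes the limit
- memory.failcnt – reports the number of times the memory limit was reached
- memory.memsw.failcnt – reports the number of times the memory plus swap limit was reached
- memory.force_empty – when set to 0 and the cgroup has no tasks, frees the cgroup's memory pages
- memory.swappiness – swap policy; 60 is the baseline, values below 60 make swapping out less likely and values above 60 make it more likely
- memory.use_hierarchy – whether child groups are affected
- memory.oom_control – 0 means the OOM killer is enabled, so processes are killed when an OOM occurs
- memory.stat – reports memory statistics for the cgroup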
- cache – page cache, including tmpfs (shmem), in bytes
- rss – anonymous and swap cache, not including tmpfs (shmem), in bytes
- mapped_file – size of memory-mapped files, including tmpfs (shmem), in bytes
- pgpgin – number of pages paged into memory
- pgpgout – number of pages paged out of memory
- swap – swap usage, in bytes
- active_anon – anonymous and swap cache on active least-recently-used (LRU) list, including tmpfs (shmem), in bytes
- inactive_anon – anonymous and swap cache on inactive LRU list, including tmpfs (shmem), in bytes
- active_file – file-backed memory on active LRU list, in bytes
- inactive_file – file-backed memory on inactive LRU list, in bytes
- unevictable – memory that cannot be reclaimed, in bytes
- hierarchical_memory_limit – memory limit for the hierarchy that contains the memory cgroup, in bytes
- hierarchical_memsw_limit – memory plus swap limit for the hierarchy that contains the memory cgroup, in bytes
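Two small conversions come up often with the memory subsystem: turning a limit written with a k, m, or g suffix into bytes, and reading the key/value lines of memory.stat. The helpers below are a sketch with hypothetical names. They assume the suffixes are powers of 1024, and the sample memory.stat text is invented.

```python
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_memory_limit(value):
    """Convert values such as '512m' to bytes; '-1' (no limit) becomes None."""
    text = str(value).strip().lower()
    if text == "-1":
        return None
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)


def parse_memory_stat(text):
    """Turn 'key value' lines from memory.stat into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stats[key] = int(value)
    return stats


limit = parse_memory_limit("512m")
stats = parse_memory_stat("cache 4096\nrss 8192\nswap 0\n")
```

Comparing stats["rss"] + stats["cache"] with the parsed limit gives a rough idea of how close a cgroup is to memory.limit_in_bytes.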
8. net_cls
- net_cls.classid – specifies the tc handle, so that network control is carried out through tc
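The value written to net_cls.classid packs a tc handle into one number of the form 0xAAAABBBB, where AAAA is the major part and BBBB the minor part, both hexadecimal. The conversion can be sketched as follows. The function names are hypothetical, and the 10:1 handle is just an example value.

```python
def classid_from_handle(handle):
    """Convert a tc handle such as '10:1' into the 0xAAAABBBB classid value."""
    major_hex, minor_hex = handle.split(":")
    major = int(major_hex, 16)
    minor = int(minor_hex or "0", 16)
    if not (0 <= major <= 0xFFFF and 0 <= minor <= 0xFFFF):
        raise ValueError("major and minor must each fit in 16 bits")
    return (major << 16) | minor


def handle_from_classid(classid):
    """Convert a classid value back into the 'major:minor' handle form."""
    return "{}:{}".format(format(classid >> 16, "x"), format(classid & 0xFFFF, "x"))


classid = classid_from_handle("10:1")
```

Here hex(classid) is 0x100001, the value to write to net_cls.classid so that tc filters can match the cgroup's packets against class 10:1.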
9. net_prio – sets the network interface priority for tasks
- net_prio.prioidx – a read-only file which contains a unique integer value that the kernel uses as an internal representation of this cgroup.
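- net_prio.ifpriomap – priority for each network interface – <network_interface> <priority>
10. Other
- tasks – PIDs of all processes in the cgroup
- cgroup.event_control – event API
- cgroup.procs – thread group IDs
- release_agent (present in the root cgroup only) – script executed when a cgroup has no more tasks, depending on notify_on_release
- notify_on_release – whether release_agent runs when the cgroup has no more tasks
Summary
- This article summarizes how to operate cgroups and the parameters that can be configured, laying a foundation for better control over resource allocation on a system
- Of the two scenarios for limiting resource allocation, tuning for a specific application can be very fine-grained, while general-purpose resource isolation is more likely to focus on the main CPU and memory properties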