Commit 7caa471

htejun authored and axboe committed
blkcg: implement blk-iocost
This patchset implements an IO cost model based work-conserving proportional controller.

While io.latency provides the capability to comprehensively prioritize and protect IOs depending on the cgroups, its protection is binary - the lowest latency target cgroup which is suffering is protected at the cost of all others. In many use cases including stacking multiple workload containers in a single system, it's necessary to distribute IO capacity with better granularity.

One challenge of controlling IO resources is the lack of a trivially observable cost metric. The most common metrics - bandwidth and iops - can be off by orders of magnitude depending on the device type and IO pattern.

However, the cost isn't a complete mystery. Given several key attributes, we can make fairly reliable predictions on how expensive a given stream of IOs would be, at least compared to other IO patterns. The function which determines the cost of a given IO is the IO cost model for the device. This controller distributes IO capacity based on the costs estimated by such a model. The more accurate the cost model the better, but the controller adapts based on IO completion latency and, as long as the relative costs across different IO patterns are consistent and sensible, it'll adapt to the actual performance of the device.

Currently, the only implemented cost model is a simple linear one with a few sets of default parameters for different classes of device. This covers most common devices reasonably well. All the infrastructure to tune and add different cost models is already in place and a later patch will also allow using bpf progs for cost models.

Please see the top comment in blk-iocost.c and documentation for more details.

v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix for a divide-by-zero bug in current_hweight() triggered by zero inuse_sum.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
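
The linear cost model described in the message can be illustrated with a small sketch. It assumes the three per-direction parameters documented for io.cost.model in this commit (peak sequential bytes per second, peak 4k sequential IOPS, peak 4k random IOPS) and derives a per-4k-page cost plus a fixed base cost for sequential and random IOs. The names `LinearCoefs`, `calc_coefs` and `io_cost`, the "seconds of device time" unit, and the derivation itself are illustrative assumptions, not the code in blk-iocost.c.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # the model parameters are expressed in terms of 4k IOs


@dataclass(frozen=True)
class LinearCoefs:
    """Hypothetical container for linear-model coefficients (seconds)."""
    page: float     # size-dependent cost of moving one 4k page
    seq_io: float   # fixed base cost of one sequential IO
    rand_io: float  # fixed base cost of one random IO


def calc_coefs(bps: float, seqiops: float, randiops: float) -> LinearCoefs:
    """Derive coefficients from peak throughput and peak 4k IOPS figures."""
    page = PAGE_SIZE / bps
    # A 4k IO takes 1/iops in total; whatever is left after the
    # size-dependent part is treated as the per-IO base cost.
    seq_io = max(1.0 / seqiops - page, 0.0)
    rand_io = max(1.0 / randiops - page, 0.0)
    return LinearCoefs(page, seq_io, rand_io)


def io_cost(coefs: LinearCoefs, nbytes: int, is_random: bool) -> float:
    """Estimated cost of one IO: base cost plus a per-page size term."""
    pages = -(-nbytes // PAGE_SIZE)  # round up to whole pages
    base = coefs.rand_io if is_random else coefs.seq_io
    return base + pages * coefs.page
```

By construction, a single 4k sequential IO costs 1/seqiops and a single 4k random IO costs 1/randiops, while larger IOs grow linearly with their size. Only the relative costs matter here, since the controller scales the model to observed device behavior.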
1 parent 6f816b4 commit 7caa471

7 files changed

Lines changed: 2656 additions & 0 deletions

Documentation/admin-guide/cgroup-v2.rst

Lines changed: 94 additions & 0 deletions
@@ -1435,6 +1435,100 @@ IO Interface Files
           8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
           8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
 
+  io.cost.qos
+        A read-write nested-keyed file which exists only on the root
+        cgroup.
+
+        This file configures the Quality of Service of the IO cost
+        model based controller (CONFIG_BLK_CGROUP_IOCOST) which
+        currently implements "io.weight" proportional control. Lines
+        are keyed by $MAJ:$MIN device numbers and not ordered. The
+        line for a given device is populated on the first write for
+        the device on "io.cost.qos" or "io.cost.model". The following
+        nested keys are defined.
+
+          ======  =====================================
+          enable  Weight-based control enable
+          ctrl    "auto" or "user"
+          rpct    Read latency percentile [0, 100]
+          rlat    Read latency threshold
+          wpct    Write latency percentile [0, 100]
+          wlat    Write latency threshold
+          min     Minimum scaling percentage [1, 10000]
+          max     Maximum scaling percentage [1, 10000]
+          ======  =====================================
+
+        The controller is disabled by default and can be enabled by
+        setting "enable" to 1. "rpct" and "wpct" parameters default
+        to zero and the controller uses internal device saturation
+        state to adjust the overall IO rate between "min" and "max".
+
+        When a better control quality is needed, latency QoS
+        parameters can be configured. For example::
+
+          8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0
+
+        shows that on sdb, the controller is enabled, will consider
+        the device saturated if the 95th percentile of read completion
+        latencies is above 75ms or write 150ms, and adjust the overall
+        IO issue rate between 50% and 150% accordingly.
+
+        The lower the saturation point, the better the latency QoS at
+        the cost of aggregate bandwidth. The narrower the allowed
+        adjustment range between "min" and "max", the more conformant
+        to the cost model the IO behavior. Note that the IO issue
+        base rate may be far off from 100% and setting "min" and "max"
+        blindly can lead to a significant loss of device capacity or
+        control quality. "min" and "max" are useful for regulating
+        devices which show wide temporary behavior changes - e.g. an
+        SSD which accepts writes at the line speed for a while and
+        then completely stalls for multiple seconds.
+
+        When "ctrl" is "auto", the parameters are controlled by the
+        kernel and may change automatically. Setting "ctrl" to "user"
+        or setting any of the percentile and latency parameters puts
+        it into "user" mode and disables the automatic changes. The
+        automatic mode can be restored by setting "ctrl" to "auto".
+
+  io.cost.model
+        A read-write nested-keyed file which exists only on the root
+        cgroup.
+
+        This file configures the cost model of the IO cost model based
+        controller (CONFIG_BLK_CGROUP_IOCOST) which currently
+        implements "io.weight" proportional control. Lines are keyed
+        by $MAJ:$MIN device numbers and not ordered. The line for a
+        given device is populated on the first write for the device on
+        "io.cost.qos" or "io.cost.model". The following nested keys
+        are defined.
+
+          =====  ================================
+          ctrl   "auto" or "user"
+          model  The cost model in use - "linear"
+          =====  ================================
+
+        When "ctrl" is "auto", the kernel may change all parameters
+        dynamically. When "ctrl" is set to "user" or any other
+        parameters are written to, "ctrl" becomes "user" and the
+        automatic changes are disabled.
+
+        When "model" is "linear", the following model parameters are
+        defined.
+
+          =============  ========================================
+          [r|w]bps       The maximum sequential IO throughput
+          [r|w]seqiops   The maximum 4k sequential IOs per second
+          [r|w]randiops  The maximum 4k random IOs per second
+          =============  ========================================
+
+        From the above, the builtin linear model determines the base
+        costs of a sequential and random IO and the cost coefficient
+        for the IO size. While simple, this model can cover most
+        common device classes acceptably.
+
+        The IO cost model isn't expected to be accurate in absolute
+        sense and is scaled to the device behavior dynamically.
+
   io.weight
         A read-write flat-keyed file which exists on non-root cgroups.
         The default is "default 100".
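
To make the nested-keyed format of "io.cost.qos" and "io.cost.model" concrete, the following sketch builds and parses one such line. The helper names are hypothetical, the default mount point `/sys/fs/cgroup` is an assumption about where cgroup2 is mounted, and the microsecond interpretation of "rlat" and "wlat" is inferred from the documented example (rlat=75000 described as 75ms). Writing the file would need root privileges and a kernel built with CONFIG_BLK_CGROUP_IOCOST.

```python
def format_nested_keyed(dev: str, **params) -> str:
    """Build a '$MAJ:$MIN key=value ...' line, keeping the keys in the given order."""
    return " ".join([dev] + [f"{key}={value}" for key, value in params.items()])


def parse_nested_keyed(line: str):
    """Split a nested-keyed line into the device number and a dict of string values."""
    dev, *pairs = line.split()
    return dev, dict(pair.split("=", 1) for pair in pairs)


def write_cost_qos(line: str, cgroup_root: str = "/sys/fs/cgroup") -> None:
    """Write one line to the root cgroup's io.cost.qos (assumed mount point)."""
    with open(f"{cgroup_root}/io.cost.qos", "w") as f:
        f.write(line)


if __name__ == "__main__":
    # The example line from the documentation above, rebuilt from its parts.
    line = format_nested_keyed(
        "8:16", enable=1, ctrl="auto",
        rpct="95.00", rlat=75000, wpct="95.00", wlat=150000,
        min="50.00", max="150.00",
    )
    dev, params = parse_nested_keyed(line)
    print(dev, "read latency threshold:", int(params["rlat"]) / 1000, "ms")
```

Only the formatting and parsing helpers run in the demo; `write_cost_qos` is left uncalled because it changes system state.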

block/Kconfig

Lines changed: 10 additions & 0 deletions
@@ -135,6 +135,16 @@ config BLK_CGROUP_IOLATENCY
 
           Note, this is an experimental interface and could be changed someday.
 
+config BLK_CGROUP_IOCOST
+        bool "Enable support for cost model based cgroup IO controller"
+        depends on BLK_CGROUP=y
+        select BLK_RQ_ALLOC_TIME
+        ---help---
+        Enabling this option enables the .weight interface for cost
+        model based proportional IO control. The IO controller
+        distributes IO capacity between different groups based on
+        their share of the overall weight distribution.
+
 config BLK_WBT_MQ
         bool "Multiqueue writeback throttling"
         default y
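
The help text describes capacity being split according to each group's share of the overall weight. A flat, single-level sketch of that arithmetic follows. The function name and the cgroup names are made up for illustration, and the real controller applies weights across the cgroup hierarchy, which this sketch ignores.

```python
def weight_shares(weights: dict) -> dict:
    """Return each group's fraction of the total weight (flat, one level only)."""
    total = sum(weights.values())
    if total == 0:
        # Avoid dividing by zero when no group carries any weight.
        return {name: 0.0 for name in weights}
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    # Hypothetical sibling cgroups; 100 is the documented io.weight default.
    for name, share in weight_shares({"batch": 100, "web": 300}).items():
        print(f"{name}: {share:.0%} of IO capacity")
```

With these weights the group holding 300 is entitled to three times the capacity of the group left at the default of 100.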

block/Makefile

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ obj-$(CONFIG_BLK_DEV_BSGLIB) += bsg-lib.o
 obj-$(CONFIG_BLK_CGROUP) += blk-cgroup.o
 obj-$(CONFIG_BLK_DEV_THROTTLING) += blk-throttle.o
 obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += blk-iolatency.o
+obj-$(CONFIG_BLK_CGROUP_IOCOST) += blk-iocost.o
 obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
 obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
 bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
