Merge branch 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - cgroup v2 interface is now official.  It's no longer hidden behind a
   devel flag and can be mounted using the new cgroup2 fs type.

   Unfortunately, cpu v2 interface hasn't made it yet due to the
   discussion around in-process hierarchical resource distribution and
   only memory and io controllers can be used on the v2 interface at the
   moment.

 - The existing documentation, which has always been a bit of a mess, is
   relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt
   is added as the authoritative documentation for the v2 interface.

 - Some features are added through for-4.5-ancestor-test branch to
   enable netfilter xt_cgroup match to use cgroup v2 paths.  The actual
   netfilter changes will be merged through the net tree which pulled in
   the said branch.

 - Various cleanups

* 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: rename cgroup documentations
  cgroup: fix a typo.
  cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX.
  cgroup: demote subsystem init messages to KERN_DEBUG
  cgroup: Fix uninitialized variable warning
  cgroup: put controller Kconfig options in meaningful order
  cgroup: clean up the kernel configuration menu nomenclature
  cgroup_pids: fix a typo.
  Subject: cgroup: Fix incomplete dd command in blkio documentation
  cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends
  cpuset: Replace all instances of time_t with time64_t
  cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation
  cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/
  cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type
torvalds committed Jan 13, 2016
2 parents aee3bfa + 6255c46 commit 34a9304
Showing 27 changed files with 1,467 additions and 961 deletions.
@@ -24,7 +24,5 @@ net_prio.txt
- Network priority cgroups details and usages.
pids.txt
- Process number cgroups details and usages.
-resource_counter.txt
-- Resource Counter API.
unified-hierarchy.txt
- Description the new/next cgroup interface.
@@ -84,8 +84,7 @@ Throttling/Upper Limit policy

- Run dd to read a file and see if rate is throttled to 1MB/s or not.

-# dd if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
-# iflag=direct
+# dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s
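
The figures in the dd output above can be cross-checked with a little
arithmetic. The following Python sketch only recomputes the byte count
and the average rate from the numbers shown. It assumes that dd reports
decimal megabytes (10^6 bytes), and the variable names are illustrative
rather than taken from the documentation.

```python
# Recompute the numbers reported by the dd run above.
BLOCK_SIZE = 4 * 1024      # bs=4K
BLOCK_COUNT = 1024         # count=1024
ELAPSED_SECONDS = 4.0001   # elapsed time printed by dd

total_bytes = BLOCK_SIZE * BLOCK_COUNT

# Assumption: dd reports decimal megabytes (1 MB = 10**6 bytes).
rate_mb_per_s = total_bytes / ELAPSED_SECONDS / 1e6

print(total_bytes)              # 4194304
print(round(rate_mb_per_s, 1))  # 1.0
```

A rate close to 1.0 MB/s is what one would expect to see if the
throttling limit is being applied.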
@@ -374,82 +373,3 @@ One can experience an overall throughput drop if you have created multiple
groups and put applications in that group which are not driving enough
IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
on individual groups and throughput should improve.

Writeback
=========

Page cache is dirtied through buffered writes and shared mmaps and
written asynchronously to the backing filesystem by the writeback
mechanism. Writeback sits between the memory and IO domains and
regulates the proportion of dirty memory by balancing dirtying and
write IOs.

On traditional cgroup hierarchies, relationships between different
controllers cannot be established, which makes it impossible for
writeback to take cgroup resource restrictions into account. As a
result, all writeback IOs are attributed to the root cgroup.

If both the blkio and memory controllers are used on the v2 hierarchy
and the filesystem supports cgroup writeback, writeback operations
correctly follow the resource restrictions imposed by both memory and
blkio controllers.

Writeback examines both system-wide and per-cgroup dirty memory status
and enforces the more restrictive of the two. Also, writeback control
parameters which are absolute values - vm.dirty_bytes and
vm.dirty_background_bytes - are distributed across cgroups according
to their current writeback bandwidth.
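
As a rough illustration of the two rules above, the following Python
sketch splits an absolute limit across cgroups in proportion to their
writeback bandwidth and picks the more restrictive of a system-wide and
a per-cgroup limit. It is a simplified model under stated assumptions,
not the kernel's computation: the function names, the even split when
no bandwidth has been measured, and the sample numbers are all
hypothetical.

```python
def split_dirty_bytes(total_dirty_bytes, writeback_bandwidth):
    """Split an absolute dirty limit across cgroups by bandwidth share.

    writeback_bandwidth maps a cgroup name to its current writeback
    bandwidth in arbitrary but consistent units.
    """
    if not writeback_bandwidth:
        return {}
    total_bw = sum(writeback_bandwidth.values())
    if total_bw == 0:
        # Assumption of this sketch: with no measured bandwidth,
        # fall back to an even split.
        even = total_dirty_bytes // len(writeback_bandwidth)
        return {cg: even for cg in writeback_bandwidth}
    return {
        cg: total_dirty_bytes * bw // total_bw
        for cg, bw in writeback_bandwidth.items()
    }


def effective_limit(system_limit, cgroup_limit):
    """Return the more restrictive (smaller) of the two limits."""
    return min(system_limit, cgroup_limit)


MIB = 2 ** 20
# Hypothetical setup: vm.dirty_bytes of 300 MiB, two cgroups whose
# writeback bandwidths are in a 2:1 ratio.
shares = split_dirty_bytes(300 * MIB, {"a": 100, "b": 50})
print(shares["a"] // MIB, shares["b"] // MIB)  # 200 100
```

In this model a cgroup that currently writes back twice as fast
receives twice the share of the absolute limit.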

There's a peculiarity stemming from the discrepancy in ownership
granularity between the memory controller and writeback. While the
memory controller tracks ownership per page, writeback operates on a
per-inode basis. cgroup writeback bridges the gap by tracking
ownership by inode but migrating ownership if too many foreign pages,
i.e. pages which don't match the current inode ownership, have been
encountered while writing back the inode.
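
The ownership migration described above can be modeled with a toy
Python sketch. It assumes a simple rule: after a writeback round,
ownership moves if the fraction of foreign pages exceeds a threshold.
The class name, the method name and the 0.5 threshold are hypothetical,
and the kernel's actual heuristic is more involved and is not
reproduced here.

```python
from collections import Counter


class InodeOwnership:
    """Toy model of per-inode writeback ownership."""

    def __init__(self, owner, foreign_threshold=0.5):
        self.owner = owner
        # Hypothetical tunable: fraction of foreign pages in one
        # writeback round above which ownership is reconsidered.
        self.foreign_threshold = foreign_threshold

    def finish_writeback(self, page_owners):
        """Record one writeback round and return the resulting owner.

        page_owners lists the owning cgroup of every page written
        back in this round.
        """
        if not page_owners:
            return self.owner
        counts = Counter(page_owners)
        foreign = len(page_owners) - counts[self.owner]
        if foreign / len(page_owners) > self.foreign_threshold:
            # Hand the inode to the cgroup that owns the most pages in
            # this round (which may still be the current owner).
            self.owner = counts.most_common(1)[0][0]
        return self.owner


inode = InodeOwnership("A")
print(inode.finish_writeback(["A", "A", "B"]))       # A
print(inode.finish_writeback(["B", "B", "B", "A"]))  # B
```

In the first round only one page in three is foreign, so the owner is
kept. In the second round three pages in four are foreign, so the
inode moves to cgroup B.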

This is a conscious design choice as writeback operations are
inherently tied to inodes, which makes strictly following page
ownership complicated and inefficient. The only use case which
suffers from this compromise is multiple cgroups concurrently dirtying
disjoint regions of the same inode, which is considered unlikely and
is deliberately left unsupported. Note that as the memory controller
assigns page ownership on the first use and doesn't update it until
the page is released, even if cgroup writeback strictly followed page
ownership, multiple cgroups dirtying overlapping areas wouldn't work
as expected. In general, write-sharing an inode across multiple
cgroups is not well supported.

Filesystem support for cgroup writeback
---------------------------------------

A filesystem can make writeback IOs cgroup-aware by updating
address_space_operations->writepage[s]() to annotate bio's using the
following two functions.

* wbc_init_bio(@wbc, @bio)

Should be called for each bio carrying writeback data and associates
the bio with the inode's owner cgroup. Can be called anytime
between bio allocation and submission.

* wbc_account_io(@wbc, @page, @bytes)

Should be called for each data segment being written out. While
this function doesn't care exactly when it's called during the
writeback session, it's easiest and most natural to call it as
data segments are added to a bio.

With writeback bio's annotated, cgroup support can be enabled per
super_block by setting MS_CGROUPWB in ->s_flags. This allows for
selective disabling of cgroup writeback support which is helpful when
certain filesystem features, e.g. journaled data mode, are
incompatible.

wbc_init_bio() binds the specified bio to its cgroup. Depending on
the configuration, the bio may be executed at a lower priority, which
may lead to priority inversion if the writeback session is holding
shared resources, e.g. a journal entry. There is no single easy
solution to the problem. Filesystems can try to work around specific
problem cases by skipping wbc_init_bio() or using
bio_associate_blkcg() directly.
File renamed without changes.