Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched

* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: (140 commits)
  sched: sync wakeups preempt too
  sched: affine sync wakeups
  sched: guest CPU accounting: maintain guest state in KVM
  sched: guest CPU accounting: maintain stats in account_system_time()
  sched: guest CPU accounting: add guest-CPU /proc/<pid>/stat fields
  sched: guest CPU accounting: add guest-CPU /proc/stat field
  sched: domain sysctl fixes: add terminator comment
  sched: domain sysctl fixes: do not crash on allocation failure
  sched: domain sysctl fixes: unregister the sysctl table before domains
  sched: domain sysctl fixes: use for_each_online_cpu()
  sched: domain sysctl fixes: use kcalloc()
  Make scheduler debug file operations const
  sched: enable wake-idle on CONFIG_SCHED_MC=y
  sched: reintroduce topology.h tunings
  sched: allow the immediate migration of cache-cold tasks
  sched: debug, improve migration statistics
  sched: debug: increase width of debug line
  sched: activate task_hot() only on fair-scheduled tasks
  sched: reintroduce cache-hot affinity
  sched: speed up context-switches a bit
  ...
Linus Torvalds committed Oct 15, 2007
2 parents df3d80f + 9c63d9c commit b5869ce
Showing 25 changed files with 1,872 additions and 1,288 deletions.
67 changes: 67 additions & 0 deletions Documentation/sched-design-CFS.txt
@@ -117,3 +117,70 @@ Some implementation details:
iterators of the scheduling modules are used. The balancing code got
quite a bit simpler as a result.


Group scheduler extension to CFS
================================

Normally the scheduler operates on individual tasks and strives to provide
fair CPU time to each task. Sometimes, it may be desirable to group tasks
and provide fair CPU time to each such task group. For example, it may
be desirable to first provide fair CPU time to each user on the system
and then to each task belonging to a user.

CONFIG_FAIR_GROUP_SCHED strives to achieve exactly that. It lets
SCHED_NORMAL/BATCH tasks be grouped and divides CPU time fairly among such
groups. At present, there are two (mutually exclusive) mechanisms to group
tasks for CPU bandwidth control purposes:

- Based on user id (CONFIG_FAIR_USER_SCHED)
In this option, tasks are grouped according to their user id.
- Based on "cgroup" pseudo filesystem (CONFIG_FAIR_CGROUP_SCHED)
This option lets the administrator create arbitrary groups
of tasks, using the "cgroup" pseudo filesystem. See
Documentation/cgroups.txt for more information about this
filesystem.

Only one of these options can be chosen to group tasks; they cannot be used together.

Group scheduler tunables:

When CONFIG_FAIR_USER_SCHED is defined, a directory is created in sysfs for
each new user and a "cpu_share" file is added in that directory.

# cd /sys/kernel/uids
# cat 512/cpu_share # Display user 512's CPU share
1024
# echo 2048 > 512/cpu_share # Modify user 512's CPU share
# cat 512/cpu_share # Display user 512's CPU share
2048
#

CPU bandwidth between two users is divided in the ratio of their CPU shares.
For example: if you would like user "root" to get twice the bandwidth of user
"guest", then set the cpu_share for both users such that "root"'s
cpu_share is twice "guest"'s cpu_share.
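
A minimal sketch of that setup (uid values assumed for illustration: "root"
is uid 0 and "guest" is uid 1005 here):

# cd /sys/kernel/uids
# echo 2048 > 0/cpu_share       # root: weight 2048
# echo 1024 > 1005/cpu_share    # guest: weight 1024, half of root's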


When CONFIG_FAIR_CGROUP_SCHED is defined, a "cpu.shares" file is created
for each group created using the pseudo filesystem. See the example steps
below to create task groups and modify their CPU share using the "cgroup"
pseudo filesystem:

# mkdir /dev/cpuctl
# mount -t cgroup -ocpu none /dev/cpuctl
# cd /dev/cpuctl

# mkdir multimedia # create "multimedia" group of tasks
# mkdir browser # create "browser" group of tasks

# #Configure the multimedia group to receive twice the CPU bandwidth
# #of the browser group

# echo 2048 > multimedia/cpu.shares
# echo 1024 > browser/cpu.shares

# firefox & # Launch firefox and move it to "browser" group
# echo <firefox_pid> > browser/tasks

# #Launch gmplayer (or your favourite movie player)
# echo <movie_player_pid> > multimedia/tasks
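
As a variation (standard cgroup "tasks" semantics assumed): moving the
current shell into a group makes every task it subsequently launches start
in that group:

# echo $$ > multimedia/tasks    # move this shell; children inherit the group
# gmplayer moviefile &
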
11 changes: 11 additions & 0 deletions arch/i386/Kconfig
@@ -214,6 +214,17 @@ config X86_ES7000

endchoice

config SCHED_NO_NO_OMIT_FRAME_POINTER
bool "Single-depth WCHAN output"
default y
help
Calculate simpler /proc/<PID>/wchan values. If this option
is disabled then wchan values will recurse back to the
caller function. This provides more accurate wchan values,
at the expense of slightly more scheduling overhead.

If in doubt, say "Y".
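
For reference, the wchan value this option shapes can be inspected directly
from procfs; a quick check (the symbol printed varies by kernel and
workload):

# cat /proc/1/wchan; echo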

config PARAVIRT
bool "Paravirtualization support (EXPERIMENTAL)"
depends on EXPERIMENTAL
10 changes: 10 additions & 0 deletions drivers/kvm/kvm.h
@@ -624,6 +624,16 @@ void kvm_mmu_unload(struct kvm_vcpu *vcpu);

int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run);

static inline void kvm_guest_enter(void)
{
current->flags |= PF_VCPU;
}

static inline void kvm_guest_exit(void)
{
current->flags &= ~PF_VCPU;
}

static inline int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
u32 error_code)
{
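
These helpers flag the current task with PF_VCPU so the timer-tick
accounting path can attribute the time to guest execution. A minimal sketch
of how account_system_time() might consume the flag, per the "guest CPU
accounting: maintain stats in account_system_time()" commit above (a sketch
under assumed field handling, not quoted from this merge):

static void account_guest_time(struct task_struct *p, cputime_t cputime)
{
	cputime64_t tmp = cputime_to_cputime64(cputime);
	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;

	/* guest time is accounted as user time plus the new guest bucket */
	p->utime = cputime_add(p->utime, cputime);
	p->gtime = cputime_add(p->gtime, cputime);
	cpustat->user = cputime64_add(cpustat->user, tmp);
	cpustat->guest = cputime64_add(cpustat->guest, tmp);
}

void account_system_time(struct task_struct *p, int hardirq_offset,
			 cputime_t cputime)
{
	if (p->flags & PF_VCPU) {
		account_guest_time(p, cputime);
		p->flags &= ~PF_VCPU;
		return;
	}
	/* ... regular system-time accounting continues here ... */
}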
2 changes: 2 additions & 0 deletions drivers/kvm/kvm_main.c
@@ -2046,13 +2046,15 @@ static int __vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
kvm_x86_ops->inject_pending_vectors(vcpu, kvm_run);

vcpu->guest_mode = 1;
kvm_guest_enter();

if (vcpu->requests)
if (test_and_clear_bit(KVM_TLB_FLUSH, &vcpu->requests))
kvm_x86_ops->tlb_flush(vcpu);

kvm_x86_ops->run(vcpu, kvm_run);

kvm_guest_exit();
vcpu->guest_mode = 0;
local_irq_enable();

9 changes: 4 additions & 5 deletions fs/pipe.c
@@ -45,8 +45,7 @@ void pipe_wait(struct pipe_inode_info *pipe)
* Pipes are system-local resources, so sleeping on them
* is considered a noninteractive wait:
*/
prepare_to_wait(&pipe->wait, &wait,
TASK_INTERRUPTIBLE | TASK_NONINTERACTIVE);
prepare_to_wait(&pipe->wait, &wait, TASK_INTERRUPTIBLE);
if (pipe->inode)
mutex_unlock(&pipe->inode->i_mutex);
schedule();
@@ -383,7 +382,7 @@ pipe_read(struct kiocb *iocb, const struct iovec *_iov,

/* Signal writers asynchronously that there is more room. */
if (do_wakeup) {
wake_up_interruptible(&pipe->wait);
wake_up_interruptible_sync(&pipe->wait);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}
if (ret > 0)
@@ -556,7 +555,7 @@ pipe_write(struct kiocb *iocb, const struct iovec *_iov,
out:
mutex_unlock(&inode->i_mutex);
if (do_wakeup) {
wake_up_interruptible(&pipe->wait);
wake_up_interruptible_sync(&pipe->wait);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
}
if (ret > 0)
@@ -650,7 +649,7 @@ pipe_release(struct inode *inode, int decr, int decw)
if (!pipe->readers && !pipe->writers) {
free_pipe_info(inode);
} else {
wake_up_interruptible(&pipe->wait);
wake_up_interruptible_sync(&pipe->wait);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}
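
The pipe wakeups above switch to the _sync variants, which hint to the
scheduler that the waker is about to block or still has work to do, so the
wakee need not preempt it immediately; this pairs with the "sync wakeups
preempt too" and "affine sync wakeups" commits in this merge. For reference,
a definition of the variant as assumed from include/linux/wait.h of this
era:

#define wake_up_interruptible_sync(x)	__wake_up_sync((x), TASK_INTERRUPTIBLE, 1)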
17 changes: 15 additions & 2 deletions fs/proc/array.c
@@ -370,6 +370,11 @@ static cputime_t task_stime(struct task_struct *p)
}
#endif

static cputime_t task_gtime(struct task_struct *p)
{
return p->gtime;
}

static int do_task_stat(struct task_struct *task, char *buffer, int whole)
{
unsigned long vsize, eip, esp, wchan = ~0UL;
@@ -385,6 +390,7 @@ static int do_task_stat(struct task_struct *task, char *buffer, int whole)
unsigned long cmin_flt = 0, cmaj_flt = 0;
unsigned long min_flt = 0, maj_flt = 0;
cputime_t cutime, cstime, utime, stime;
cputime_t cgtime, gtime;
unsigned long rsslim = 0;
char tcomm[sizeof(task->comm)];
unsigned long flags;
@@ -403,6 +409,7 @@ static int do_task_stat(struct task_struct *task, char *buffer, int whole)
sigemptyset(&sigign);
sigemptyset(&sigcatch);
cutime = cstime = utime = stime = cputime_zero;
cgtime = gtime = cputime_zero;

rcu_read_lock();
if (lock_task_sighand(task, &flags)) {
@@ -420,6 +427,7 @@ static int do_task_stat(struct task_struct *task, char *buffer, int whole)
cmaj_flt = sig->cmaj_flt;
cutime = sig->cutime;
cstime = sig->cstime;
cgtime = sig->cgtime;
rsslim = sig->rlim[RLIMIT_RSS].rlim_cur;

/* add up live thread stats at the group level */
@@ -430,13 +438,15 @@ static int do_task_stat(struct task_struct *task, char *buffer, int whole)
maj_flt += t->maj_flt;
utime = cputime_add(utime, task_utime(t));
stime = cputime_add(stime, task_stime(t));
gtime = cputime_add(gtime, task_gtime(t));
t = next_thread(t);
} while (t != task);

min_flt += sig->min_flt;
maj_flt += sig->maj_flt;
utime = cputime_add(utime, sig->utime);
stime = cputime_add(stime, sig->stime);
gtime = cputime_add(gtime, sig->gtime);
}

sid = signal_session(sig);
@@ -454,6 +464,7 @@ static int do_task_stat(struct task_struct *task, char *buffer, int whole)
maj_flt = task->maj_flt;
utime = task_utime(task);
stime = task_stime(task);
gtime = task_gtime(task);
}

/* scale priority and nice values from timeslices to -20..20 */
@@ -471,7 +482,7 @@ static int do_task_stat(struct task_struct *task, char *buffer, int whole)

res = sprintf(buffer, "%d (%s) %c %d %d %d %d %d %u %lu \
%lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu\n",
%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld\n",
task->pid,
tcomm,
state,
@@ -516,7 +527,9 @@ static int do_task_stat(struct task_struct *task, char *buffer, int whole)
task_cpu(task),
task->rt_priority,
task->policy,
(unsigned long long)delayacct_blkio_ticks(task));
(unsigned long long)delayacct_blkio_ticks(task),
cputime_to_clock_t(gtime),
cputime_to_clock_t(cgtime));
if (mm)
mmput(mm);
return res;
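
With the two extra format specifiers, per-task guest time and cumulated
child guest time are appended to /proc/<pid>/stat; counting the fields in
the format string places them at positions 43 and 44 (positions assumed
from this diff; values are in clock ticks):

# awk '{ print "guest:", $43, "cguest:", $44 }' /proc/self/stat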
2 changes: 1 addition & 1 deletion fs/proc/base.c
@@ -304,7 +304,7 @@ static int proc_pid_schedstat(struct task_struct *task, char *buffer)
return sprintf(buffer, "%llu %llu %lu\n",
task->sched_info.cpu_time,
task->sched_info.run_delay,
task->sched_info.pcnt);
task->sched_info.pcount);
}
#endif

Expand Down
15 changes: 11 additions & 4 deletions fs/proc/proc_misc.c
@@ -443,6 +443,7 @@ static int show_stat(struct seq_file *p, void *v)
int i;
unsigned long jif;
cputime64_t user, nice, system, idle, iowait, irq, softirq, steal;
cputime64_t guest;
u64 sum = 0;
struct timespec boottime;
unsigned int *per_irq_sum;
@@ -453,6 +454,7 @@

user = nice = system = idle = iowait =
irq = softirq = steal = cputime64_zero;
guest = cputime64_zero;
getboottime(&boottime);
jif = boottime.tv_sec;

@@ -467,22 +469,24 @@
irq = cputime64_add(irq, kstat_cpu(i).cpustat.irq);
softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq);
steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal);
guest = cputime64_add(guest, kstat_cpu(i).cpustat.guest);
for (j = 0; j < NR_IRQS; j++) {
unsigned int temp = kstat_cpu(i).irqs[j];
sum += temp;
per_irq_sum[j] += temp;
}
}

seq_printf(p, "cpu %llu %llu %llu %llu %llu %llu %llu %llu\n",
seq_printf(p, "cpu %llu %llu %llu %llu %llu %llu %llu %llu %llu\n",
(unsigned long long)cputime64_to_clock_t(user),
(unsigned long long)cputime64_to_clock_t(nice),
(unsigned long long)cputime64_to_clock_t(system),
(unsigned long long)cputime64_to_clock_t(idle),
(unsigned long long)cputime64_to_clock_t(iowait),
(unsigned long long)cputime64_to_clock_t(irq),
(unsigned long long)cputime64_to_clock_t(softirq),
(unsigned long long)cputime64_to_clock_t(steal));
(unsigned long long)cputime64_to_clock_t(steal),
(unsigned long long)cputime64_to_clock_t(guest));
for_each_online_cpu(i) {

/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
@@ -494,7 +498,9 @@ static int show_stat(struct seq_file *p, void *v)
irq = kstat_cpu(i).cpustat.irq;
softirq = kstat_cpu(i).cpustat.softirq;
steal = kstat_cpu(i).cpustat.steal;
seq_printf(p, "cpu%d %llu %llu %llu %llu %llu %llu %llu %llu\n",
guest = kstat_cpu(i).cpustat.guest;
seq_printf(p,
"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu\n",
i,
(unsigned long long)cputime64_to_clock_t(user),
(unsigned long long)cputime64_to_clock_t(nice),
Expand All @@ -503,7 +509,8 @@ static int show_stat(struct seq_file *p, void *v)
(unsigned long long)cputime64_to_clock_t(iowait),
(unsigned long long)cputime64_to_clock_t(irq),
(unsigned long long)cputime64_to_clock_t(softirq),
(unsigned long long)cputime64_to_clock_t(steal));
(unsigned long long)cputime64_to_clock_t(steal),
(unsigned long long)cputime64_to_clock_t(guest));
}
seq_printf(p, "intr %llu", (unsigned long long)sum);

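
After this change each cpu line in /proc/stat carries a ninth value, and
with this patch the guest column is the last field (position assumed from
the format string above):

# awk '/^cpu /{ print "guest ticks:", $NF }' /proc/stat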
1 change: 1 addition & 0 deletions include/linux/kernel_stat.h
@@ -23,6 +23,7 @@ struct cpu_usage_stat {
cputime64_t idle;
cputime64_t iowait;
cputime64_t steal;
cputime64_t guest;
};

struct kernel_stat {