path: root/kernel/sched/core.c
2015-06-24  Merge branch 'sched-hrtimers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull scheduler updates from Thomas Gleixner:
 "This series of scheduler updates depends on sched/core and timers/core branches, which are already in your tree:

   - Scheduler balancing overhaul to plug a hard to trigger race which causes an oops in the balancer (Peter Zijlstra)

   - Lockdep updates which are related to the balancing updates (Peter Zijlstra)"

* 'sched-hrtimers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched,lockdep: Employ lock pinning
  lockdep: Implement lock pinning
  lockdep: Simplify lock_release()
  sched: Streamline the task migration locking a little
  sched: Move code around
  sched,dl: Fix sched class hopping CBS hole
  sched, dl: Convert switched_{from, to}_dl() / prio_changed_dl() to balance callbacks
  sched,dl: Remove return value from pull_dl_task()
  sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks
  sched,rt: Remove return value from pull_rt_task()
  sched: Allow balance callbacks for check_class_changed()
  sched: Use replace normalize_task() with __sched_setscheduler()
  sched: Replace post_schedule with a balance callback list

2015-06-22  Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull NOHZ updates from Thomas Gleixner:
 "A few updates to the nohz infrastructure:

   - recursion protection for context tracking

   - make the TIF_NOHZ inheritance smarter

   - isolate cpus which belong to the NOHZ full set"

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  nohz: Set isolcpus when nohz_full is set
  nohz: Add tick_nohz_full_add_cpus_to() API
  context_tracking: Inherit TIF_NOHZ through forks instead of context switches
  context_tracking: Protect against recursion

2015-06-22  Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull timer updates from Thomas Gleixner:
 "A rather largish update for everything time and timer related:

   - Cache footprint optimizations for both hrtimers and timer wheel

   - Lower the NOHZ impact on systems which have NOHZ or timer migration disabled at runtime

   - Optimize run time overhead of hrtimer interrupt by making the clock offset updates smarter

   - hrtimer cleanups and removal of restrictions to tackle some problems in sched/perf

   - Some more leap second tweaks

   - Another round of changes addressing the 2038 problem

   - First step to change the internals of clock event devices by introducing the necessary infrastructure

   - Allow constant folding for usecs/msecs_to_jiffies()

   - The usual pile of clockevent/clocksource driver updates

  The hrtimer changes contain updates to sched, perf and x86 as they depend on them, plus changes all over the tree to clean up API changes and redundant code, which got copied all over the place. The y2038 changes touch s390 to remove the last non-2038-safe code related to the boot/persistent clock"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
  clocksource: Increase dependencies of timer-stm32 to limit build wreckage
  timer: Minimize nohz off overhead
  timer: Reduce timer migration overhead if disabled
  timer: Stats: Simplify the flags handling
  timer: Replace timer base by a cpu index
  timer: Use hlist for the timer wheel hash buckets
  timer: Remove FIFO "guarantee"
  timers: Sanitize catchup_timer_jiffies() usage
  hrtimer: Allow hrtimer::function() to free the timer
  seqcount: Introduce raw_write_seqcount_barrier()
  seqcount: Rename write_seqcount_barrier()
  hrtimer: Fix hrtimer_is_queued() hole
  hrtimer: Remove HRTIMER_STATE_MIGRATE
  selftest: Timers: Avoid signal deadlock in leap-a-day
  timekeeping: Copy the shadow-timekeeper over the real timekeeper last
  clockevents: Check state instead of mode in suspend/resume path
  selftests: timers: Add leap-second timer edge testing to leap-a-day.c
  ntp: Do leapsecond adjustment in adjtimex read path
  time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
  ntp: Introduce and use SECS_PER_DAY macro instead of 86400
  ...

2015-06-22  Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull scheduler updates from Ingo Molnar:
 "The main changes are:

   - lockless wakeup support for futexes and IPC message queues (Davidlohr Bueso, Peter Zijlstra)

   - Replace spinlocks with atomics in thread_group_cputimer(), to improve scalability (Jason Low)

   - NUMA balancing improvements (Rik van Riel)

   - SCHED_DEADLINE improvements (Wanpeng Li)

   - clean up and reorganize preemption helpers (Frederic Weisbecker)

   - decouple page fault disabling machinery from the preemption counter, to improve debuggability and robustness (David Hildenbrand)

   - SCHED_DEADLINE documentation updates (Luca Abeni)

   - topology CPU masks cleanups (Bartosz Golaszewski)

   - /proc/sched_debug improvements (Srikar Dronamraju)"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
  sched/deadline: Remove needless parameter in dl_runtime_exceeded()
  sched: Remove superfluous resetting of the p->dl_throttled flag
  sched/deadline: Drop duplicate init_sched_dl_class() declaration
  sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target
  sched/deadline: Make init_sched_dl_class() __init
  sched/deadline: Optimize pull_dl_task()
  sched/preempt: Add static_key() to preempt_notifiers
  sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration
  sched/stop_machine: Fix deadlock between multiple stop_two_cpus()
  sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched
  sched/debug: Replace vruntime with wait_sum in /proc/sched_debug
  sched/debug: Properly format runnable tasks in /proc/sched_debug
  sched/numa: Only consider less busy nodes as numa balancing destinations
  Revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced")
  sched/fair: Prevent throttling in early pick_next_task_fair()
  preempt: Reorganize the notrace definitions a bit
  preempt: Use preempt_schedule_context() as the official tracing preemption point
  sched: Make preempt_schedule_context() function-tracing safe
  x86: Remove cpu_sibling_mask() and cpu_core_mask()
  x86: Replace cpu_**_mask() with topology_**_cpumask()
  ...

2015-06-19  timer: Reduce timer migration overhead if disabled  (Thomas Gleixner)

Eric reported that the timer_migration sysctl is not really nice performance-wise, as it needs to check at every timer insertion whether the feature is enabled or not. Further, the check does not live in the timer code, so we have an extra function call which checks an extra cache line to figure out that it is disabled.

We can do better and store that information in the per-cpu (hr)timer bases. I pondered using a static key, but that's a nightmare to update from the nohz code, and the timer base cache line is hot anyway when we select a timer base.

The old logic enabled the timer migration unconditionally if CONFIG_NO_HZ was set, even if nohz was disabled on the kernel command line. With this modification, we start off with migration disabled. The user-visible sysctl is still set to enabled. If the kernel switches to NOHZ, migration is enabled unless the user disabled it via the sysctl prior to the switch. If nohz=off is on the kernel command line, migration stays disabled no matter what.

Before:
  47.76%  hog       [.] main
  14.84%  [kernel]  [k] _raw_spin_lock_irqsave
   9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.71%  [kernel]  [k] mod_timer
   6.24%  [kernel]  [k] lock_timer_base.isra.38
   3.76%  [kernel]  [k] detach_if_pending
   3.71%  [kernel]  [k] del_timer
   2.50%  [kernel]  [k] internal_add_timer
   1.51%  [kernel]  [k] get_nohz_timer_target
   1.28%  [kernel]  [k] __internal_add_timer
   0.78%  [kernel]  [k] timerfn
   0.48%  [kernel]  [k] wake_up_nohz_cpu

After:
  48.10%  hog       [.] main
  15.25%  [kernel]  [k] _raw_spin_lock_irqsave
   9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.50%  [kernel]  [k] mod_timer
   6.44%  [kernel]  [k] lock_timer_base.isra.38
   3.87%  [kernel]  [k] detach_if_pending
   3.80%  [kernel]  [k] del_timer
   2.67%  [kernel]  [k] internal_add_timer
   1.33%  [kernel]  [k] __internal_add_timer
   0.73%  [kernel]  [k] timerfn
   0.54%  [kernel]  [k] wake_up_nohz_cpu

Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Joonwoo Park <joonwoop@codeaurora.org>
Cc: Wenbo Wang <wenbo.wang@memblaze.com>
Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

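[The idea is easiest to see in code. A minimal sketch of the pattern described above, assuming a simplified timer base; field and helper names follow the commit description, not necessarily the final code:]

	struct tvec_base {
		spinlock_t	lock;
		bool		migration_enabled;	/* cached sysctl/nohz state */
		/* ... timer wheel hash buckets, etc. ... */
	};

	static DEFINE_PER_CPU(struct tvec_base, tvec_bases);

	/*
	 * Pick the base for a new timer without touching any foreign
	 * cache line: the enabled/disabled decision lives in the hot
	 * per-CPU base itself instead of a global sysctl variable.
	 */
	static inline struct tvec_base *get_target_base(struct tvec_base *base,
							int pinned)
	{
		if (pinned || !base->migration_enabled)
			return this_cpu_ptr(&tvec_bases);
		return per_cpu_ptr(&tvec_bases, get_nohz_timer_target());
	}
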
2015-06-19  sched: Remove superfluous resetting of the p->dl_throttled flag  (Wanpeng Li)

Resetting the p->dl_throttled flag in rt_mutex_setprio() (for a task that is going to be boosted) is superfluous, as the natural place to do so is in replenish_dl_entity(). If the task was on the runqueue and is boosted by a DL task, it will be enqueued back with the ENQUEUE_REPLENISH flag set, which guarantees that dl_throttled is reset in replenish_dl_entity().

This patch drops the resetting of the throttled status in rt_mutex_setprio().

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431496867-4194-6-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2015-06-19  sched/preempt: Add static_key() to preempt_notifiers  (Peter Zijlstra)

Avoid touching the curr->preempt_notifier cacheline when not needed.

Provides a small improvement on pipe-bench:

  taskset 01 perf stat --repeat 10 -- perf bench sched pipe

before:

 Performance counter stats for 'perf bench sched pipe' (10 runs):

      12385.016204      task-clock (msec)        #    1.001 CPUs utilized            ( +- 0.34% )
         2,000,023      context-switches         #    0.161 M/sec                    ( +- 0.00% )
                 0      cpu-migrations           #    0.000 K/sec
               175      page-faults              #    0.014 K/sec                    ( +- 0.26% )
    41,376,162,250      cycles                   #    3.341 GHz                      ( +- 0.11% )
    17,389,139,321      stalled-cycles-frontend  #   42.03% frontend cycles idle     ( +- 0.25% )
   <not supported>      stalled-cycles-backend
    68,788,588,003      instructions             #    1.66  insns per cycle
                                                 #    0.25  stalled cycles per insn  ( +- 0.02% )
    13,449,387,620      branches                 # 1085.940 M/sec                    ( +- 0.02% )
        20,880,690      branch-misses            #    0.16% of all branches          ( +- 0.98% )

      12.372646094 seconds time elapsed                                              ( +- 0.34% )

after:

 Performance counter stats for 'perf bench sched pipe' (10 runs):

      12180.936528      task-clock (msec)        #    1.001 CPUs utilized            ( +- 0.33% )
         2,000,077      context-switches         #    0.164 M/sec                    ( +- 0.00% )
                 0      cpu-migrations           #    0.000 K/sec
               174      page-faults              #    0.014 K/sec                    ( +- 0.27% )
    40,691,545,577      cycles                   #    3.341 GHz                      ( +- 0.06% )
    16,446,333,371      stalled-cycles-frontend  #   40.42% frontend cycles idle     ( +- 0.18% )
   <not supported>      stalled-cycles-backend
    68,570,100,387      instructions             #    1.69  insns per cycle
                                                 #    0.24  stalled cycles per insn  ( +- 0.01% )
    13,389,740,014      branches                 # 1099.237 M/sec                    ( +- 0.01% )
        20,175,440      branch-misses            #    0.15% of all branches          ( +- 0.52% )

      12.169253010 seconds time elapsed                                              ( +- 0.33% )

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2015-06-19  sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration  (Mathieu Desnoyers)

preempt_notifier_unregister() documents:

  "This is safe to call from within a preemption notifier."

However, both fire_sched_in_preempt_notifiers() and fire_sched_out_preempt_notifiers() are using hlist_for_each_entry(), which is not safe against entry removal during iteration.

Inspection of the KVM code does not reveal any use of preempt_notifier_unregister() within the preempt notifiers.

Therefore, fix the comment.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431881590-1456-1-git-send-email-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2015-06-19  sched,lockdep: Employ lock pinning  (Peter Zijlstra)

Employ the new lockdep lock pinning annotation to ensure no 'accidental' lock-breaks happen with rq->lock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: ktkhai@parallels.com
Cc: rostedt@goodmis.org
Cc: juri.lelli@gmail.com
Cc: pang.xunlei@linaro.org
Cc: oleg@redhat.com
Cc: wanpeng.li@linux.intel.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150611124744.003233193@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

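[Conceptually, pinning turns "nobody may drop this lock behind my back" into a checkable assertion. A hedged sketch of the usage pattern; the lockdep_pin_lock()/lockdep_unpin_lock() calls are the annotation this series introduces, while the surrounding code is purely illustrative:]

	raw_spin_lock(&rq->lock);
	lockdep_pin_lock(&rq->lock);	/* from here on, lockdep complains
					 * if anyone releases rq->lock */

	/* ... scheduler work that must not see rq->lock dropped ... */

	lockdep_unpin_lock(&rq->lock);	/* allow release again */
	raw_spin_unlock(&rq->lock);
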
2015-06-19  sched: Streamline the task migration locking a little  (Peter Zijlstra)

The whole migrate_task{,s}() locking seems a little shaky; there's a lot of lock dropping and reacquiring happening. Pull the locking up into the callers as far as possible to streamline the lot.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: ktkhai@parallels.com
Cc: rostedt@goodmis.org
Cc: juri.lelli@gmail.com
Cc: pang.xunlei@linaro.org
Cc: oleg@redhat.com
Cc: wanpeng.li@linux.intel.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150611124743.755256708@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

2015-06-19  sched: Move code around  (Peter Zijlstra)

In preparation to reworking set_cpus_allowed_ptr(), move some code around. This also removes some superfluous #ifdefs and adds comments to some #endifs.

      text    data     bss     dec     hex  filename
  12211532 1738144 1081344 15031020  e55aec  defconfig-build/vmlinux.pre
  12211532 1738144 1081344 15031020  e55aec  defconfig-build/vmlinux.post

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: ktkhai@parallels.com
Cc: rostedt@goodmis.org
Cc: juri.lelli@gmail.com
Cc: pang.xunlei@linaro.org
Cc: oleg@redhat.com
Cc: wanpeng.li@linux.intel.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150611124743.662086684@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

2015-06-19  sched: Allow balance callbacks for check_class_changed()  (Peter Zijlstra)

In order to remove dropping rq->lock from the switched_{to,from}()/prio_changed() sched_class methods, run the balance callbacks after it.

We need to remove dropping rq->lock because it's buggy. Suppose we use sched_setattr()/sched_setscheduler() to change a running task from FIFO to OTHER. By the time we get to switched_from_rt(), the task is already enqueued on the cfs runqueues. If switched_from_rt() does pull_rt_task() and drops rq->lock, load-balancing can come in and move our task @p to another rq. The subsequent switched_to_fair() still assumes @p is on @rq, and bad things will happen.

By using balance callbacks we delay the load-balancing operations {rt,dl}x{push,pull} until we've done all the important work and the task is fully set up. Furthermore, the balance callbacks do not know about @p, therefore they cannot get confused like this.

Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: ktkhai@parallels.com
Cc: rostedt@goodmis.org
Cc: juri.lelli@gmail.com
Cc: pang.xunlei@linaro.org
Cc: oleg@redhat.com
Cc: wanpeng.li@linux.intel.com
Link: http://lkml.kernel.org/r/20150611124742.615343911@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

2015-06-19  sched: Use replace normalize_task() with __sched_setscheduler()  (Peter Zijlstra)

Reduce duplicate logic; normalize_task() is a simplified version of __sched_setscheduler(). Parametrize the difference and collapse the two. This reduces the number of check_class_changed() sites.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: ktkhai@parallels.com
Cc: rostedt@goodmis.org
Cc: juri.lelli@gmail.com
Cc: pang.xunlei@linaro.org
Cc: oleg@redhat.com
Cc: wanpeng.li@linux.intel.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150611124742.532642391@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

2015-06-19  sched: Replace post_schedule with a balance callback list  (Peter Zijlstra)

Generalize the post_schedule() stuff into a balance callback list. This allows us to more easily use it outside of schedule() and across sched_class.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: ktkhai@parallels.com
Cc: rostedt@goodmis.org
Cc: juri.lelli@gmail.com
Cc: pang.xunlei@linaro.org
Cc: oleg@redhat.com
Cc: wanpeng.li@linux.intel.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150611124742.424032725@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

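[A sketch of the queueing side, close to the pattern the patch describes: a singly-linked list of callback_head entries hanging off the runqueue, to be run once rq->lock is released. Field and function names follow the commit subject but should be read as illustrative:]

	static inline void queue_balance_callback(struct rq *rq,
						  struct callback_head *head,
						  void (*func)(struct rq *rq))
	{
		lockdep_assert_held(&rq->lock);

		if (unlikely(head->next))	/* already queued */
			return;

		head->func = (void (*)(struct callback_head *))func;
		head->next = rq->balance_callback;
		rq->balance_callback = head;
	}

[The caller queues its push/pull work under rq->lock; whoever eventually drops the lock walks the list and invokes each callback, so no balancing runs while the task is half set up.]
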
2015-06-19  Merge branch 'timers/core' into sched/hrtimers  (Thomas Gleixner)

Merge sched/core and timers/core so we can apply the sched balancing patch queue, which depends on both.

2015-06-07  preempt: Use preempt_schedule_context() as the official tracing preemption point  (Frederic Weisbecker)

preempt_schedule_context() is a tracing-safe preemption point, but it's only used when CONFIG_CONTEXT_TRACKING=y. Other configs have tracing recursion issues since commit:

  b30f0e3ffedf ("sched/preempt: Optimize preemption operations on __schedule() callers")

introduced function-based preempt_count_*() ops.

Let's make it available on all configs and give it a more appropriate name for its new position.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1433432349-1021-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2015-06-07  sched: Make preempt_schedule_context() function-tracing safe  (Frederic Weisbecker)

Since function tracing disables preemption, it needs a safe preemption point to use when preemption is re-enabled, without worrying about tracing recursion. That is: to avoid tracing recursion, that preemption point can't itself be traced (hence the notrace qualifier), and it can't call any traceable function before it disables preemption, which disarms the recursion.

preempt_schedule() was fine until commit:

  b30f0e3ffedf ("sched/preempt: Optimize preemption operations on __schedule() callers")

because PREEMPT_ACTIVE (which has the property of disabling preemption, and thus disarming tracing preemption recursion) was set before calling any further function.

But that commit introduced the use of the preempt_count_add/sub() functions to set PREEMPT_ACTIVE, and because these functions are called before preemption gets a chance to be disabled, we have a tracing recursion.

preempt_schedule_context() is one of the possible preemption functions used by tracing. Its special purpose is to avoid tracing recursion against context tracking. Let's enhance this function to become more generally tracing-safe by disabling preemption with raw accessors, such that no function is called before preemption gets disabled, disarming the tracing recursion.

This function is going to become the specific tracing-safe preemption point in a further commit.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1433432349-1021-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

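[The resulting shape of preempt_schedule_context() is roughly the following; a sketch based on the description above, where __preempt_count_add()/__preempt_count_sub() are the raw, non-traced accessors being referred to:]

	asmlinkage __visible void __sched notrace preempt_schedule_context(void)
	{
		enum ctx_state prev_ctx;

		if (likely(!preemptible()))
			return;

		do {
			/*
			 * Raw accessor: no traceable function runs before
			 * preemption is disabled, so the function tracer
			 * cannot recurse into this path.
			 */
			__preempt_count_add(PREEMPT_ACTIVE);
			/*
			 * Needs preempt disabled in case user_exit() is
			 * traced and the tracer calls preempt_enable_notrace()
			 * causing an infinite recursion.
			 */
			prev_ctx = exception_enter();
			__schedule();
			exception_exit(prev_ctx);

			__preempt_count_sub(PREEMPT_ACTIVE);
			barrier();
		} while (need_resched());
	}
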
2015-06-02  Merge branch 'linus' into sched/core, to resolve conflict  (Ingo Molnar)

Conflicts:
	arch/sparc/include/asm/topology_64.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>

2015-05-27  Merge branch 'perf/urgent' into perf/core, before applying dependent patches  (Ingo Molnar)

Signed-off-by: Ingo Molnar <mingo@kernel.org>

2015-05-22  Merge branch 'for-linus' of git://git.kernel.dk/linux-block  (Linus Torvalds)

Pull block fixes from Jens Axboe:
 "Three small fixes that have been picked up the last few weeks. Specifically:

   - Fix a memory corruption issue in NVMe with malignant user constructed request. From Christoph.

   - Kill (now) unused blk_queue_bio(); dm was changed to not need this anymore. From Mike Snitzer.

   - Always use blk_schedule_flush_plug() from the io_schedule() path when flushing a plug, fixing a !TASK_RUNNING warning with md. From Shaohua"

* 'for-linus' of git://git.kernel.dk/linux-block:
  sched: always use blk_schedule_flush_plug in io_schedule_out
  nvme: fix kernel memory corruption with short INQUIRY buffers
  block: remove export for blk_queue_bio

2015-05-19  Merge branch 'linus' into timers/core  (Thomas Gleixner)

Make sure the upstream fixes are applied before adding further modifications.

2015-05-19  sched/preempt: Optimize preemption operations on __schedule() callers  (Frederic Weisbecker)

__schedule() disables preemption, and some of its callers (the preempt_schedule*() family) also set PREEMPT_ACTIVE. So we have two preempt_count() modifications that could be performed at once.

Let's remove the preemption disablement from __schedule() and pull this responsibility up to its callers, in order to optimize preempt_count() operations in a single place.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431441711-29753-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2015-05-18  sched: always use blk_schedule_flush_plug in io_schedule_out  (Shaohua Li)

The block plug callback could sleep, so we introduced a parameter 'from_schedule' that the corresponding drivers can use to distinguish a schedule plug flush from a plug finish. Unfortunately, io_schedule_out still uses blk_flush_plug(). This causes the output below. (Note: I added a might_sleep() in raid1_unplug to make it trigger faster, but the whole thing doesn't matter if I add might_sleep.) In raid1/10, this can cause a deadlock.

This patch makes io_schedule_out always use blk_schedule_flush_plug. This should only impact drivers (as far as I know, raid 1/10) which are sensitive to the 'from_schedule' parameter.

[ 370.817949] ------------[ cut here ]------------
[ 370.817960] WARNING: CPU: 7 PID: 145 at ../kernel/sched/core.c:7306 __might_sleep+0x7f/0x90()
[ 370.817969] do not call blocking ops when !TASK_RUNNING; state=2 set at [<ffffffff81092fcf>] prepare_to_wait+0x2f/0x90
[ 370.817971] Modules linked in: raid1
[ 370.817976] CPU: 7 PID: 145 Comm: kworker/u16:9 Tainted: G W 4.0.0+ #361
[ 370.817977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
[ 370.817983] Workqueue: writeback bdi_writeback_workfn (flush-9:1)
[ 370.817985] ffffffff81cd83be ffff8800ba8cb298 ffffffff819dd7af 0000000000000001
[ 370.817988] ffff8800ba8cb2e8 ffff8800ba8cb2d8 ffffffff81051afc ffff8800ba8cb2c8
[ 370.817990] ffffffffa00061a8 000000000000041e 0000000000000000 ffff8800ba8cba28
[ 370.817993] Call Trace:
[ 370.817999] [<ffffffff819dd7af>] dump_stack+0x4f/0x7b
[ 370.818002] [<ffffffff81051afc>] warn_slowpath_common+0x8c/0xd0
[ 370.818004] [<ffffffff81051b86>] warn_slowpath_fmt+0x46/0x50
[ 370.818006] [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
[ 370.818008] [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
[ 370.818010] [<ffffffff810776ef>] __might_sleep+0x7f/0x90
[ 370.818014] [<ffffffffa0000c03>] raid1_unplug+0xd3/0x170 [raid1]
[ 370.818024] [<ffffffff81421d9a>] blk_flush_plug_list+0x8a/0x1e0
[ 370.818028] [<ffffffff819e3550>] ? bit_wait+0x50/0x50
[ 370.818031] [<ffffffff819e21b0>] io_schedule_timeout+0x130/0x140
[ 370.818033] [<ffffffff819e3586>] bit_wait_io+0x36/0x50
[ 370.818034] [<ffffffff819e31b5>] __wait_on_bit+0x65/0x90
[ 370.818041] [<ffffffff8125b67c>] ? ext4_read_block_bitmap_nowait+0xbc/0x630
[ 370.818043] [<ffffffff819e3550>] ? bit_wait+0x50/0x50
[ 370.818045] [<ffffffff819e3302>] out_of_line_wait_on_bit+0x72/0x80
[ 370.818047] [<ffffffff810935e0>] ? autoremove_wake_function+0x40/0x40
[ 370.818050] [<ffffffff811de744>] __wait_on_buffer+0x44/0x50
[ 370.818053] [<ffffffff8125ae80>] ext4_wait_block_bitmap+0xe0/0xf0
[ 370.818058] [<ffffffff812975d6>] ext4_mb_init_cache+0x206/0x790
[ 370.818062] [<ffffffff8114bc6c>] ? lru_cache_add+0x1c/0x50
[ 370.818064] [<ffffffff81297c7e>] ext4_mb_init_group+0x11e/0x200
[ 370.818066] [<ffffffff81298231>] ext4_mb_load_buddy+0x341/0x360
[ 370.818068] [<ffffffff8129a1a3>] ext4_mb_find_by_goal+0x93/0x2f0
[ 370.818070] [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
[ 370.818072] [<ffffffff8129ab67>] ext4_mb_regular_allocator+0x67/0x460
[ 370.818074] [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
[ 370.818076] [<ffffffff8129ca4b>] ext4_mb_new_blocks+0x4cb/0x620
[ 370.818079] [<ffffffff81290956>] ext4_ext_map_blocks+0x4c6/0x14d0
[ 370.818081] [<ffffffff812a4d4e>] ? ext4_es_lookup_extent+0x4e/0x290
[ 370.818085] [<ffffffff8126399d>] ext4_map_blocks+0x14d/0x4f0
[ 370.818088] [<ffffffff81266fbd>] ext4_writepages+0x76d/0xe50
[ 370.818094] [<ffffffff81149691>] do_writepages+0x21/0x50
[ 370.818097] [<ffffffff811d5c00>] __writeback_single_inode+0x60/0x490
[ 370.818099] [<ffffffff811d630a>] writeback_sb_inodes+0x2da/0x590
[ 370.818103] [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
[ 370.818105] [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
[ 370.818107] [<ffffffff811d665f>] __writeback_inodes_wb+0x9f/0xd0
[ 370.818109] [<ffffffff811d69db>] wb_writeback+0x34b/0x3c0
[ 370.818111] [<ffffffff811d70df>] bdi_writeback_workfn+0x23f/0x550
[ 370.818116] [<ffffffff8106bbd8>] process_one_work+0x1c8/0x570
[ 370.818117] [<ffffffff8106bb5b>] ? process_one_work+0x14b/0x570
[ 370.818119] [<ffffffff8106c09b>] worker_thread+0x11b/0x470
[ 370.818121] [<ffffffff8106bf80>] ? process_one_work+0x570/0x570
[ 370.818124] [<ffffffff81071868>] kthread+0xf8/0x110
[ 370.818126] [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
[ 370.818129] [<ffffffff819e9322>] ret_from_fork+0x42/0x70
[ 370.818131] [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
[ 370.818132] ---[ end trace 7b4deb71e68b6605 ]---

V2: don't change ->in_iowait

Cc: NeilBrown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

2015-05-18  sched,perf: Fix periodic timers  (Peter Zijlstra)

In the below two commits (see Fixes) we have periodic timers that can stop themselves when they're no longer required, but need to be (re)started when their idle condition changes.

A further complication is that we want the timer handler to always do the forward, such that it will always correctly deal with the overruns, and we do not want to race such that the handler has already decided to stop but the (external) restart sees the timer still active, leaving us with a 'lost' timer.

The problem with the current code is that the restart can come before the callback does the forward, at which point the forward from the callback will WARN about forwarding an enqueued timer.

Now, conceptually it's easy to detect if you're before or after the forward by comparing the expiration time against the current time. Of course, that's expensive (and racy) because we don't have the current time.

Alternatively, one could cache this state inside the timer, but then everybody pays the overhead of maintaining this extra state, and that is undesired.

The only other option that I could see is the external timer_active variable, which I tried to kill before. I would love a nicer interface for this seemingly simple 'problem' but alas.

Fixes: 272325c4821f ("perf: Fix mux_interval hrtimer wreckage")
Fixes: 77a4d1a1b9a1 ("sched: Cleanup bandwidth timers")
Cc: pjt@google.com
Cc: tglx@linutronix.de
Cc: klamm@yandex-team.ru
Cc: mingo@kernel.org
Cc: bsegall@google.com
Cc: hpa@zytor.com
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150514102311.GX21418@twins.programming.kicks-ass.net

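[A hedged sketch of the flag-based pattern the text settles on. The struct, lock usage, and do_bandwidth_work() helper are illustrative stand-ins, not the exact sched code; the hrtimer calls themselves (hrtimer_forward_now(), hrtimer_start_expires()) are the real API:]

	struct bw {
		raw_spinlock_t	lock;
		ktime_t		period;
		int		period_active;	/* protected by ->lock */
		struct hrtimer	period_timer;
	};

	/* (Re)start side, serialized against the handler by bw->lock: */
	static void start_bandwidth_timer(struct bw *bw)
	{
		lockdep_assert_held(&bw->lock);

		if (bw->period_active)
			return;			/* handler still owns the timer */

		bw->period_active = 1;
		hrtimer_forward_now(&bw->period_timer, bw->period);
		hrtimer_start_expires(&bw->period_timer, HRTIMER_MODE_ABS_PINNED);
	}

	/* Handler side, decides to stop or forward under the same lock: */
	static enum hrtimer_restart period_timer_fn(struct hrtimer *timer)
	{
		struct bw *bw = container_of(timer, struct bw, period_timer);
		int idle;

		raw_spin_lock(&bw->lock);
		idle = do_bandwidth_work(bw);	/* illustrative */
		if (idle)
			bw->period_active = 0;	/* a later start may re-arm us */
		else
			hrtimer_forward_now(timer, bw->period);
		raw_spin_unlock(&bw->lock);

		return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
	}
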
2015-05-08  perf: Fix software migrate events  (Peter Zijlstra)

Stephane asked about PERF_COUNT_SW_CPU_MIGRATIONS and I realized it was borken:

 > The problem is that the task isn't actually scheduled while its being
 > migrated (obviously), and if its not scheduled, the counters aren't
 > scheduled either, so there's no observing of the fact.
 >
 > A further problem with migrations is that many migrations happen from
 > softirq context, which is nested inside the 'random' task context of
 > whoemever happens to run at that time, similarly for the wakeup
 > migrations triggered from (soft)irq context. All those end up being
 > accounted in the task that's currently running, eg. your 'ls'.

The below cures this by marking a task as migrated and accounting it on the subsequent sched_in().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2015-05-08  sched: Implement lockless wake-queues  (Peter Zijlstra)

This is useful for locking primitives that can effect multiple wakeups per operation and want to avoid the lock-internal lock contention by delaying the wakeups until we've released the lock-internal locks.

Alternatively, it can be used to avoid issuing multiple wakeups, and thus save a few cycles, in packet processing. Queue all target tasks and wake up once you've processed all packets. That way you avoid waking the target task multiple times if there were multiple packets for the same task.

Properties of a wake_q are:

  - Lockless, as the queue head must reside on the stack.

  - Being a queue, it maintains the wakeup order passed by the callers. This can be important: otherwise, in scenarios with highly contended locks, any reliance on lock fairness could be affected.

  - A queued task cannot be added again until it is woken up.

This patch adds the needed infrastructure into the scheduler code and uses the new wake_list to delay the futex wakeups until after we've released the hash bucket locks.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[tweaks, adjustments, comments, etc.]
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Mason <clm@fb.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: George Spelvin <linux@horizon.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1430494072-30283-2-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>

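[Usage follows a queue-then-flush pattern. A sketch using the API this patch introduces (WAKE_Q(), wake_q_add(), wake_up_q()); the hb->lock here stands in for whatever lock-internal lock the caller wants to keep short, e.g. a futex hash bucket lock:]

	WAKE_Q(wake_q);			/* on-stack queue head */

	spin_lock(&hb->lock);
	/* ... pick the task(s) to wake while holding the lock ... */
	wake_q_add(&wake_q, task);	/* queue it; takes a task reference */
	spin_unlock(&hb->lock);

	wake_up_q(&wake_q);		/* do the actual wakeups, lock-free */
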
2015-05-08  sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE()  (Jason Low)

ACCESS_ONCE doesn't work reliably on non-scalar types. This patch removes the rest of the existing usages of ACCESS_ONCE() in the scheduler and uses the new READ_ONCE() and WRITE_ONCE() APIs as appropriate.

Signed-off-by: Jason Low <jason.low2@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Waiman Long <Waiman.Long@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aswin Chandramouleeswaran <aswin@hp.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1430251224-5764-2-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

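[The conversion itself is mechanical; an illustrative pair (not a specific hunk from the patch):]

	/* Before: ACCESS_ONCE() for both reads and writes; the write
	 * form is an implicit lvalue cast and breaks on non-scalars. */
	curr = ACCESS_ONCE(rq->curr);
	ACCESS_ONCE(t->state) = TASK_RUNNING;

	/* After: explicit one-shot read and write. */
	curr = READ_ONCE(rq->curr);
	WRITE_ONCE(t->state, TASK_RUNNING);
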
2015-05-08  sched/core: Remove unnecessary down/up conversion  (Nicholas Mc Guire)

'rt_period_us' is automatically type-converted from u64 to long and then cast back to u64; this down/up conversion is unnecessary and can be removed to improve readability.

This will also help us not truncate 'rt_period_us' to 32 bits on 32-bit kernels, should we ever have such large values (unlikely, not least due to procfs).

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1430643116-24049-1-git-send-email-hofrat@osadl.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2015-05-08  sched: Move the loadavg code to a more obvious location  (Peter Zijlstra)

I could not find the loadavg code... it turns out it was hidden in a file called proc.c. It further got mingled up with the crufty per-rq load indexes (which we really want to get rid of).

Move the per-rq load indexes into the fair.c load-balance code (that's the only thing that uses them) and rename proc.c to loadavg.c so we can find it again.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
[ Did minor cleanups to the code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2015-05-08  Merge branch 'sched/urgent' into sched/core, before applying new patches  (Ingo Molnar)

Signed-off-by: Ingo Molnar <mingo@kernel.org>

2015-05-08  sched/core: Remove __cpuinit section tag that crept back in  (Paul Gortmaker)

We removed __cpuinit support (leaving no-op stubs) quite some time ago. However, this one crept back in as of commit a803f0261bb2bb57aab ("sched: Initialize rq->age_stamp on processor start").

Since we want to clobber the stubs too, get this removed now.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Corey Minyard <cminyard@mvista.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1430174880-27958-2-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2015-05-08  sched/core: Fix regression in cpuset_cpu_inactive() for suspend  (Omar Sandoval)

Commit 3c18d447b3b3 ("sched/core: Check for available DL bandwidth in cpuset_cpu_inactive()"), a SCHED_DEADLINE bugfix, had a logic error that caused a regression in setting a CPU inactive during suspend. I ran into this when a program was failing pthread_setaffinity_np() with EINVAL after a suspend+wake up.

A simple reproducer:

	$ ./a.out
	sched_setaffinity: Success
	$ systemctl suspend
	$ ./a.out
	sched_setaffinity: Invalid argument

... where ./a.out is:

	#define _GNU_SOURCE
	#include <errno.h>
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		long num_cores;
		cpu_set_t cpu_set;
		int ret;

		num_cores = sysconf(_SC_NPROCESSORS_ONLN);
		CPU_ZERO(&cpu_set);
		CPU_SET(num_cores - 1, &cpu_set);
		errno = 0;
		ret = sched_setaffinity(getpid(), sizeof(cpu_set), &cpu_set);
		perror("sched_setaffinity");
		return ret ? EXIT_FAILURE : EXIT_SUCCESS;
	}

The mistake is that suspend is handled in the action == CPU_DOWN_PREPARE_FROZEN case of the switch statement in cpuset_cpu_inactive(). However, the commit in question masked out CPU_TASKS_FROZEN from the action, making this case dead. The fix is straightforward.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 3c18d447b3b3 ("sched/core: Check for available DL bandwidth in cpuset_cpu_inactive()")
Link: http://lkml.kernel.org/r/1cb5ecb3d6543c38cce5790387f336f54ec8e2bc.1430733960.git.osandov@osandov.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2015-05-08  sched: Handle priority boosted tasks proper in setscheduler()  (Thomas Gleixner)

Ronny reported that the following scenario is not handled correctly:

	T1 (prio = 10)
	   lock(rtmutex);

	T2 (prio = 20)
	   lock(rtmutex)
	      boost T1

	T1 (prio = 20)
	   sys_set_scheduler(prio = 30)
	   T1 prio = 30
	   ....
	   sys_set_scheduler(prio = 10)
	   T1 prio = 30

The last step is wrong as T1 should now be back at prio 20.

Commit c365c292d059 ("sched: Consider pi boosting in setscheduler()") only handles the case where a boosted task tries to lower its priority.

Fix it by taking the new effective priority into account for the decision whether a change of the priority is required.

Reported-by: Ronny Meeus <ronny.meeus@gmail.com>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Fixes: c365c292d059 ("sched: Consider pi boosting in setscheduler()")
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1505051806060.4225@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2015-05-07  nohz: Set isolcpus when nohz_full is set  (Chris Metcalf)

nohz_full is only useful if isolcpus is also set, since otherwise the scheduler has to run periodically to try to determine whether to steal work from other cores.

Accordingly, when booting with nohz_full=xxx on the command line, we should act as if isolcpus=xxx was also set, and set (or extend) the isolcpus set to include the nohz_full cpus.

Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Jones <davej@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1430928266-24888-5-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

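[For example (hypothetical CPU numbers), booting with:

	nohz_full=2-7

now behaves as if the command line had been:

	nohz_full=2-7 isolcpus=2-7

so CPUs 2-7 are excluded from the scheduler's load balancing as well as from the periodic tick.]
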
2015-05-07  context_tracking: Inherit TIF_NOHZ through forks instead of context switches  (Frederic Weisbecker)

TIF_NOHZ is used by context_tracking to force the syscall slow-path on every task in order to track userspace roundtrips. As such, it must be set on all running tasks.

It's currently explicitly inherited through context switches. There is no need to do it in this fast-path, though. The flag could simply be set once for all on all tasks, whether they are running or not.

Let's do this by setting the flag for the init task at early boot and letting it propagate through fork inheritance.

While at it, mark context_tracking_cpu_set() as init code; we only need it at early boot time.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Dave Jones <davej@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1430928266-24888-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2015-04-27  x86: pvclock: Really remove the sched notifier for cross-cpu migrations  (Paolo Bonzini)

This reverts commits 0a4e6be9ca17c54817cf814b4b5aa60478c6df27 and 80f7fdb1c7f0f9266421f823964fd1962681f6ce.

The task migration notifier was originally introduced in order to support the pvclock vsyscall with non-synchronized TSC, but KVM only supports it with synchronized TSC. Hence, on KVM the race condition is only needed due to a bad implementation on the host side, and even then it's so rare that it's mostly theoretical.

As far as KVM is concerned, it's possible to fix the host, avoiding the additional complexity in the vDSO and the (re)introduction of the task migration notifier.

Xen, on the other hand, hasn't yet implemented vsyscall support at all, so we do not care about its plans for non-synchronized TSC.

Reported-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2015-04-22  sched: Cleanup bandwidth timers  (Peter Zijlstra)

Roman reported a 3-cpu lockup scenario involving __start_cfs_bandwidth().

The more I look at that code the more I'm convinced it's crack; that entire __start_cfs_bandwidth() thing is brain melting. We don't need to cancel a timer before starting it: *hrtimer_start*() will happily remove the timer for you if it's still enqueued.

Removing that removes a big part of the problem: no more ugly cancel loop to get stuck in.

So now, if I understand things right, the entire reason you have this cfs_b->lock guarded ->timer_active nonsense is to make sure we don't accidentally lose the timer.

It appears to me that it should be possible to guarantee the same by unconditionally (re)starting the timer when !queued. Because regardless of what hrtimer::function will return, if we beat it to (re)enqueue the timer, it doesn't matter.

Now, because hrtimers don't come with any serialization guarantees, we must ensure both the handler and the (re)start loop serialize their access to the hrtimer to avoid both trying to forward the timer at the same time.

Update the rt bandwidth timer to match.

This effectively reverts: 09dc4ab03936 ("sched/fair: Fix tg_set_cfs_bandwidth() deadlock on rq->lock").

Reported-by: Roman Gushchin <klamm@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Paul Turner <pjt@google.com>
Link: http://lkml.kernel.org/r/20150415095011.804589208@infradead.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

2015-04-22  sched: core: Use hrtimer_start[_expires]()  (Thomas Gleixner)

hrtimer_start() now enforces a timer interrupt when an already expired timer is enqueued.

Get rid of the __hrtimer_start_range_ns() invocations and the loops around it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20150414203502.531131739@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

2015-04-14  Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull NOHZ changes from Ingo Molnar:
 "This tree adds full dynticks support to KVM guests (support the disabling of the timer tick on the guest). The main missing piece was the recognition of guest execution as RCU extended quiescent state and related changes"

* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kvm,rcu,nohz: use RCU extended quiescent state when running KVM guest
  context_tracking: Export context_tracking_user_enter/exit
  context_tracking: Run vtime_user_enter/exit only when state == CONTEXT_USER
  context_tracking: Add stub context_tracking_is_enabled
  context_tracking: Generalize context tracking APIs to support user and guest
  context_tracking: Rename context symbols to prepare for transition state
  ppc: Remove unused cpp symbols in kvm headers

2015-04-13  Merge branch 'for-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup  (Linus Torvalds)

Pull cgroup updates from Tejun Heo:
 "Nothing too interesting. Rik made cpuset cooperate better with isolcpus and there are several other cleanup patches"

* 'for-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset, isolcpus: document relationship between cpusets & isolcpus
  cpusets, isolcpus: exclude isolcpus from load balancing in cpusets
  sched, isolcpu: make cpu_isolated_map visible outside scheduler
  cpuset: initialize cpuset a bit early
  cgroup: Use kvfree in pidlist_free()
  cgroup: call cgroup_subsys->bind on cgroup subsys initialization

2015-04-13  Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull scheduler changes from Ingo Molnar:
 "Major changes:

   - Reworked CPU capacity code, for better SMP load balancing on system