summaryrefslogtreecommitdiffstats
path: root/kernel/sched/rt.c
AgeCommit message (Collapse)Author
2019-09-17Merge branch 'timers-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core timer updates from Thomas Gleixner: "Timers and timekeeping updates: - A large overhaul of the posix CPU timer code which is a preparation for moving the CPU timer expiry out into task work so it can be properly accounted on the task/process. An update to the bogus permission checks will come later during the merge window as feedback was not complete before heading of for travel. - Switch the timerqueue code to use cached rbtrees and get rid of the homebrewn caching of the leftmost node. - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a single function - Implement the separation of hrtimers to be forced to expire in hard interrupt context even when PREEMPT_RT is enabled and mark the affected timers accordingly. - Implement a mechanism for hrtimers and the timer wheel to protect RT against priority inversion and live lock issues when a (hr)timer which should be canceled is currently executing the callback. Instead of infinitely spinning, the task which tries to cancel the timer blocks on a per cpu base expiry lock which is held and released by the (hr)timer expiry code. - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests resulting in faster access to timekeeping functions. - Updates to various clocksource/clockevent drivers and their device tree bindings. - The usual small improvements all over the place" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits) posix-cpu-timers: Fix permission check regression posix-cpu-timers: Always clear head pointer on dequeue hrtimer: Add a missing bracket and hide `migration_base' on !SMP posix-cpu-timers: Make expiry_active check actually work correctly posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build tick: Mark sched_timer to expire in hard interrupt context hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n posix-cpu-timers: Utilize timerqueue for storage posix-cpu-timers: Move state tracking to struct posix_cputimers posix-cpu-timers: Deduplicate rlimit handling posix-cpu-timers: Remove pointless comparisons posix-cpu-timers: Get rid of 64bit divisions posix-cpu-timers: Consolidate timer expiry further posix-cpu-timers: Get rid of zero checks rlimit: Rewrite non-sensical RLIMIT_CPU comment posix-cpu-timers: Respect INFINITY for hard RTTIME limit posix-cpu-timers: Switch thread group sampling to array posix-cpu-timers: Restructure expiry array posix-cpu-timers: Remove cputime_expires ...
2019-08-28posix-cpu-timers: Move expiry cache into struct posix_cputimersThomas Gleixner
The expiry cache belongs into the posix_cputimers container where the other cpu timers information is. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lkml.kernel.org/r/20190821192921.014444012@linutronix.de
2019-08-08sched: Rework pick_next_task() slow-pathPeter Zijlstra
Avoid the RETRY_TASK case in the pick_next_task() slow path. By doing the put_prev_task() early, we get the rt/deadline pull done, and by testing rq->nr_running we know if we need newidle_balance(). This then gives a stable state to pick a task from. Since the fast-path is fair only; it means the other classes will always have pick_next_task(.prev=NULL, .rf=NULL) and we can simplify. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Aaron Lu <aaron.lwe@gmail.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: mingo@kernel.org Cc: Phil Auld <pauld@redhat.com> Cc: Julien Desfossez <jdesfossez@digitalocean.com> Cc: Nishanth Aravamudan <naravamudan@digitalocean.com> Link: https://lkml.kernel.org/r/aa34d24b36547139248f32a30138791ac6c02bd6.1559129225.git.vpillai@digitalocean.com
2019-08-08sched: Allow put_prev_task() to drop rq->lockPeter Zijlstra
Currently the pick_next_task() loop is convoluted and ugly because of how it can drop the rq->lock and needs to restart the picking. For the RT/Deadline classes, it is put_prev_task() where we do balancing, and we could do this before the picking loop. Make this possible. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Aaron Lu <aaron.lwe@gmail.com> Cc: mingo@kernel.org Cc: Phil Auld <pauld@redhat.com> Cc: Julien Desfossez <jdesfossez@digitalocean.com> Cc: Nishanth Aravamudan <naravamudan@digitalocean.com> Link: https://lkml.kernel.org/r/e4519f6850477ab7f3d257062796e6425ee4ba7c.1559129225.git.vpillai@digitalocean.com
2019-08-08sched: Add task_struct pointer to sched_class::set_curr_taskPeter Zijlstra
In preparation of further separating pick_next_task() and set_curr_task() we have to pass the actual task into it, while there, rename the thing to better pair with put_prev_task(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Aaron Lu <aaron.lwe@gmail.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: mingo@kernel.org Cc: Phil Auld <pauld@redhat.com> Cc: Julien Desfossez <jdesfossez@digitalocean.com> Cc: Nishanth Aravamudan <naravamudan@digitalocean.com> Link: https://lkml.kernel.org/r/a96d1bcdd716db4a4c5da2fece647a1456c0ed78.1559129225.git.vpillai@digitalocean.com
2019-08-08sched/{rt,deadline}: Fix set_next_task vs pick_next_taskPeter Zijlstra
Because pick_next_task() implies set_curr_task() and some of the details haven't mattered too much, some of what _should_ be in set_curr_task() ended up in pick_next_task, correct this. This prepares the way for a pick_next_task() variant that does not affect the current state; allowing remote picking. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Aaron Lu <aaron.lwe@gmail.com> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: mingo@kernel.org Cc: Phil Auld <pauld@redhat.com> Cc: Julien Desfossez <jdesfossez@digitalocean.com> Cc: Nishanth Aravamudan <naravamudan@digitalocean.com> Link: https://lkml.kernel.org/r/38c61d5240553e043c27c5e00b9dd0d184dd6081.1559129225.git.vpillai@digitalocean.com
2019-08-01sched: Mark hrtimers to expire in hard interrupt contextSebastian Andrzej Siewior
The scheduler related hrtimers need to expire in hard interrupt context even on PREEMPT_RT enabled kernels. Mark then as such. No functional change. [ tglx: Split out from larger combo patch. Add changelog. ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190726185753.077004842@linutronix.de
2019-06-24sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasksPatrick Bellasi
Each time a frequency update is required via schedutil, a frequency is selected to (possibly) satisfy the utilization reported by each scheduling class and irqs. However, when utilization clamping is in use, the frequency selection should consider userspace utilization clamping hints. This will allow, for example, to: - boost tasks which are directly affecting the user experience by running them at least at a minimum "requested" frequency - cap low priority tasks not directly affecting the user experience by running them only up to a maximum "allowed" frequency These constraints are meant to support a per-task based tuning of the frequency selection thus supporting a fine grained definition of performance boosting vs energy saving strategies in kernel space. Add support to clamp the utilization of RUNNABLE FAIR and RT tasks within the boundaries defined by their aggregated utilization clamp constraints. Do that by considering the max(min_util, max_util) to give boosted tasks the performance they need even when they happen to be co-scheduled with other capped tasks. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alessio Balsini <balsini@android.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Steve Muckle <smuckle@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: https://lkml.kernel.org/r/20190621084217.8167-10-patrick.bellasi@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03sched/core: Provide a pointer to the valid CPU maskSebastian Andrzej Siewior
In commit: 4b53a3412d66 ("sched/core: Remove the tsk_nr_cpus_allowed() wrapper") the tsk_nr_cpus_allowed() wrapper was removed. There was not much difference in !RT but in RT we used this to implement migrate_disable(). Within a migrate_disable() section the CPU mask is restricted to single CPU while the "normal" CPU mask remains untouched. As an alternative implementation Ingo suggested to use: struct task_struct { const cpumask_t *cpus_ptr; cpumask_t cpus_mask; }; with t->cpus_ptr = &t->cpus_mask; In -RT we then can switch the cpus_ptr to: t->cpus_ptr = &cpumask_of(task_cpu(p)); in a migration disabled region. The rules are simple: - Code that 'uses' ->cpus_allowed would use the pointer. - Code that 'modifies' ->cpus_allowed would use the direct mask. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190423142636.14347-1-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-19sched/rt: Check integer overflow at usec to nsec conversionKonstantin Khlebnikov
Example of unhandled overflows: # echo 18446744073709651 > cpu.rt_runtime_us # cat cpu.rt_runtime_us 99 # echo 18446744073709900 > cpu.rt_period_us # cat cpu.rt_period_us 348 After this patch they will fail with -EINVAL. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/155125501739.293431.5252197504404771496.stgit@buzz Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-04sched/fair: Update scale invariance of PELTVincent Guittot
The current implementation of load tracking invariance scales the contribution with current frequency and uarch performance (only for utilization) of the CPU. One main result of this formula is that the figures are capped by current capacity of CPU. Another one is that the load_avg is not invariant because not scaled with uarch. The util_avg of a periodic task that runs r time slots every p time slots varies in the range : U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p) with U is the max util_avg value = SCHED_CAPACITY_SCALE At a lower capacity, the range becomes: U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p) with C reflecting the compute capacity ratio between current capacity and max capacity. so C tries to compensate changes in (1-y^r') but it can't be accurate. Instead of scaling the contribution value of PELT algo, we should scale the running time. The PELT signal aims to track the amount of computation of tasks and/or rq so it seems more correct to scale the running time to reflect the effective amount of computation done since the last update. In order to be fully invariant, we need to apply the same amount of running time and idle time whatever the current capacity. Because running at lower capacity implies that the task will run longer, we have to ensure that the same amount of idle time will be applied when system becomes idle and no idle time has been "stolen". But reaching the maximum utilization value (SCHED_CAPACITY_SCALE) means that the task is seen as an always-running task whatever the capacity of the CPU (even at max compute capacity). In this case, we can discard this "stolen" idle times which becomes meaningless. In order to achieve this time scaling, a new clock_pelt is created per rq. The increase of this clock scales with current capacity when something is running on rq and synchronizes with clock_task when rq is idle. With this mechanism, we ensure the same running and idle time whatever the current capacity. This also enables to simplify the pelt algorithm by removing all references of uarch and frequency and applying the same contribution to utilization and loads. Furthermore, the scaling is done only once per update of clock (update_rq_clock_task()) instead of during each update of sched_entities and cfs/rt/dl_rq of the rq like the current implementation. This is interesting when cgroup are involved as shown in the results below: On a hikey (octo Arm64 platform). Performance cpufreq governor and only shallowest c-state to remove variance generated by those power features so we only track the impact of pelt algo. each test runs 16 times: ./perf bench sched pipe (higher is better) kernel tip/sched/core + patch ops/seconds ops/seconds diff cgroup root 59652(+/- 0.18%) 59876(+/- 0.24%) +0.38% level1 55608(+/- 0.27%) 55923(+/- 0.24%) +0.57% level2 52115(+/- 0.29%) 52564(+/- 0.22%) +0.86% hackbench -l 1000 (lower is better) kernel tip/sched/core + patch duration(sec) duration(sec) diff cgroup root 4.453(+/- 2.37%) 4.383(+/- 2.88%) -1.57% level1 4.859(+/- 8.50%) 4.830(+/- 7.07%) -0.60% level2 5.063(+/- 9.83%) 4.928(+/- 9.66%) -2.66% Then, the responsiveness of PELT is improved when CPU is not running at max capacity with this new algorithm. I have put below some examples of duration to reach some typical load values according to the capacity of the CPU with current implementation and with this patch. These values has been computed based on the geometric series and the half period value: Util (%) max capacity half capacity(mainline) half capacity(w/ patch) 972 (95%) 138ms not reachable 276ms 486 (47.5%) 30ms 138ms 60ms 256 (25%) 13ms 32ms 26ms On my hikey (octo Arm64 platform) with schedutil governor, the time to reach max OPP when starting from a null utilization, decreases from 223ms with current scale invariance down to 121ms with the new algorithm. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: patrick.bellasi@arm.com Cc: pjt@google.com Cc: pkondeti@codeaurora.org Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: srinivas.pandruvada@linux.intel.com Cc: thara.gopinath@linaro.org Link: https://lkml.kernel.org/r/1548257214-13745-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-11sched/core: Remove unnecessary unlikely() in push_*_task()Yangtao Li
WARN_ON() already contains an unlikely(), so it's not necessary to use WARN_ON(1). Signed-off-by: Yangtao Li <tiny.windzz@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20181103172602.1917-1-tiny.windzz@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-11-04sched/core: Introduce set_next_task() helper for better code readabilityMuchun Song
When we pick the next task, we will do the following for the task: 1) p->se.exec_start = rq_clock_task(rq); 2) dequeue_pushable(_dl)_task(rq, p); When we call set_curr_task(), we also need to do the same thing above. In rt.c, the code at 1) is in the _pick_next_task_rt() and the code at 2) is in the pick_next_task_rt(). If we put two operations in one function, maybe better. So, we introduce a new function set_next_task(), which is responsible for doing the above. By introducing the function we can get rid of calling the dequeue_pushable(_dl)_task() directly(We can call set_next_task()) in pick_next_task() and have better code readability and reuse. In set_curr_task_rt(), we also can call set_next_task(). Do this things such that we end up with: static struct task_struct *pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { /* do something else ... */ put_prev_task(rq, prev); /* pick next task p */ set_next_task(rq, p); /* do something else ... */ } put_prev_task() can match set_next_task(), which can make the code more readable. Signed-off-by: Muchun Song <smuchun@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20181026131743.21786-1-smuchun@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-29sched/rt: Update comment in pick_next_task_rt()Muchun Song
Commit: f4ebcbc0d7e0 ("sched/rt: Substract number of tasks of throttled queues from rq->nr_running") added a new rt_rq->rt_queued field, which is used to indicate the status of rq->rt enqueue or dequeue. So, the ->rt_nr_running check was removed and we now check ->rt_queued instead. Fix the comment in pick_next_task_rt() as well, which was still referencing the old logic. Signed-off-by: Muchun Song <smuchun@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20181027030517.23292-1-smuchun@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25Merge branch 'sched/urgent' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-25sched/rt: Restore rt_runtime after disabling RT_RUNTIME_SHAREHailong Liu
NO_RT_RUNTIME_SHARE feature is used to prevent a CPU borrow enough runtime with a spin-rt-task. However, if RT_RUNTIME_SHARE feature is enabled and rt_rq has borrowd enough rt_runtime at the beginning, rt_runtime can't be restored to its initial bandwidth rt_runtime after we disable RT_RUNTIME_SHARE. E.g. on my PC with 4 cores, procedure to reproduce: 1) Make sure RT_RUNTIME_SHARE is enabled cat /sys/kernel/debug/sched_features GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU SIS_PROP NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI RT_RUNTIME_SHARE NO_LB_MIN ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS 2) Start a spin-rt-task ./loop_rr & 3) set affinity to the last cpu taskset -p 8 $pid_of_loop_rr 4) Observe that last cpu have borrowed enough runtime. cat /proc/sched_debug | grep rt_runtime .rt_runtime : 950.000000 .rt_runtime : 900.000000 .rt_runtime : 950.000000 .rt_runtime : 1000.000000 5) Disable RT_RUNTIME_SHARE echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features 6) Observe that rt_runtime can not been restored cat /proc/sched_debug | grep rt_runtime .rt_runtime : 950.000000 .rt_runtime : 900.000000 .rt_runtime : 950.000000 .rt_runtime : 1000.000000 This patch help to restore rt_runtime after we disable RT_RUNTIME_SHARE. Signed-off-by: Hailong Liu <liu.hailong6@zte.com.cn> Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: zhong.weidong@zte.com.cn Link: http://lkml.kernel.org/r/1531874815-39357-1-git-send-email-liu.hailong6@zte.com.cn Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-16sched/core: Use PELT for scale_rt_capacity()Vincent Guittot
The utilization of the CPU by RT, DL and IRQs are now tracked with PELT so we can use these metrics instead of rt_avg to evaluate the remaining capacity available for CFS class. scale_rt_capacity() behavior has been changed and now returns the remaining capacity available for CFS instead of a scaling factor because RT, DL and IRQ provide now absolute utilization value. The same formula as schedutil is used: IRQ util_avg + (1 - IRQ util_avg / max capacity ) * /Sum rq util_avg but the implementation is different because it doesn't return the same value and doesn't benefit of the same optimization. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-10-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-15sched/rt: Add rt_rq utilization trackingVincent Guittot
schedutil governor relies on cfs_rq's util_avg to choose the OPP when CFS tasks are running. When the CPU is overloaded by CFS and RT tasks, CFS tasks are preempted by RT tasks and in this case util_avg reflects the remaining capacity but not what CFS want to use. In such case, schedutil can select a lower OPP whereas the CPU is overloaded. In order to have a more accurate view of the utilization of the CPU, we track the utilization of RT tasks. Only util_avg is correctly tracked but not load_avg and runnable_load_avg which are useless for rt_rq. rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task but they are the same at the root group level, so the PELT windows of the util_sum are aligned. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten.Rasmussen@arm.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: claudio@evidence.eu.com Cc: daniel.lezcano@linaro.org Cc: dietmar.eggemann@arm.com Cc: joel@joelfernandes.org Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: patrick.bellasi@arm.com Cc: quentin.perret@arm.com Cc: rjw@rjwysocki.net Cc: valentin.schneider@arm.com Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/1530200714-4504-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-07-03sched/rt: Fix call to cpufreq_update_util()Vincent Guittot
With commit: 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support") the schedutil governor uses rq->rt.rt_nr_running to detect whether an RT task is currently running on the CPU and to set frequency to max if necessary. cpufreq_update_util() is called in enqueue/dequeue_top_rt_rq() but rq->rt.rt_nr_running has not been updated yet when dequeue_top_rt_rq() is called so schedutil still considers that an RT task is running when the last task is dequeued. The update of rq->rt.rt_nr_running happens later in dequeue_rt_stack(). In fact, we can take advantage of the sequence that the dequeue then re-enqueue rt entities when a rt task is enqueued or dequeued; As a result enqueue_top_rt_rq() is always called when a task is enqueued or dequeued and also when groups are throttled or unthrottled. The only place that not use enqueue_top_rt_rq() is when root rt_rq is throttled. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: efault@gmx.de Cc: juri.lelli@redhat.com Cc: patrick.bellasi@arm.com Cc: viresh.kumar@linaro.org Fixes: 8f111bc357aa ('cpufreq/schedutil: Rewrite CPUFREQ_RT support') Link: http://lkml.kernel.org/r/1530021202-21695-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-06-12treewide: kzalloc() -> kcalloc()Kees Cook
The kzalloc() function has a 2-factor argument form, kcalloc(). This patch replaces cases of: kzalloc(a * b, gfp) with: kcalloc(a * b, gfp) as well as handling cases of: kzalloc(a * b * c, gfp) with: kzalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kzalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kzalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kzalloc( - sizeof(u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kzalloc( - sizeof(char) * COUNT + COUNT , ...) | kzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc + kcalloc ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc(C1 * C2 * C3, ...) | kzalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kzalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc(sizeof(THING) * C2, ...) | kzalloc(sizeof(TYPE) * C2, ...) | kzalloc(C1 * C2 * C3, ...) | kzalloc(C1 * C2, ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kzalloc + kcalloc ( - (E1) * E2 + E1, E2 , ...) | - kzalloc + kcalloc ( - (E1) * (E2) + E1, E2 , ...) | - kzalloc + kcalloc ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
2018-05-18sched/debug: Move the print_rt_rq() and print_dl_rq() declarations to ↵Mathieu Malaterre
kernel/sched/sched.h In the following commit: 6b55c9654fcc ("sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h") the print_cfs_rq() prototype was added to <kernel/sched/sched.h>, right next to the prototypes for print_cfs_stats(), print_rt_stats() and print_dl_stats(). Finish this previous commit and also move related prototypes for print_rt_rq() and print_dl_rq(). Remove existing extern declarations now that they not needed anymore. Silences the following GCC warning, triggered by W=1: kernel/sched/debug.c:573:6: warning: no previous prototype for ‘print_rt_rq’ [-Wmissing-prototypes] kernel/sched/debug.c:603:6: warning: no previous prototype for ‘print_dl_rq’ [-Wmissing-prototypes] Signed-off-by: Mathieu Malaterre <malat@debian.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180516195348.30426-1-malat@debian.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-05sched/core: Simplify helpers for rq clock update skip requestsDavidlohr Bueso
By renaming the functions we can get rid of the skip parameter and have better code redability. It makes zero sense to have things such as: rq_clock_skip_update(rq, false) When the skip request is in fact not going to happen. Ever. Rename things such that we end up with: rq_clock_skip_update(rq) rq_clock_cancel_skipupdate(rq) Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: matt@codeblueprint.co.uk Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/20180404161539.nhadkff2aats74jh@linux-n805 Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-04-05sched/rt: Fix rq->clock_update_flags < RQCF_ACT_SKIP warningDavidlohr Bueso
While running rt-tests' pi_stress program I got the following splat: rq->clock_update_flags < RQCF_ACT_SKIP WARNING: CPU: 27 PID: 0 at kernel/sched/sched.h:960 assert_clock_updated.isra.38.part.39+0x13/0x20 [...] <IRQ> enqueue_top_rt_rq+0xf4/0x150 ? cpufreq_dbs_governor_start+0x170/0x170 sched_rt_rq_enqueue+0x65/0x80 sched_rt_period_timer+0x156/0x360 ? sched_rt_rq_enqueue+0x80/0x80 __hrtimer_run_queues+0xfa/0x260 hrtimer_interrupt+0xcb/0x220 smp_apic_timer_interrupt+0x62/0x120 apic_timer_interrupt+0xf/0x20 </IRQ> [...] do_idle+0x183/0x1e0 cpu_startup_entry+0x5f/0x70 start_secondary+0x192/0x1d0 secondary_startup_64+0xa5/0xb0 We can get rid of it be the "traditional" means of adding an update_rq_clock() call after acquiring the rq->lock in do_sched_rt_period_timer(). The case for the RT task throttling (which this workload also hits) can be ignored in that the skip_update call is actually bogus and quite the contrary (the request bits are removed/reverted). By setting RQCF_UPDATED we really don't care if the skip is happening or not and will therefore make the assert_clock_updated() check happy. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave@stgolabs.net Cc: linux-kernel@vger.kernel.org Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/20180402164954.16255-1-dave@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-09cpufreq/schedutil: Rewrite CPUFREQ_RT supportPeter Zijlstra
Instead of trying to duplicate scheduler state to track if an RT task is running, directly use the scheduler runqueue state for it. This vastly simplifies things and fixes a number of bugs related to sugov and the scheduler getting out of sync wrt this state. As a consequence we not also update the remove cfs/dl state when iterating the shared mask. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-04sched/deadline, rt: Rename queue_push_tasks/queue_pull_task to create ↵Ingo Molnar
separate namespace There are similarly named functions in both of these modules: kernel/sched/deadline.c:static inline void queue_push_tasks(struct rq *rq) kernel/sched/deadline.c:static inline void queue_pull_task(struct rq *rq) kernel/sched/deadline.c:static inline void queue_push_tasks(struct rq *rq) kernel/sched/deadline.c:static inline void queue_pull_task(struct rq *rq) kernel/sched/deadline.c: queue_push_tasks(rq); kernel/sched/deadline.c: queue_pull_task(rq); kernel/sched/deadline.c: queue_push_tasks(rq); kernel/sched/deadline.c: queue_pull_task(rq); kernel/sched/rt.c:static inline void queue_push_tasks(struct rq *rq) kernel/sched/rt.c:static inline void queue_pull_task(struct rq *rq) kernel/sched/rt.c:static inline void queue_push_tasks(struct rq *rq) kernel/sched/rt.c: queue_push_tasks(rq); kernel/sched/rt.c: queue_pull_task(rq); kernel/sched/rt.c: queue_push_tasks(rq); kernel/sched/rt.c: queue_pull_task(rq); ... which makes it harder to grep for them. Prefix them with deadline_ and rt_, respectively. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-04sched/headers: Simplify and clean up header usage in the schedulerIngo Molnar
Do the following cleanups and simplifications: - sched/sched.h already includes <asm/paravirt.h>, so no need to include it in sched/core.c again. - order the <linux/sched/*.h> headers alphabetically - add all <linux/sched/*.h> headers to kernel/sched/sched.h - remove all unnecessary includes from the .c files that are already included in kernel/sched/sched.h. Finally, make all scheduler .c files use a single common header: #include "sched.h" ... which now contains a union of the relied upon headers. This makes the various .c files easier to read and easier to handle. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-03sched: Clean up and harmonize the coding style of the scheduler code baseIngo Molnar
A good number of small style inconsistencies have accumulated in the scheduler core, so do a pass over them to harmonize all these details: - fix speling in comments, - use curly braces for multi-line statements, - remove unnecessary parentheses from integer literals, - capitalize consistently, - remove stray newlines, - add comments where necessary, - remove invalid/unnecessary comments, - align structure definitions and other data types vertically, - add missing newlines for increased readability, - fix vertical tabulation where it's misaligned, - harmonize preprocessor conditional block labeling and vertical alignment, - remove line-breaks where they uglify the code, - add newline after local variable definitions, No change in functionality: md5: 1191fa0a890cfa8132156d2959d7e9e2 built-in.o.before.asm 1191fa0a890cfa8132156d2959d7e9e2 built-in.o.after.asm Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-21sched/isolation: Offload residual 1Hz scheduler tickFrederic Weisbecker
When a CPU runs in full dynticks mode, a 1Hz tick remains in order to keep the scheduler stats alive. However this residual tick is a burden for bare metal tasks that can't stand any interruption at all, or want to minimize them. The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now outsource these scheduler ticks to the global workqueue so that a housekeeping CPU handles those remotely. The sched_class::task_tick() implementations have been audited and look safe to be called remotely as the target runqueue and its current task are passed in parameter and don't seem to be accessed locally. Note that in the case of using isolcpus, it's still up to the user to affine the global workqueues to the housekeeping CPUs through /sys/devices/virtual/workqueue/cpumask or domains isolation "isolcpus=nohz,domain". Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Link: http://lkml.kernel.org/r/1519186649-3242-6-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-13sched/rt: Make update_curr_rt() more accurateWen Yang
rq->clock_task may be updated between the two calls of rq_clock_task() in update_curr_rt(). Calling rq_clock_task() only once makes it more accurate and efficient, taking update_curr() as reference. Signed-off-by: Wen Yang <wen.yang99@zte.com.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: zhong.weidong@zte.com.cn Link: http://lkml.kernel.org/r/1517882008-44552-1-git-send-email-wen.yang99@zte.com.cn Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-06sched/rt: Make update_curr_rt() more accurateWen Yang
rq->clock_task may be updated between the two calls of rq_clock_task() in update_curr_rt(). Calling rq_clock_task() only once makes it more accurate and efficient, taking update_curr() as reference. Signed-off-by: Wen Yang <wen.yang99@zte.com.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: zhong.weidong@zte.com.cn Link: http://lkml.kernel.org/r/1517800721-42092-1-git-send-email-wen.yang99@zte.com.cn Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-06sched/rt: Up the root domain ref count when passing it around via IPIsSteven Rostedt (VMware)
When issuing an IPI RT push, where an IPI is sent to each CPU that has more than one RT task scheduled on it, it references the root domain's rto_mask, that contains all the CPUs within the root domain that has more than one RT task in the runable state. The problem is, after the IPIs are initiated, the rq->lock is released. This means that the root domain that is associated to the run queue could be freed while the IPIs are going around. Add a sched_get_rd() and a sched_put_rd() that will increment and decrement the root domain's ref count respectively. This way when initiating the IPIs, the scheduler will up the root domain's ref count before releasing the rq->lock, ensuring that the root domain does not go away until the IPI round is complete. Reported-by: Pavan Kondeti <pkondeti@codeaurora.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic") Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-02-06sched/rt: Use container_of() to get root domain in rto_push_irq_work_func()Steven Rostedt (VMware)
When the rto_push_irq_work_func() is called, it looks at the RT overloaded bitmask in the root domain via the runqueue (rq->rd). The problem is that during CPU up and down, nothing here stops rq->rd from changing between taking the rq->rd->rto_lock and releasing it. That means the lock that is released is not the same lock that was taken. Instead of using this_rq()->rd to get the root domain, as the irq work is part of the root domain, we can simply get the root domain from the irq work that is passed to the routine: container_of(work, struct root_domain, rto_push_work) This keeps the root domain consistent. Reported-by: Pavan Kondeti <pkondeti@codeaurora.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic") Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-03Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - Updates to use cond_resched() instead of cond_resched_rcu_qs() where feasible (currently everywhere except in kernel/rcu and in kernel/torture.c). Also a couple of fixes to avoid sending IPIs to offline CPUs. - Updates to simplify RCU's dyntick-idle handling. - Updates to remove almost all uses of smp_read_barrier_depends() and read_barrier_depends(). - Miscellaneous fixes. - Torture-test updates. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-15sched/rt: Do not pull from current CPU if only one CPU to pullSteven Rostedt
Daniel Wagner reported a crash on the BeagleBone Black SoC. This is a single CPU architecture, and does not have a functional arch_send_call_function_single_ipi() implementation which can crash the kernel if that is called. As it only has one CPU, it shouldn't be called, but if the kernel is compiled for SMP, the push/pull RT scheduling logic now calls it for irq_work if the one CPU is overloaded, it can use that function to call itself and crash the kernel. Ideally, we should disable the SCHED_FEAT(RT_PUSH_IPI) if the system only has a single CPU. But SCHED_FEAT is a constant if sched debugging is turned off. Another fix can also be used, and this should also help with normal SMP machines. That is, do not initiate the pull code if there's only one RT overloaded CPU, and that CPU happens to be the current CPU that is scheduling in a lower priority task. Even on a system with many CPUs, if there's many RT tasks waiting to run on a single CPU, and that CPU schedules in another RT task of lower priority, it will initiate the PULL logic in case there's a higher priority RT task on another CPU that is waiting to run. But if there is no other CPU with waiting RT tasks, it will initiate the RT pull logic on itself (as it still has RT tasks waiting to run). This is a wasted effort. Not only does this help with SMP code where the current CPU is the only one with RT overloaded tasks, it should also solve the issue that Daniel encountered, because it will prevent the PULL logic from executing, as there's only one CPU on the system, and the check added here will cause it to exit the RT pull code. Reported-by: Daniel Wagner <wagi@monom.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-rt-users <linux-rt-users@vger.kernel.org> Cc: stable@vger.kernel.org Fixes: 4bdced5c9 ("sched/rt: Simplify the IPI based RT balancing logic") Link: http://lkml.kernel.org/r/20171202130454.4cbbfe8d@vmware.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-11-28sched: Stop switched_to_rt() from sending IPIs to offline CPUsPaul E. McKenney
The rcutorture test suite occasionally provokes a splat due to invoking rt_mutex_lock() which needs to boost the priority of a task currently sitting on a runqueue that belongs to an offline CPU: WARNING: CPU: 0 PID: 12 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40 Modules linked in: CPU: 0 PID: 12 Comm: rcub/7 Not tainted 4.14.0-rc4+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 task: ffff9ed3de5f8cc0 task.stack: ffffbbf80012c000 RIP: 0010:native_smp_send_reschedule+0x37/0x40 RSP: 0018:ffffbbf80012fd10 EFLAGS: 00010082 RAX: 000000000000002f RBX: ffff9ed3dd9cb300 RCX: 0000000000000004 RDX: 0000000080000004 RSI: 0000000000000086 RDI: 00000000ffffffff RBP: ffffbbf80012fd10 R08: 000000000009da7a R09: 0000000000007b9d R10: 0000000000000001 R11: ffffffffbb57c2cd R12: 000000000000000d R13: ffff9ed3de5f8cc0 R14: 0000000000000061 R15: ffff9ed3ded59200 FS: 0000000000000000(0000) GS:ffff9ed3dea00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000080686f0 CR3: 000000001b9e0000 CR4: 00000000000006f0 Call Trace: resched_curr+0x61/0xd0 switched_to_rt+0x8f/0xa0 rt_mutex_setprio+0x25c/0x410 task_blocks_on_rt_mutex+0x1b3/0x1f0 rt_mutex_slowlock+0xa9/0x1e0 rt_mutex_lock+0x29/0x30 rcu_boost_kthread+0x127/0x3c0 kthread+0x104/0x140 ? rcu_report_unblock_qs_rnp+0x90/0x90 ? kthread_create_on_node+0x40/0x40 ret_from_fork+0x22/0x30 Code: f0 00 0f 92 c0 84 c0 74 14 48 8b 05 34 74 c5 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 a0 c6 fc b9 e8 d5 b5 06 00