From 19209bbb8612004bc20a1f70ff12926f99fe2643 Mon Sep 17 00:00:00 2001
From: "Srivatsa S. Bhat"
Date: Mon, 30 Apr 2012 12:26:56 +0530
Subject: x86/sched: Make mwait_usable() heed to "idle=" kernel parameters properly

The checks that exist in mwait_usable() for "idle=" kernel parameters are
insufficient. As a result, mwait_usable() can return 1 even if "idle=nomwait"
or "idle=poll" or "idle=halt" parameters are passed.

Of these cases, incorrect handling of idle=nomwait is a universal problem
since mwait can get used for usual CPU idling. However the rest of the cases
are problematic only during CPU Hotplug (offline) because, in the CPU offline
path, the function mwait_play_dead() is called, which might result in mwait
being used in the offline CPUs, if mwait_usable() happens to return 1.

Fix these issues by checking for the boot time "idle=" kernel parameter
properly in mwait_usable().

The first issue (usual cpu idling) is demonstrated below:

Before applying the patch (dmesg snippet):

[ 0.000000] Command line: [...] idle=nomwait
[ 0.000000] Kernel command line: [...] idle=nomwait
[ 0.000000] RCU dyntick-idle grace-period acceleration is enabled.
[ 0.140606] using mwait in idle threads. <======= mwait being used
[ 4.303986] cpuidle: using governor ladder
[ 4.308232] cpuidle: using governor menu

After applying the patch:

[ 0.000000] Command line: [...] idle=nomwait
[ 0.000000] Kernel command line: [...] idle=nomwait
[ 0.000000] RCU dyntick-idle grace-period acceleration is enabled.
[ 4.264100] cpuidle: using governor ladder
[ 4.268342] cpuidle: using governor menu

Signed-off-by: Srivatsa S. Bhat
Acked-by: Deepthi Dharwar
Acked-by: Thomas Gleixner
Cc: venki@google.com
Cc: suresh.b.siddha@intel.com
Cc: Borislav Petkov
Cc: lenb@kernel.org
Cc: Rafael J. Wysocki
Link: http://lkml.kernel.org/r/4F9E37B8.30001@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar
---
 arch/x86/kernel/process.c | 8 ++++++++
 1 file changed, 8 insertions(+)

(limited to 'arch')

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 1d92a5ab6e8b..ad57d832d96f 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -594,9 +594,17 @@ int mwait_usable(const struct cpuinfo_x86 *c)
 {
 	u32 eax, ebx, ecx, edx;

+	/* Use mwait if idle=mwait boot option is given */
 	if (boot_option_idle_override == IDLE_FORCE_MWAIT)
 		return 1;

+	/*
+	 * Any idle= boot option other than idle=mwait means that we must not
+	 * use mwait. Eg: idle=halt or idle=poll or idle=nomwait
+	 */
+	if (boot_option_idle_override != IDLE_NO_OVERRIDE)
+		return 0;
+
 	if (c->cpuid_level < MWAIT_INFO)
 		return 0;

--
cgit v1.2.3

From 94c0dd3278dd3eae52eabf0fb77d472d0dd3e373 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra
Date: Wed, 18 Apr 2012 19:04:17 +0200
Subject: x86/numa: Allow specifying node_distance() for numa=fake

Allows emulating more interesting NUMA configurations like a quad
socket AMD Magny-Cour:

 "numa=fake=8:10,16,16,22,16,22,16,22,
              16,10,22,16,22,16,22,16,
              16,22,10,16,16,22,16,22,
              22,16,16,10,22,16,22,16,
              16,22,16,22,10,16,16,22,
              22,16,22,16,16,10,22,16,
              16,22,16,22,16,22,10,16,
              22,16,22,16,22,16,16,10"

Which has a non-fully-connected topology.
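
For illustration only (not part of this patch): a minimal userspace sketch
of how a string in the format above could be split into a node count and a
row-major distance table. parse_fake_distances() and MAX_NODES are made-up
names for this sketch; the kernel code itself relies on simple_strtoul()
and get_option() instead.

```c
#include <stdlib.h>

#define MAX_NODES 8	/* illustrative limit, not a kernel constant */

/*
 * Parse "N" or "N:d,d,d,..." (the text after "numa=fake=").
 * Stores the node count in *nodes and fills dist[][] row by row.
 * Returns the number of distances read (0 if no list was given),
 * or -1 on malformed input. Hypothetical helper for this sketch.
 */
int parse_fake_distances(const char *s, int *nodes,
			 int dist[MAX_NODES][MAX_NODES])
{
	char *end;
	unsigned long n = strtoul(s, &end, 0);
	int count = 0;

	if (end == s || n == 0 || n > MAX_NODES)
		return -1;
	*nodes = (int)n;

	if (*end != ':')
		return 0;	/* node count only, no distance list */
	s = end + 1;

	while (*s && count < *nodes * *nodes) {
		long d = strtol(s, &end, 0);

		if (end == s)
			return -1;	/* not a number */
		dist[count / *nodes][count % *nodes] = (int)d;
		count++;
		s = end;
		while (*s == ',' || *s == ' ')	/* skip separators */
			s++;
	}
	return count;
}
```

With "2:10,20,20,10" this yields two nodes and the table {10,20},{20,10};
entries not covered by the list are left untouched, so a caller would
still need its own defaults for them.
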

Signed-off-by: Peter Zijlstra
Cc: Tejun Heo
Cc: Yinghai Lu
Cc: x86@kernel.org
Link: http://lkml.kernel.org/n/tip-e1136ef7kdffj7yf9tjhydln@git.kernel.org
Signed-off-by: Ingo Molnar
---
 arch/x86/mm/numa_emulation.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

(limited to 'arch')

diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index 53489ff6bf82..871dd8868170 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -339,9 +339,11 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 	} else {
 		unsigned long n;

-		n = simple_strtoul(emu_cmdline, NULL, 0);
+		n = simple_strtoul(emu_cmdline, &emu_cmdline, 0);
 		ret = split_nodes_interleave(&ei, &pi, 0, max_addr, n);
 	}
+	if (*emu_cmdline == ':')
+		emu_cmdline++;

 	if (ret < 0)
 		goto no_emu;
@@ -418,7 +420,9 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 			int physj = emu_nid_to_phys[j];
 			int dist;

-			if (physi >= numa_dist_cnt || physj >= numa_dist_cnt)
+			if (get_option(&emu_cmdline, &dist) == 2)
+				;
+			else if (physi >= numa_dist_cnt || physj >= numa_dist_cnt)
 				dist = physi == physj ?
 					LOCAL_DISTANCE : REMOTE_DISTANCE;
 			else
--
cgit v1.2.3

From 0acbb440f06302058e1515861dd534594521e892 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra
Date: Wed, 18 Apr 2012 19:04:09 +0200
Subject: x86/numa: Hard partition cpu topology masks on node boundaries

When using numa=fake= you can get weird topologies where LLCs can span
nodes and other such nonsense. Cure this by hard partitioning these masks
on node boundaries.
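
For illustration only (not part of this patch): a small userspace model of
what "hard partitioning" means here, assuming one bit per CPU and at most
32 CPUs. llc_mask_for() and its node_of[]/llc_of[] arrays are invented
for the sketch and stand in for cpu_to_node() and cpu_llc_id.

```c
/*
 * Build the set of CPUs that share a last level cache with 'cpu',
 * but never link CPUs that sit on different (possibly fake) nodes.
 * Bit i of the result is set when CPU i is in the mask.
 */
unsigned int llc_mask_for(int cpu, int ncpus,
			  const int *node_of, const int *llc_of)
{
	unsigned int mask = 0;
	int i;

	for (i = 0; i < ncpus; i++) {
		/* the hard partition: skip anything on another node */
		if (node_of[cpu] != node_of[i])
			continue;

		if (llc_of[cpu] == llc_of[i])
			mask |= 1u << i;
	}
	return mask;
}
```

With four CPUs behind one physical LLC and numa=fake splitting them into
nodes {0,0,1,1}, CPU 0 ends up with mask 0x3 and CPU 3 with 0xc, instead of
both seeing 0xf.
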

Signed-off-by: Peter Zijlstra
Cc: Tejun Heo
Cc: Yinghai Lu
Cc: x86@kernel.org
Link: http://lkml.kernel.org/n/tip-di5vwjm96q5vrb76opwuflwx@git.kernel.org
Signed-off-by: Ingo Molnar
---
 arch/x86/kernel/smpboot.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

(limited to 'arch')

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 6e1e406038c2..edfd03a9e390 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -337,6 +337,11 @@ void __cpuinit set_cpu_sibling_map(int cpu)
 		for_each_cpu(i, cpu_sibling_setup_mask) {
 			struct cpuinfo_x86 *o = &cpu_data(i);

+#ifdef CONFIG_NUMA_EMU
+			if (cpu_to_node(cpu) != cpu_to_node(i))
+				continue;
+#endif
+
 			if (cpu_has(c, X86_FEATURE_TOPOEXT)) {
 				if (c->phys_proc_id == o->phys_proc_id &&
 				    per_cpu(cpu_llc_id, cpu) == per_cpu(cpu_llc_id, i) &&
@@ -360,11 +365,17 @@ void __cpuinit set_cpu_sibling_map(int cpu)
 	}

 	for_each_cpu(i, cpu_sibling_setup_mask) {
+#ifdef CONFIG_NUMA_EMU
+		if (cpu_to_node(cpu) != cpu_to_node(i))
+			continue;
+#endif
+
 		if (per_cpu(cpu_llc_id, cpu) != BAD_APICID &&
 		    per_cpu(cpu_llc_id, cpu) == per_cpu(cpu_llc_id, i)) {
 			cpumask_set_cpu(i, cpu_llc_shared_mask(cpu));
 			cpumask_set_cpu(cpu, cpu_llc_shared_mask(i));
 		}
+
 		if (c->phys_proc_id == cpu_data(i).phys_proc_id) {
 			cpumask_set_cpu(i, cpu_core_mask(cpu));
 			cpumask_set_cpu(cpu, cpu_core_mask(i));
--
cgit v1.2.3

From ad7687dde8780a0d618a3e3b5a62bb383696fc22 Mon Sep 17 00:00:00 2001
From: Ingo Molnar
Date: Wed, 9 May 2012 13:31:47 +0200
Subject: x86/numa: Check for nonsensical topologies on real hw as well

Instead of only checking nonsensical topologies on numa-emu, do it on real
hardware as well, and print a warning.

Acked-by: Peter Zijlstra
Cc: Tejun Heo
Cc: Yinghai Lu
Cc: x86@kernel.org
Link: http://lkml.kernel.org/n/tip-re15l0jqjtpz709oxozt2zoh@git.kernel.org
Signed-off-by: Ingo Molnar
---
 arch/x86/kernel/smpboot.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

(limited to 'arch')

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index edfd03a9e390..7c53d96d44ab 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -337,10 +337,10 @@ void __cpuinit set_cpu_sibling_map(int cpu)
 		for_each_cpu(i, cpu_sibling_setup_mask) {
 			struct cpuinfo_x86 *o = &cpu_data(i);

-#ifdef CONFIG_NUMA_EMU
-			if (cpu_to_node(cpu) != cpu_to_node(i))
+			if (cpu_to_node(cpu) != cpu_to_node(i)) {
+				WARN_ONCE(1, "sched: CPU #%d's thread-sibling CPU #%d not on the same node! [node %d != %d]. Ignoring sibling dependency.\n", cpu, i, cpu_to_node(cpu), cpu_to_node(i));
 				continue;
-#endif
+			}

 			if (cpu_has(c, X86_FEATURE_TOPOEXT)) {
 				if (c->phys_proc_id == o->phys_proc_id &&
@@ -365,10 +365,10 @@ void __cpuinit set_cpu_sibling_map(int cpu)
 	}

 	for_each_cpu(i, cpu_sibling_setup_mask) {
-#ifdef CONFIG_NUMA_EMU
-		if (cpu_to_node(cpu) != cpu_to_node(i))
+		if (cpu_to_node(cpu) != cpu_to_node(i)) {
+			WARN_ONCE(1, "sched: CPU #%d's core-sibling CPU #%d not on the same node! [node %d != %d]. Ignoring sibling dependency.\n", cpu, i, cpu_to_node(cpu), cpu_to_node(i));
 			continue;
-#endif
+		}

 		if (per_cpu(cpu_llc_id, cpu) != BAD_APICID &&
 		    per_cpu(cpu_llc_id, cpu) == per_cpu(cpu_llc_id, i)) {
--
cgit v1.2.3

From cb83b629bae0327cf9f44f096adc38d150ceb913 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra
Date: Tue, 17 Apr 2012 15:49:36 +0200
Subject: sched/numa: Rewrite the CONFIG_NUMA sched domain support

The current code groups up to 16 nodes in a level and then puts an
ALLNODES domain spanning the entire tree on top of that. This doesn't
reflect the numa topology and esp for the smaller not-fully-connected
machines out there today this might make a difference.

Therefore, build a proper numa topology based on node_distance().

Since there's no fixed numa layers anymore, the static SD_NODE_INIT
and SD_ALLNODES_INIT aren't usable anymore, the new code tries to
construct something similar and scales some values either on the
number of cpus in the domain and/or the node_distance() ratio.

Signed-off-by: Peter Zijlstra
Cc: Anton Blanchard
Cc: Benjamin Herrenschmidt
Cc: Chris Metcalf
Cc: David Howells
Cc: "David S. Miller"
Cc: Fenghua Yu
Cc: "H. Peter Anvin"
Cc: Ivan Kokshaysky
Cc: linux-alpha@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-sh@vger.kernel.org
Cc: Matt Turner
Cc: Paul Mackerras
Cc: Paul Mundt
Cc: Ralf Baechle
Cc: Richard Henderson
Cc: sparclinux@vger.kernel.org
Cc: Tony Luck
Cc: x86@kernel.org
Cc: Dimitri Sivanich
Cc: Greg Pearson
Cc: KAMEZAWA Hiroyuki
Cc: bob.picco@oracle.com
Cc: chris.mason@oracle.com
Cc: Linus Torvalds
Cc: Andrew Morton
Link: http://lkml.kernel.org/n/tip-r74n3n8hhuc2ynbrnp3vt954@git.kernel.org
Signed-off-by: Ingo Molnar
---
 arch/ia64/include/asm/topology.h           | 25 --------------------
 arch/mips/include/asm/mach-ip27/topology.h | 17 -------------
 arch/powerpc/include/asm/topology.h        | 36 ----------------------------
 arch/sh/include/asm/topology.h             | 25 --------------------
 arch/sparc/include/asm/topology_64.h       | 19 ---------------
 arch/tile/include/asm/topology.h           | 26 --------------------
 arch/x86/include/asm/topology.h            | 38 ------------------------------
 7 files changed, 186 deletions(-)

(limited to 'arch')

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index 09f646753d1a..a2496e449b75 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -70,31 +70,6 @@ void build_cpu_to_node_map(void);
 	.nr_balance_failed = 0, \
 }

-/* sched_domains SD_NODE_INIT for IA64 NUMA machines */
-#define SD_NODE_INIT (struct sched_domain) { \
-	.parent = NULL, \
-	.child = NULL, \
-	.groups = NULL, \
-	.min_interval = 8, \
-	.max_interval = 8*(min(num_online_cpus(), 32U)), \
-	.busy_factor = 64, \
-	.imbalance_pct = 125, \
-	.cache_nice_tries = 2, \
-	.busy_idx = 3, \
-	.idle_idx = 2, \
-	.newidle_idx = 0, \
-	.wake_idx = 0, \
-	.forkexec_idx = 0, \
-	.flags = SD_LOAD_BALANCE \
-		| SD_BALANCE_NEWIDLE \
-		| SD_BALANCE_EXEC \
-		| SD_BALANCE_FORK \
-		| SD_SERIALIZE, \
-	.last_balance = jiffies, \
-	.balance_interval = 64, \
-	.nr_balance_failed = 0, \
-}
-
 #endif /* CONFIG_NUMA */

 #ifdef CONFIG_SMP
diff --git a/arch/mips/include/asm/mach-ip27/topology.h b/arch/mips/include/asm/mach-ip27/topology.h
index 1b1a7d1632b9..b2cf641f206f 100644
--- a/arch/mips/include/asm/mach-ip27/topology.h
+++ b/arch/mips/include/asm/mach-ip27/topology.h
@@ -36,23 +36,6 @@ extern unsigned char __node_distances[MAX_COMPACT_NODES][MAX_COMPACT_NODES];

 #define node_distance(from, to) (__node_distances[(from)][(to)])

-/* sched_domains SD_NODE_INIT for SGI IP27 machines */
-#define SD_NODE_INIT (struct sched_domain) { \
-	.parent = NULL, \
-	.child = NULL, \
-	.groups = NULL, \
-	.min_interval = 8, \
-	.max_interval = 32, \
-	.busy_factor = 32, \
-	.imbalance_pct = 125, \
-	.cache_nice_tries = 1, \
-	.flags = SD_LOAD_BALANCE | \
-		 SD_BALANCE_EXEC, \
-	.last_balance = jiffies, \
-	.balance_interval = 1, \
-	.nr_balance_failed = 0, \
-}
-
 #include

 #endif /* _ASM_MACH_TOPOLOGY_H */
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index c97185885c6d..852ed1b384f6 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -18,12 +18,6 @@ struct device_node;
  */
 #define RECLAIM_DISTANCE 10

-/*
- * Avoid creating an extra level of balancing (SD_ALLNODES) on the largest
- * POWER7 boxes which have a maximum of 32 nodes.
- */
-#define SD_NODES_PER_DOMAIN 32
-
 #include

 static inline int cpu_to_node(int cpu)
@@ -51,36 +45,6 @@ static inline int pcibus_to_node(struct pci_bus *bus)
 	cpu_all_mask : \
 	cpumask_of_node(pcibus_to_node(bus)))

-/* sched_domains SD_NODE_INIT for PPC64 machines */
-#define SD_NODE_INIT (struct sched_domain) { \
-	.min_interval = 8, \
-	.max_interval = 32, \
-	.busy_factor = 32, \
-	.imbalance_pct = 125, \
-	.cache_nice_tries = 1, \
-	.busy_idx = 3, \
-	.idle_idx = 1, \
-	.newidle_idx = 0, \
-	.wake_idx = 0, \
-	.forkexec_idx = 0, \
-	\
-	.flags = 1*SD_LOAD_BALANCE \
-		| 0*SD_BALANCE_NEWIDLE \
-		| 1*SD_BALANCE_EXEC \
-		| 1*SD_BALANCE_FORK \
-		| 0*SD_BALANCE_WAKE \
-		| 1*SD_WAKE_AFFINE \
-		| 0*SD_PREFER_LOCAL \
-		| 0*SD_SHARE_CPUPOWER \
-		| 0*SD_POWERSAVINGS_BALANCE \
-		| 0*SD_SHARE_PKG_RESOURCES \
-		| 1*SD_SERIALIZE \
-		| 0*SD_PREFER_SIBLING \
-		, \
-	.last_balance = jiffies, \
-	.balance_interval = 1, \
-}
-
 extern int __node_distance(int, int);
 #define node_distance(a, b) __node_distance(a, b)

diff --git a/arch/sh/include/asm/topology.h b/arch/sh/include/asm/topology.h
index 88e734069fa6..b0a282d65f6a 100644
--- a/arch/sh/include/asm/topology.h
+++ b/arch/sh/include/asm/topology.h
@@ -3,31 +3,6 @@

 #ifdef CONFIG_NUMA

-/* sched_domains SD_NODE_INIT for sh machines */
-#define SD_NODE_INIT (struct sched_domain) { \
-	.parent = NULL, \
-	.child = NULL, \
-	.groups = NULL, \
-	.min_interval = 8, \
-	.max_interval = 32, \
-	.busy_factor = 32, \
-	.imbalance_pct = 125, \
-	.cache_nice_tries = 2, \
-	.busy_idx = 3, \
-	.idle_idx = 2, \
-	.newidle_idx = 0, \
-	.wake_idx = 0, \
-	.forkexec_idx = 0, \
-	.flags = SD_LOAD_BALANCE \
-		| SD_BALANCE_FORK \
-		| SD_BALANCE_EXEC \
-		| SD_BALANCE_NEWIDLE \
-		| SD_SERIALIZE, \
-	.last_balance = jiffies, \
-	.balance_interval = 1, \
-	.nr_balance_failed = 0, \
-}
-
 #define cpu_to_node(cpu) ((void)(cpu),0)
 #define parent_node(node) ((void)(node),0)

diff --git a/arch/sparc/include/asm/topology_64.h b/arch/sparc/include/asm/topology_64.h
index 8b9c556d630b..1754390a426f 100644
--- a/arch/sparc/include/asm/topology_64.h
+++ b/arch/sparc/include/asm/topology_64.h
@@ -31,25 +31,6 @@ static inline int pcibus_to_node(struct pci_bus *pbus)
 	cpu_all_mask : \
 	cpumask_of_node(pcibus_to_node(bus)))

-#define SD_NODE_INIT (struct sched_domain) { \
-	.min_interval = 8, \
-	.max_interval = 32, \
-	.busy_factor = 32, \
-	.imbalance_pct = 125, \
-	.cache_nice_tries = 2, \
-	.busy_idx = 3, \
-	.idle_idx = 2, \
-	.newidle_idx = 0, \
-	.wake_idx = 0, \
-	.forkexec_idx = 0, \
-	.flags = SD_LOAD_BALANCE \
-		| SD_BALANCE_FORK \
-		| SD_BALANCE_EXEC \
-		| SD_SERIALIZE, \
-	.last_balance = jiffies, \
-	.balance_interval = 1, \
-}
-
 #else /* CONFIG_NUMA */

 #include
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h
index 6fdd0c860193..7a7ce390534f 100644
--- a/arch/tile/include/asm/topology.h
+++ b/arch/tile/include/asm/topology.h
@@ -78,32 +78,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
 	.balance_interval = 32, \
 }

-/* sched_domains SD_NODE_INIT for TILE architecture */
-#define SD_NODE_INIT (struct sched_domain) { \
-	.min_interval = 16, \
-	.max_interval = 512, \
-	.busy_factor = 32, \
-	.imbalance_pct = 125, \
-	.cache_nice_tries = 1, \
-	.busy_idx = 3, \
-	.idle_idx = 1, \
-	.newidle_idx = 2, \
-	.wake_idx = 1, \
-	.flags = 1*SD_LOAD_BALANCE \
-		| 1*SD_BALANCE_NEWIDLE \
-		| 1*SD_BALANCE_EXEC \
-		| 1*SD_BALANCE_FORK \
-		| 0*SD_BALANCE_WAKE \
-		| 0*SD_WAKE_AFFINE \
-		| 0*SD_PREFER_LOCAL \
-		| 0*SD_SHARE_CPUPOWER \
-		| 0*SD_SHARE_PKG_RESOURCES \
-		| 1*SD_SERIALIZE \
-		, \
-	.last_balance = jiffies, \
-	.balance_interval = 128, \
-}
-
 /* By definition, we create nodes based on online memory. */
 #define node_has_online_mem(nid) 1

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index b9676ae37ada..095b21507b6a 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -92,44 +92,6 @@ extern void setup_node_to_cpumask_map(void);

 #define pcibus_to_node(bus) __pcibus_to_node(bus)

-#ifdef CONFIG_X86_32
-# define SD_CACHE_NICE_TRIES 1
-# define SD_IDLE_IDX 1
-#else
-# define SD_CACHE_NICE_TRIES 2
-# define SD_IDLE_IDX 2
-#endif
-
-/* sched_domains SD_NODE_INIT for NUMA machines */
-#define SD_NODE_INIT (struct sched_domain) { \
-	.min_interval = 8, \
-	.max_interval = 32, \
-	.busy_factor = 32, \
-	.imbalance_pct = 125, \
-	.cache_nice_tries = SD_CACHE_NICE_TRIES, \
-	.busy_idx = 3, \
-	.idle_idx = SD_IDLE_IDX, \
-	.newidle_idx = 0, \
-	.wake_idx = 0, \
-	.forkexec_idx = 0, \
-	\
-	.flags = 1*SD_LOAD_BALANCE \
-		| 1*SD_BALANCE_NEWIDLE \
-		| 1*SD_BALANCE_EXEC \
-		| 1*SD_BALANCE_FORK \
-		| 0*SD_BALANCE_WAKE \
-		| 1*SD_WAKE_AFFINE \
-		| 0*SD_PREFER_LOCAL \
-		| 0*SD_SHARE_CPUPOWER \
-		| 0*SD_POWERSAVINGS_BALANCE \
-		| 0*SD_SHARE_PKG_RESOURCES \
-		| 1*SD_SERIALIZE \
-		| 0*SD_PREFER_SIBLING \
-		, \
-	.last_balance = jiffies, \
-	.balance_interval = 1, \
-}
-
 extern int __node_distance(int, int);
 #define node_distance(a, b) __node_distance(a, b)

--
cgit v1.2.3

From 316ad248307fba13be40f01e92a22b89457c32bc Mon Sep 17 00:00:00 2001
From: Peter Zijlstra
Date: Fri, 11 May 2012 13:05:59 +0200
Subject: sched/x86: Rewrite set_cpu_sibling_map()

Commit ad7687dde ("x86/numa: Check for nonsensical topologies on real
hw as well") is broken in that the condition can trigger for valid
setups but only changes the end result for invalid setups with no real
means of discerning between those.

Rewrite set_cpu_sibling_map() to make the code clearer and make sure
to only warn when the check changes the end result.
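
For illustration only (not part of this patch): a rough userspace model of
"only warn when the check changes the end result". The node comparison is
reached only after two CPUs have already matched as siblings, so a valid
setup whose CPUs simply live on different nodes never warns. struct
cpu_model, topology_sane_model() and match_smt_model() are invented
names for this sketch and cover the non-TOPOEXT case only.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the few per-CPU fields used below. */
struct cpu_model {
	int index;	/* CPU number */
	int node;	/* NUMA node the CPU belongs to */
	int pkg;	/* physical package id */
	int core;	/* core id within the package */
};

int warnings_printed;	/* set to 1 once the message has been emitted */

/* Reached only for CPUs that already matched; warns once on a mismatch. */
bool topology_sane_model(const struct cpu_model *c,
			 const struct cpu_model *o, const char *name)
{
	if (c->node == o->node)
		return true;

	if (!warnings_printed) {
		fprintf(stderr,
			"CPU #%d's %s-sibling CPU #%d is not on the same node\n",
			c->index, name, o->index);
		warnings_printed = 1;
	}
	return false;	/* ignore the dependency */
}

/* Same package and same core means the two CPUs are SMT siblings. */
bool match_smt_model(const struct cpu_model *c, const struct cpu_model *o)
{
	if (c->pkg == o->pkg && c->core == o->core)
		return topology_sane_model(c, o, "smt");

	return false;	/* no match: the node check is never consulted */
}
```

Two CPUs in different packages on different nodes return false without any
message; two CPUs reporting the same package and core but different nodes
return false and print the message once.
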

Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/n/tip-klcwahu3gx467uhfiqjyhdcs@git.kernel.org
Signed-off-by: Ingo Molnar
---
 arch/x86/kernel/smpboot.c | 112 +++++++++++++++++++++++++++-------------------
 1 file changed, 66 insertions(+), 46 deletions(-)

(limited to 'arch')

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 7c53d96d44ab..e84c1bbea339 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -315,70 +315,90 @@ void __cpuinit smp_store_cpu_info(int id)
 	identify_secondary_cpu(c);
 }

-static void __cpuinit link_thread_siblings(int cpu1, int cpu2)
+static bool __cpuinit
+topology_sane(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o, const char *name)
 {
-	cpumask_set_cpu(cpu1, cpu_sibling_mask(cpu2));
-	cpumask_set_cpu(cpu2, cpu_sibling_mask(cpu1));
-	cpumask_set_cpu(cpu1, cpu_core_mask(cpu2));
-	cpumask_set_cpu(cpu2, cpu_core_mask(cpu1));
-	cpumask_set_cpu(cpu1, cpu_llc_shared_mask(cpu2));
-	cpumask_set_cpu(cpu2, cpu_llc_shared_mask(cpu1));
+	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
+
+	return !WARN_ONCE(cpu_to_node(cpu1) != cpu_to_node(cpu2),
+		"sched: CPU #%d's %s-sibling CPU #%d is not on the same node! "
+		"[node: %d != %d]. Ignoring dependency.\n",
+		cpu1, name, cpu2, cpu_to_node(cpu1), cpu_to_node(cpu2));
 }

+#define link_mask(_m, c1, c2) \
+do { \
+	cpumask_set_cpu((c1), cpu_##_m##_mask(c2)); \
+	cpumask_set_cpu((c2), cpu_##_m##_mask(c1)); \
+} while (0)

-void __cpuinit set_cpu_sibling_map(int cpu)
+static bool __cpuinit match_smt(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 {
-	int i;
-	struct cpuinfo_x86 *c = &cpu_data(cpu);
+	if (cpu_has(c, X86_FEATURE_TOPOEXT)) {
+		int cpu1 = c->cpu_index, cpu2 = o->cpu_index;

-	cpumask_set_cpu(cpu, cpu_sibling_setup_mask);
+		if (c->phys_proc_id == o->phys_proc_id &&
+		    per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2) &&
+		    c->compute_unit_id == o->compute_unit_id)
+			return topology_sane(c, o, "smt");

-	if (smp_num_siblings > 1) {
-		for_each_cpu(i, cpu_sibling_setup_mask) {
-			struct cpuinfo_x86 *o = &cpu_data(i);
+	} else if (c->phys_proc_id == o->phys_proc_id &&
+		   c->cpu_core_id == o->cpu_core_id) {
+		return topology_sane(c, o, "smt");
+	}

-			if (cpu_to_node(cpu) != cpu_to_node(i)) {
-				WARN_ONCE(1, "sched: CPU #%d's thread-sibling CPU #%d not on the same node! [node %d != %d]. Ignoring sibling dependency.\n", cpu, i, cpu_to_node(cpu), cpu_to_node(i));
-				continue;
-			}
+	return false;
+}

-			if (cpu_has(c, X86_FEATURE_TOPOEXT)) {
-				if (c->phys_proc_id == o->phys_proc_id &&
-				    per_cpu(cpu_llc_id, cpu) == per_cpu(cpu_llc_id, i) &&
-				    c->compute_unit_id == o->compute_unit_id)
-					link_thread_siblings(cpu, i);
-			} else if (c->phys_proc_id == o->phys_proc_id &&
-				   c->cpu_core_id == o->cpu_core_id) {
-				link_thread_siblings(cpu, i);
-			}
-		}
-	} else {
-		cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
-	}
+static bool __cpuinit match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
+
+	if (per_cpu(cpu_llc_id, cpu1) != BAD_APICID &&
+	    per_cpu(cpu_llc_id, cpu1) == per_cpu(cpu_llc_id, cpu2))
+		return topology_sane(c, o, "llc");
+
+	return false;
+}
+
+static bool __cpuinit match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+	if (c->phys_proc_id == o->phys_proc_id)
+		return topology_sane(c, o, "mc");
+
+	return false;
+}
+
+void __cpuinit set_cpu_sibling_map(int cpu)
+{
+	bool has_mc = boot_cpu_data.x86_max_cores > 1;
+	bool has_smt = smp_num_siblings > 1;
+	struct cpuinfo_x86 *c = &cpu_data(cpu);
+	struct cpuinfo_x86 *o;
+	int i;

-	cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
+	cpumask_set_cpu(cpu, cpu_sibling_setup_mask);

-	if (__this_cpu_read(cpu_info.x86_max_cores) == 1) {
-		cpumask_copy(cpu_core_mask(cpu), cpu_sibling_mask(cpu));
+	if (!has_smt && !has_mc) {
+		cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
+		cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
+		cpumask_set_cpu(cpu, cpu_core_mask(cpu));
 		c->booted_cores = 1;
 		return;
 	}

 	for_each_cpu(i, cpu_sibling_setup_mask) {
-		if (cpu_to_node(cpu) != cpu_to_node(i)) {
-			WARN_ONCE(1, "sched: CPU #%d's core-sibling CPU #%d not on the same node! [node %d != %d]. Ignoring sibling dependency.\n", cpu, i, cpu_to_node(cpu), cpu_to_node(i));
-			continue;
-		}
+		o = &cpu_data(i);

-		if (per_cpu(cpu_llc_id, cpu) != BAD_APICID &&
-		    per_cpu(cpu_llc_id, cpu) == per_cpu(cpu_llc_id, i)) {
-			cpumask_set_cpu(i, cpu_llc_shared_mask(cpu));
-			cpumask_set_cpu(cpu, cpu_llc_shared_mask(i));
-		}
+		if ((i == cpu) || (has_smt && match_smt(c, o)))
+			link_mask(sibling, cpu, i);
+
+		if ((i == cpu) || (has_mc && match_llc(c, o)))
+			link_mask(llc_shared, cpu, i);
+
+		if ((i == cpu) || (has_mc && match_mc(c, o))) {
+			link_mask(core, cpu, i);

-		if (c->phys_proc_id == cpu_data(i).phys_proc_id) {
-			cpumask_set_cpu(i, cpu_core_mask(cpu));
-			cpumask_set_cpu(cpu, cpu_core_mask(i));
 			/*
 			 * Does this new cpu bringup a new core?
 			 */
--
cgit v1.2.3

From 8e7fbcbc22c12414bcc9dfdd683637f58fb32759 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra
Date: Mon, 9 Jan 2012 11:28:35 +0100
Subject: sched: Remove stale power aware scheduling remnants and dysfunctional knobs

It's been broken forever (i.e. it's not scheduling in a power
aware fashion), as reported by Suresh and others sending
patches, and nobody cares enough to fix it properly ...
so remove it to make space free for something better.

There's various problems with the code as it stands today, first
and foremost the user interface which is bound to topology
levels and has multiple values per level. This results in a
state explosion which the administrator or distro needs to
master and almost nobody does.

Furthermore large configuration state spaces aren't good, it
means the thing doesn't just work right because it's either
under so many impossible to meet constraints, or even if
there's an achievable state workloads have to be aware of
it precisely and can never meet it for dynamic workloads.

So pushing this kind of decision to user-space was a bad idea
even with a single knob - it's exponentially worse with knobs
on every node of the topology.

There is a proposal to replace the user interface with a single
3 state knob:

 sched_balance_policy := { performance, power, auto }

where 'auto' would be the preferred default which looks at things
like Battery/AC mode and possible cpufreq state or whatever the hw
exposes to show us power use expectations - but there's been no
progress on it in the past many months.

Aside from that, the actual implementation of the various knobs
is known to be broken. There have been sporadic attempts at
fixing things but these always stop short of reaching a mergeable
state.

Therefore this wholesale removal with the hopes of spurring
people who care to come forward once again and work on a
coherent replacement.

Signed-off-by: Peter Zijlstra
Cc: Suresh Siddha
Cc: Arjan van de Ven
Cc: Vincent Guittot
Cc: Vaidyanathan Srinivasan
Cc: Linus Torvalds
Cc: Andrew Morton
Link: http://lkml.kernel.org/r/1326104915.2442.53.camel@twins
Signed-off-by: Ingo Molnar
---
 arch/x86/kernel/smpboot.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

(limited to 'arch')

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index e84c1bbea339..256c20cc5e96 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -429,8 +429,7 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	 * For perf, we return last level cache shared map.
 	 * And for power savings, we return cpu_core_map
 	 */
-	if ((sched_mc_power_savings || sched_smt_power_savings) &&
-	    !(cpu_has(c, X86_FEATURE_AMD_DCM)))
+	if (!(cpu_has(c, X86_FEATURE_AMD_DCM)))
 		return cpu_core_mask(cpu);
 	else
 		return cpu_llc_shared_mask(cpu);
--
cgit v1.2.3