summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)Author
2015-02-11mm: fix false-positive warning on exit due mm_nr_pmds(mm)Kirill A. Shutemov
The problem is that we check nr_ptes/nr_pmds in exit_mmap() which happens *before* pgd_free(). And if an arch does pte/pmd allocation in pgd_alloc() and frees them in pgd_free() we see offset in counters by the time of the checks. We tried to workaround this by offsetting expected counter value according to FIRST_USER_ADDRESS for both nr_pte and nr_pmd in exit_mmap(). But it doesn't work in some cases: 1. ARM with LPAE enabled also has non-zero USER_PGTABLES_CEILING, but upper addresses occupied with huge pmd entries, so the trick with offsetting expected counter value will get really ugly: we will have to apply it nr_pmds, but not nr_ptes. 2. Metag has non-zero FIRST_USER_ADDRESS, but doesn't do allocation pte/pmd page tables allocation in pgd_alloc(), just setup a pgd entry which is allocated at boot and shared accross all processes. The proposal is to move the check to check_mm() which happens *after* pgd_free() and do proper accounting during pgd_alloc() and pgd_free() which would bring counters to zero if nothing leaked. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Tyler Baker <tyler.baker@linaro.org> Tested-by: Tyler Baker <tyler.baker@linaro.org> Tested-by: Nishanth Menon <nm@ti.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: James Hogan <james.hogan@imgtec.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11mm: account pmd page tables to the processKirill A. Shutemov
Dave noticed that unprivileged process can allocate significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. Linux kernel doesn't account PMD tables to the process, only PTE. The use-cases below use few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0. #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #include <sys/prctl.h> #define PUD_SIZE (1UL << 30) #define PMD_SIZE (1UL << 21) #define NR_PUD 130000 int main(void) { char *addr = NULL; unsigned long i; prctl(PR_SET_THP_DISABLE); for (i = 0; i < NR_PUD ; i++) { addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror("mmap"); break; } *addr = 'x'; munmap(addr, PMD_SIZE); mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0); if (addr == MAP_FAILED) perror("re-mmap"), exit(1); } printf("PID %d consumed %lu KiB in PMD page tables\n", getpid(), i * 4096 >> 10); return pause(); } The patch addresses the issue by account PMD tables to the process the same way we account PTE. The main place where PMD tables is accounted is __pmd_alloc() and free_pmd_range(). But there're few corner cases: - HugeTLB can share PMD page tables. The patch handles by accounting the table to all processes who share it. - x86 PAE pre-allocates few PMD tables on fork. - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2). Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11oom, PM: make OOM detection in the freezer path racelessMichal Hocko
Commit 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM suspend") has left a race window when OOM killer manages to note_oom_kill after freeze_processes checks the counter. The race window is quite small and really unlikely and partial solution deemed sufficient at the time of submission. Tejun wasn't happy about this partial solution though and insisted on a full solution. That requires the full OOM and freezer's task freezing exclusion, though. This is done by this patch which introduces oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier. oom_killer_disabled check is moved from the allocation path to the OOM level and we take oom_sem for reading for both the check and the whole OOM invocation. oom_killer_disable() takes oom_sem for writing so it waits for all currently running OOM killer invocations. Then it disable all the further OOMs by setting oom_killer_disabled and checks for any oom victims. Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The last victim wakes up all waiters enqueued by oom_killer_disable(). Therefore this function acts as the full OOM barrier. The page fault path is covered now as well although it was assumed to be safe before. As per Tejun, "We used to have freezing points deep in file system code which may be reacheable from page fault." so it would be better and more robust to not rely on freezing points here. Same applies to the memcg OOM killer. out_of_memory tells the caller whether the OOM was allowed to trigger and the callers are supposed to handle the situation. The page allocation path simply fails the allocation same as before. The page fault path will retry the fault (more on that later) and Sysrq OOM trigger will simply complain to the log. Normally there wouldn't be any unfrozen user tasks after try_to_freeze_tasks so the function will not block. But if there was an OOM killer racing with try_to_freeze_tasks and the OOM victim didn't finish yet then we have to wait for it. This should complete in a finite time, though, because - the victim cannot loop in the page fault handler (it would die on the way out from the exception) - it cannot loop in the page allocator because all the further allocation would fail and __GFP_NOFAIL allocations are not acceptable at this stage - it shouldn't be blocked on any locks held by frozen tasks (try_to_freeze expects lockless context) and kernel threads and work queues are not frozen yet Signed-off-by: Michal Hocko <mhocko@suse.cz> Suggested-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11PM: convert printk to pr_* equivalentMichal Hocko
While touching this area let's convert printk to pr_*. This also makes the printing of continuation lines done properly. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11oom: add helpers for setting and clearing TIF_MEMDIEMichal Hocko
This patchset addresses a race which was described in the changelog for 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM suspend"): : PM freezer relies on having all tasks frozen by the time devices are : getting frozen so that no task will touch them while they are getting : frozen. But OOM killer is allowed to kill an already frozen task in order : to handle OOM situtation. In order to protect from late wake ups OOM : killer is disabled after all tasks are frozen. This, however, still keeps : a window open when a killed task didn't manage to die by the time : freeze_processes finishes. The original patch hasn't closed the race window completely because that would require a more complex solution as it can be seen by this patchset. The primary motivation was to close the race condition between OOM killer and PM freezer _completely_. As Tejun pointed out, even though the race condition is unlikely the harder it would be to debug weird bugs deep in the PM freezer when the debugging options are reduced considerably. I can only speculate what might happen when a task is still runnable unexpectedly. On a plus side and as a side effect the oom enable/disable has a better (full barrier) semantic without polluting hot paths. I have tested the series in KVM with 100M RAM: - many small tasks (20M anon mmap) which are triggering OOM continually - s2ram which resumes automatically is triggered in a loop echo processors > /sys/power/pm_test while true do echo mem > /sys/power/state sleep 1s done - simple module which allocates and frees 20M in 8K chunks. If it sees freezing(current) then it tries another round of allocation before calling try_to_freeze - debugging messages of PM stages and OOM killer enable/disable/fail added and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before it wakes up waiters. - rebased on top of the current mmotm which means some necessary updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but I think this should be OK because __thaw_task shouldn't interfere with any locking down wake_up_process. Oleg? As expected there are no OOM killed tasks after oom is disabled and allocations requested by the kernel thread are failing after all the tasks are frozen and OOM disabled. I wasn't able to catch a race where oom_killer_disable would really have to wait but I kinda expected the race is really unlikely. [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1 [ 243.636072] (elapsed 2.837 seconds) done. [ 243.641985] Trying to disable OOM killer [ 243.643032] Waiting for concurent OOM victims [ 243.644342] OOM killer disabled [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done. [ 243.652983] Suspending console(s) (use no_console_suspend to debug) [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010 [...] [ 243.992600] PM: suspend of devices complete after 336.667 msecs [ 243.993264] PM: late suspend of devices complete after 0.660 msecs [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs [ 243.994717] ACPI: Preparing to enter system sleep state S3 [ 243.994795] PM: Saving platform NVS memory [ 243.994796] Disabling non-boot CPUs ... The first 2 patches are simple cleanups for OOM. They should go in regardless the rest IMO. Patches 3 and 4 are trivial printk -> pr_info conversion and they should go in ditto. The main patch is the last one and I would appreciate acks from Tejun and Rafael. I think the OOM part should be OK (except for __thaw_task vs. task_lock where a look from Oleg would appreciated) but I am not so sure I haven't screwed anything in the freezer code. I have found several surprises there. This patch (of 5): This patch is just a preparatory and it doesn't introduce any functional change. Note: I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to wait for the oom victim and to prevent from new killing. This is just a side effect of the flag. The primary meaning is to give the oom victim access to the memory reserves and that shouldn't be necessary here. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) More iov_iter conversion work from Al Viro. [ The "crypto: switch af_alg_make_sg() to iov_iter" commit was wrong, and this pull actually adds an extra commit on top of the branch I'm pulling to fix that up, so that the pre-merge state is ok. - Linus ] 2) Various optimizations to the ipv4 forwarding information base trie lookup implementation. From Alexander Duyck. 3) Remove sock_iocb altogether, from CHristoph Hellwig. 4) Allow congestion control algorithm selection via routing metrics. From Daniel Borkmann. 5) Make ipv4 uncached route list per-cpu, from Eric Dumazet. 6) Handle rfs hash collisions more gracefully, also from Eric Dumazet. 7) Add xmit_more support to r8169, e1000, and e1000e drivers. From Florian Westphal. 8) Transparent Ethernet Bridging support for GRO, from Jesse Gross. 9) Add BPF packet actions to packet scheduler, from Jiri Pirko. 10) Add support for uniqu flow IDs to openvswitch, from Joe Stringer. 11) New NetCP ethernet driver, from Muralidharan Karicheri and Wingman Kwok. 12) More sanely handle out-of-window dupacks, which can result in serious ACK storms. From Neal Cardwell. 13) Various rhashtable bug fixes and enhancements, from Herbert Xu, Patrick McHardy, and Thomas Graf. 14) Support xmit_more in be2net, from Sathya Perla. 15) Group Policy extensions for vxlan, from Thomas Graf. 16) Remove Checksum Offload support for vxlan, from Tom Herbert. 17) Like ipv4, support lockless transmit over ipv6 UDP sockets. From Vlad Yasevich. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1494+1 commits) crypto: fix af_alg_make_sg() conversion to iov_iter ipv4: Namespecify TCP PMTU mechanism i40e: Fix for stats init function call in Rx setup tcp: don't include Fast Open option in SYN-ACK on pure SYN-data openvswitch: Only set TUNNEL_VXLAN_OPT if VXLAN-GBP metadata is set ipv6: Make __ipv6_select_ident static ipv6: Fix fragment id assignment on LE arches. bridge: Fix inability to add non-vlan fdb entry net: Mellanox: Delete unnecessary checks before the function call "vunmap" cxgb4: Add support in cxgb4 to get expansion rom version via ethtool ethtool: rename reserved1 memeber in ethtool_drvinfo for expansion ROM version net: dsa: Remove redundant phy_attach() IB/mlx4: Reset flow support for IB kernel ULPs IB/mlx4: Always use the correct port for mirrored multicast attachments net/bonding: Fix potential bad memory access during bonding events tipc: remove tipc_snprintf tipc: nl compat add noop and remove legacy nl framework tipc: convert legacy nl stats show to nl compat tipc: convert legacy nl net id get to nl compat tipc: convert legacy nl net id set to nl compat ...
2015-02-10Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree changes from Jiri Kosina: "Patches from trivial.git that keep the world turning around. Mostly documentation and comment fixes, and a two corner-case code fixes from Alan Cox" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: kexec, Kconfig: spell "architecture" properly mm: fix cleancache debugfs directory path blackfin: mach-common: ints-priority: remove unused function doubletalk: probe failure causes OOPS ARM: cache-l2x0.c: Make it clear that cache-l2x0 handles L310 cache controller msdos_fs.h: fix 'fields' in comment scsi: aic7xxx: fix comment ARM: l2c: fix comment ibmraid: fix writeable attribute with no store method dynamic_debug: fix comment doc: usbmon: fix spelling s/unpriviledged/unprivileged/ x86: init_mem_mapping(): use capital BIOS in comment
2015-02-10Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching Pull live patching infrastructure from Jiri Kosina: "Let me provide a bit of history first, before describing what is in this pile. Originally, there was kSplice as a standalone project that implemented stop_machine()-based patching for the linux kernel. This project got later acquired, and the current owner is providing live patching as a proprietary service, without any intentions to have their implementation merged. Then, due to rising user/customer demand, both Red Hat and SUSE started working on their own implementation (not knowing about each other), and announced first versions roughly at the same time [1] [2]. The principle difference between the two solutions is how they are making sure that the patching is performed in a consistent way when it comes to different execution threads with respect to the semantic nature of the change that is being introduced. In a nutshell, kPatch is issuing stop_machine(), then looking at stacks of all existing processess, and if it decides that the system is in a state that can be patched safely, it proceeds insterting code redirection machinery to the patched functions. On the other hand, kGraft provides a per-thread consistency during one single pass of a process through the kernel and performs a lazy contignuous migration of threads from "unpatched" universe to the "patched" one at safe checkpoints. If interested in a more detailed discussion about the consistency models and its possible combinations, please see the thread that evolved around [3]. It pretty quickly became obvious to the interested parties that it's absolutely impractical in this case to have several isolated solutions for one task to co-exist in the kernel. During a dedicated Live Kernel Patching track at LPC in Dusseldorf, all the interested parties sat together and came up with a joint aproach that would work for both distro vendors. Steven Rostedt took notes [4] from this meeting. And the foundation for that aproach is what's present in this pull request. It provides a basic infrastructure for function "live patching" (i.e. code redirection), including API for kernel modules containing the actual patches, and API/ABI for userspace to be able to operate on the patches (look up what patches are applied, enable/disable them, etc). It's relatively simple and minimalistic, as it's making use of existing kernel infrastructure (namely ftrace) as much as possible. It's also self-contained, in a sense that it doesn't hook itself in any other kernel subsystem (it doesn't even touch any other code). It's now implemented for x86 only as a reference architecture, but support for powerpc, s390 and arm is already in the works (adding arch-specific support basically boils down to teaching ftrace about regs-saving). Once this common infrastructure gets merged, both Red Hat and SUSE have agreed to immediately start porting their current solutions on top of this, abandoning their out-of-tree code. The plan basically is that each patch will be marked by flag(s) that would indicate which consistency model it is willing to use (again, the details have been sketched out already in the thread at [3]). Before this happens, the current codebase can be used to patch a large group of secruity/stability problems the patches for which are not too complex (in a sense that they don't introduce non-trivial change of function's return value semantics, they don't change layout of data structures, etc) -- this corresponds to LEAVE_FUNCTION && SWITCH_FUNCTION semantics described at [3]. This tree has been in linux-next since December. [1] https://lkml.org/lkml/2014/4/30/477 [2] https://lkml.org/lkml/2014/7/14/857 [3] https://lkml.org/lkml/2014/11/7/354 [4] http://linuxplumbersconf.org/2014/wp-content/uploads/2014/10/LPC2014_LivePatching.txt [ The core code is introduced by the three commits authored by Seth Jennings, which got a lot of changes incorporated during numerous respins and reviews of the initial implementation. All the followup commits have materialized only after public tree has been created, so they were not folded into initial three commits so that the public tree doesn't get rebased ]" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching: livepatch: add missing newline to error message livepatch: rename config to CONFIG_LIVEPATCH livepatch: fix uninitialized return value livepatch: support for repatching a function livepatch: enforce patch stacking semantics livepatch: change ARCH_HAVE_LIVE_PATCHING to HAVE_LIVE_PATCHING livepatch: fix deferred module patching order livepatch: handle ancient compilers with more grace livepatch: kconfig: use bool instead of boolean livepatch: samples: fix usage example comments livepatch: MAINTAINERS: add git tree location livepatch: use FTRACE_OPS_FL_IPMODIFY livepatch: move x86 specific ftrace handler code to arch/x86 livepatch: samples: add sample live patching module livepatch: kernel: add support for live patching livepatch: kernel: add TAINT_LIVEPATCH
2015-02-10Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc updates from Andrew Morton: "Bite-sized chunks this time, to avoid the MTA ratelimiting woes. - fs/notify updates - ocfs2 - some of MM" That laconic "some MM" is mainly the removal of remap_file_pages(), which is a big simplification of the VM, and which gets rid of a *lot* of random cruft and special cases because we no longer support the non-linear mappings that it used. From a user interface perspective, nothing has changed, because the remap_file_pages() syscall still exists, it's just done by emulating the old behavior by creating a lot of individual small mappings instead of one non-linear one. The emulation is slower than the old "native" non-linear mappings, but nobody really uses or cares about remap_file_pages(), and simplifying the VM is a big advantage. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (78 commits) memcg: zap memcg_slab_caches and memcg_slab_mutex memcg: zap memcg_name argument of memcg_create_kmem_cache memcg: zap __memcg_{charge,uncharge}_slab mm/page_alloc.c: place zone_id check before VM_BUG_ON_PAGE check mm: hugetlb: fix type of hugetlb_treat_as_movable variable mm, hugetlb: remove unnecessary lower bound on sysctl handlers"? mm: memory: merge shared-writable dirtying branches in do_wp_page() mm: memory: remove ->vm_file check on shared writable vmas xtensa: drop _PAGE_FILE and pte_file()-related helpers x86: drop _PAGE_FILE and pte_file()-related helpers unicore32: drop pte_file()-related helpers um: drop _PAGE_FILE and pte_file()-related helpers tile: drop pte_file()-related helpers sparc: drop pte_file()-related helpers sh: drop _PAGE_FILE and pte_file()-related helpers score: drop _PAGE_FILE and pte_file()-related helpers s390: drop pte_file()-related helpers parisc: drop _PAGE_FILE and pte_file()-related helpers openrisc: drop _PAGE_FILE and pte_file()-related helpers nios2: drop _PAGE_FILE and pte_file()-related helpers ...
2015-02-10Merge tag 'pm+acpi-3.20-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management updates from Rafael Wysocki: "We have a few new features this time, including a new SFI-based cpufreq driver, a new devfreq driver for Tegra Activity Monitor, a new devfreq class for providing its governors with raw utilization data and a new ACPI driver for AMD SoCs. Still, the majority of changes here are reworks of existing code to make it more straightforward or to prepare it for implementing new features on top of it. The primary example is the rework of ACPI resources handling from Jiang Liu, Thomas Gleixner and Lv Zheng with support for IOAPIC hotplug implemented on top of it, but there is quite a number of changes of this kind in the cpufreq core, ACPICA, ACPI EC driver, ACPI processor driver and the generic power domains core code too. The most active developer is Viresh Kumar with his cpufreq changes. Specifics: - Rework of the core ACPI resources parsing code to fix issues in it and make using resource offsets more convenient and consolidation of some resource-handing code in a couple of places that have grown analagous data structures and code to cover the the same gap in the core (Jiang Liu, Thomas Gleixner, Lv Zheng). - ACPI-based IOAPIC hotplug support on top of the resources handling rework (Jiang Liu, Yinghai Lu). - ACPICA update to upstream release 20150204 including an interrupt handling rework that allows drivers to install raw handlers for ACPI GPEs which then become entirely responsible for the given GPE and the ACPICA core code won't touch it (Lv Zheng, David E Box, Octavian Purdila). - ACPI EC driver rework to fix several concurrency issues and other problems related to events handling on top of the ACPICA's new support for raw GPE handlers (Lv Zheng). - New ACPI driver for AMD SoCs analogous to the LPSS (Low-Power Subsystem) driver for Intel chips (Ken Xue). - Two minor fixes of the ACPI LPSS driver (Heikki Krogerus, Jarkko Nikula). - Two new blacklist entries for machines (Samsung 730U3E/740U3E and 510R) where the native backlight interface doesn't work correctly while the ACPI one does (Hans de Goede). - Rework of the ACPI processor driver's handling of idle states to make the code more straightforward and less bloated overall (Rafael J Wysocki). - Assorted minor fixes related to ACPI and SFI (Andreas Ruprecht, Andy Shevchenko, Hanjun Guo, Jan Beulich, Rafael J Wysocki, Yaowei Bai). - PCI core power management modification to avoid resuming (some) runtime-suspended devices during system suspend if they are in the right states already (Rafael J Wysocki). - New SFI-based cpufreq driver for Intel platforms using SFI (Srinidhi Kasagar). - cpufreq core fixes, cleanups and simplifications (Viresh Kumar, Doug Anderson, Wolfram Sang). - SkyLake CPU support and other updates for the intel_pstate driver (Kristen Carlson Accardi, Srinivas Pandruvada). - cpufreq-dt driver cleanup (Markus Elfring). - Init fix for the ARM big.LITTLE cpuidle driver (Sudeep Holla). - Generic power domains core code fixes and cleanups (Ulf Hansson). - Operating Performance Points (OPP) core code cleanups and kernel documentation update (Nishanth Menon). - New dabugfs interface to make the list of PM QoS constraints available to user space (Nishanth Menon). - New devfreq driver for Tegra Activity Monitor (Tomeu Vizoso). - New devfreq class (devfreq_event) to provide raw utilization data to devfreq governors (Chanwoo Choi). - Assorted minor fixes and cleanups related to power management (Andreas Ruprecht, Krzysztof Kozlowski, Rickard Strandqvist, Pavel Machek, Todd E Brandt, Wonhong Kwon). - turbostat updates (Len Brown) and cpupower Makefile improvement (Sriram Raghunathan)" * tag 'pm+acpi-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (151 commits) tools/power turbostat: relax dependency on APERF_MSR tools/power turbostat: relax dependency on invariant TSC Merge branch 'pci/host-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci into acpi-resources tools/power turbostat: decode MSR_*_PERF_LIMIT_REASONS tools/power turbostat: relax dependency on root permission ACPI / video: Add disable_native_backlight quirk for Samsung 510R ACPI / PM: Remove unneeded nested #ifdef USB / PM: Remove unneeded #ifdef and associated dead code intel_pstate: provide option to only use intel_pstate with HWP ACPI / EC: Add GPE reference counting debugging messages ACPI / EC: Add query flushing support ACPI / EC: Refine command storm prevention support ACPI / EC: Add command flushing support. ACPI / EC: Introduce STARTED/STOPPED flags to replace BLOCKED flag ACPI: add AMD ACPI2Platform device support for x86 system ACPI / table: remove duplicate NULL check for the handler of acpi_table_parse() ACPI / EC: Update revision due to raw handler mode. ACPI / EC: Reduce ec_poll() by referencing the last register access timestamp. ACPI / EC: Fix several GPE handling issues by deploying ACPI_GPE_DISPATCH_RAW_HANDLER mode. ACPICA: Events: Enable APIs to allow interrupt/polling adaptive request based GPE handling model ...
2015-02-10mm, hugetlb: remove unnecessary lower bound on sysctl handlers"?Andrey Ryabinin
Commit ed4d4902ebdd ("mm, hugetlb: remove hugetlb_zero and hugetlb_infinity") replaced 'unsigned long hugetlb_zero' with 'int zero' leading to out-of-bounds access in proc_doulongvec_minmax(). Use '.extra1 = NULL' instead of '.extra1 = &zero'. Passing NULL is equivalent to passing minimal value, which is 0 for unsigned types. Fixes: ed4d4902ebdd ("mm, hugetlb: remove hugetlb_zero and hugetlb_infinity") Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Suggested-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-10rmap: drop support of non-linear mappingsKirill A. Shutemov
We don't create non-linear mappings anymore. Let's drop code which handles them in rmap. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-10Merge branches 'pm-sleep' and 'pm-runtime'Rafael J. Wysocki
* pm-sleep: PM / hibernate: exclude freed pages from allocated pages printout PM / sleep: export suspend_resume trace event PM / sleep: Mention async suspend in PM_TRACE documentation PM / hibernate: Remove unused function * pm-runtime: ACPI / PM: Remove unneeded nested #ifdef USB / PM: Remove unneeded #ifdef and associated dead code
2015-02-10Merge branches 'pm-qos', 'pm-opp' and 'pm-devfreq'Rafael J. Wysocki
* pm-qos: PM / QoS: Use lockdep asserts to find missing hold of power.lock PM / QoS: Add debugfs support to view the list of constraints * pm-opp: PM / OPP: Assert RCU lock in exported functions PM / OPP: Update kernel documentation PM / OPP: Ensure consistent naming of static functions PM / OPP: export dev_pm_opp_get_notifier * pm-devfreq: PM / devfreq: event: Add documentation for exynos-ppmu devfreq-event driver devfreq: Fix build break of devfreq-event class PM / devfreq: event: Add devfreq_event class PM / devfreq: tegra: add devfreq driver for Tegra Activity Monitor
2015-02-10Merge branch 'acpi-resources'Rafael J. Wysocki
* acpi-resources: (23 commits) Merge branch 'pci/host-generic' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci into acpi-resources x86/irq, ACPI: Implement ACPI driver to support IOAPIC hotplug ACPI: Add interfaces to parse IOAPIC ID for IOAPIC hotplug x86/PCI: Refine the way to release PCI IRQ resources x86/PCI/ACPI: Use common ACPI resource interfaces to simplify implementation x86/PCI: Fix the range check for IO resources PCI: Use common resource list management code instead of private implementation resources: Move struct resource_list_entry from ACPI into resource core ACPI: Introduce helper function acpi_dev_filter_resource_type() ACPI: Add field offset to struct resource_list_entry ACPI: Translate resource into master side address for bridge window resources ACPI: Return translation offset when parsing ACPI address space resources ACPI: Enforce stricter checks for address space descriptors ACPI: Set flag IORESOURCE_UNSET for unassigned resources ACPI: Normalize return value of resource parser functions ACPI: Fix a bug in parsing ACPI Memory24 resource ACPI: Add prefetch decoding to the address space parser ACPI: Move the window flag logic to the combined parser ACPI: Unify the parsing of address_space and ext_address_space ACPI: Let the parser return false for disabled resources ...
2015-02-09Merge branch 'timers-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Ingo Molnar: "The main changes in this cycle were: - rework hrtimer expiry calculation in hrtimer_interrupt(): the previous code had a subtle bug where expiry caching would miss an expiry, resulting in occasional bogus (late) expiry of hrtimers. - continuing Y2038 fixes - ktime division optimization - misc smaller fixes and cleanups" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimer: Make __hrtimer_get_next_event() static rtc: Convert rtc_set_ntp_time() to use timespec64 rtc: Remove redundant rtc_valid_tm() from rtc_hctosys() rtc: Modify rtc_hctosys() to address y2038 issues rtc: Update rtc-dev to use y2038-safe time interfaces rtc: Update interface.c to use y2038-safe time interfaces time: Expose get_monotonic_boottime64 for in-kernel use time: Expose getboottime64 for in-kernel uses ktime: Optimize ktime_divns for constant divisors hrtimer: Prevent stale expiry time in hrtimer_interrupt() ktime.h: Introduce ktime_ms_delta
2015-02-09Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main scheduler changes in this cycle were: - various sched/deadline fixes and enhancements - rescheduling latency fixes/cleanups - rework the rq->clock code to be more consistent and more robust. - minor micro-optimizations - ->avg.decay_count fixes - add a stack overflow check to might_sleep() - idle-poll handler fix, possibly resulting in power savings - misc smaller updates and fixes" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/Documentation: Remove unneeded word sched/wait: Introduce wait_on_bit_timeout() sched: Pull resched loop to __schedule() callers sched/deadline: Remove cpu_active_mask from cpudl_find() sched: Fix hrtick_start() on UP sched/deadline: Avoid pointless __setscheduler() sched/deadline: Fix stale yield state sched/deadline: Fix hrtick for a non-leftmost task sched/deadline: Modify cpudl::free_cpus to reflect rd->online sched/idle: Add missing checks to the exit condition of cpu_idle_poll() sched: Fix missing preemption opportunity sched/rt: Reduce rq lock contention by eliminating locking of non-feasible target sched/debug: Print rq->clock_task sched/core: Rework rq->clock update skips sched/core: Validate rq_clock*() serialization sched/core: Remove check of p->sched_class sched/fair: Fix sched_entity::avg::decay_count initialization sched/debug: Fix potential call to __ffs(0) in sched_show_task() sched/debug: Check for stack overflow in ___might_sleep() sched/fair: Fix the dealing with decay_count in __synchronize_entity_decay()
2015-02-09Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Ingo Molnar: "Kernel side changes: - AMD range breakpoints support: Extend breakpoint tools and core to support address range through perf event with initial backend support for AMD extended breakpoints. The syntax is: perf record -e mem:addr/len:type For example set write breakpoint from 0x1000 to 0x1200 (0x1000 + 512) perf record -e mem:0x1000/512:w - event throttling/rotating fixes - various event group handling fixes, cleanups and general paranoia code to be more robust against bugs in the future. - kernel stack overhead fixes User-visible tooling side changes: - Show precise number of samples in at the end of a 'record' session, if processing build ids, since we will then traverse the whole perf.data file and see all the PERF_RECORD_SAMPLE records, otherwise stop showing the previous off-base heuristicly counted number of "samples" (Namhyung Kim). - Support to read compressed module from build-id cache (Namhyung Kim) - Enable sampling loads and stores simultaneously in 'perf mem' (Stephane Eranian) - 'perf diff' output improvements (Namhyung Kim) - Fix error reporting for evsel pgfault constructor (Arnaldo Carvalho de Melo) Tooling side infrastructure changes: - Cache eh/debug frame offset for dwarf unwind (Namhyung Kim) - Support parsing parameterized events (Cody P Schafer) - Add support for IP address formats in libtraceevent (David Ahern) Plus other misc fixes" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits) perf: Decouple unthrottling and rotating perf: Drop module reference on event init failure perf: Use POLLIN instead of POLL_IN for perf poll data in flag perf: Fix put_event() ctx lock perf: Fix move_group() order perf: Fix event->ctx locking perf: Add a bit of paranoia perf symbols: Convert lseek + read to pread perf tools: Use perf_data_file__fd() consistently perf symbols: Support to read compressed module from build-id cache perf evsel: Set attr.task bit for a tracking event perf header: Set header version correctly perf record: Show precise number of samples perf tools: Do not use __perf_session__process_events() directly perf callchain: Cache eh/debug frame offset for dwarf unwind perf tools: Provide stub for missing pthread_attr_setaffinity_np perf evsel: Don't rely on malloc working for sz 0 tools lib traceevent: Add support for IP address formats perf ui/tui: Show fatal error message only if exists perf tests: Fix typo in sample-parsing.c ...
2015-02-09Merge branch 'locking-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core locking updates from Ingo Molnar: "The main changes are: - mutex, completions and rtmutex micro-optimizations - lock debugging fix - various cleanups in the MCS and the futex code" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rtmutex: Optimize setting task running after being blocked locking/rwsem: Use task->state helpers sched/completion: Add lock-free checking of the blocking case sched/completion: Remove unnecessary ->wait.lock serialization when reading completion state locking/mutex: Explicitly mark task as running after wakeup futex: Fix argument handling in futex_lock_pi() calls doc: Fix misnamed FUTEX_CMP_REQUEUE_PI op constants locking/Documentation: Update code path softirq/preempt: Add missing current->preempt_disable_ip update locking/osq: No need for load/acquire when acquire-polling locking/mcs: Better differentiate between MCS variants locking/mutex: Introduce ww_mutex_set_context_slowpath() locking/mutex: Move MCS related comments to proper location locking/mutex: Checking the stamp is WW only
2015-02-09Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: "The main RCU changes in this cycle are: - Documentation updates. - Miscellaneous fixes. - Preemptible-RCU fixes, including fixing an old bug in the interaction of RCU priority boosting and CPU hotplug. - SRCU updates. - RCU CPU stall-warning updates. - RCU torture-test updates" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) rcu: Initialize tiny RCU stall-warning timeouts at boot rcu: Fix RCU CPU stall detection in tiny implementation rcu: Add GP-kthread-starvation checks to CPU stall warnings rcu: Make cond_resched_rcu_qs() apply to normal RCU flavors rcu: Optionally run grace-period kthreads at real-time priority ksoftirqd: Use new cond_resched_rcu_qs() function ksoftirqd: Enable IRQs and call cond_resched() before poking RCU rcutorture: Add more diagnostics in rcu_barrier() test failure case torture: Flag console.log file to prevent holdovers from earlier runs torture: Add "-enable-kvm -soundhw pcspk" to qemu command line rcutorture: Handle different mpstat versions rcutorture: Check from beginning to end of grace period rcu: Remove redundant rcu_batches_completed() declaration rcutorture: Drop rcu_torture_completed() and friends rcu: Provide rcu_batches_completed_sched() for TINY_RCU rcutorture: Use unsigned for Reader Batch computations rcutorture: Make build-output parsing correctly flag RCU's warnings rcu: Make _batches_completed() functions return unsigned long rcutorture: Issue warnings on close calls due to Reader Batch blows documentation: Fix smp typo in memory-barriers.txt ...
2015-02-06Merge branches 'timers-urgent-for-linus' and 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer and x86 fix from Ingo Molnar: "A CLOCK_TAI early expiry fix and an x86 microcode driver oops fix" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimer: Fix incorrect tai offset calculation for non high-res timer systems * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, microcode: Return error from driver init code when loader is disabled
2015-02-06Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Misc fixes" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Fix deadline parameter modification handling sched/wait: Remove might_sleep() from wait_event_cmd() sched: Fix crash if cpuset_cpumask_can_shrink() is passed an empty cpumask sched/fair: Avoid using uninitialized variable in preferred_group_nid()
2015-02-06Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core kernel fixes from Ingo Molnar: "Two liblockdep fixes and a CPU hotplug race fix" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tools/liblockdep: don't include host headers tools/liblockdep: ignore generated .so file smpboot: Add missing get_online_cpus() in smpboot_register_percpu_thread()
2015-02-06livepatch: add missing newline to error messageJosh Poimboeuf
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-02-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/vxlan.c drivers/vhost/net.c include/linux/if_vlan.h net/core/dev.c The net/core/dev.c conflict was the overlap of one commit marking an existing function static whilst another was adding a new function. In the include/linux/if_vlan.h case, the type used for a local variable was changed in 'net', whereas the function got rewritten to fix a stacked vlan bug in 'net-next'. In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next' overlapped with an endainness fix for VHOST 1.0 in 'net'. In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter in 'net-next' whereas in 'net' there was a bug fix to pass in the correct network namespace pointer in calls to this function. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05resources: Move struct resource_list_entry from ACPI into resource coreJiang Liu
Currently ACPI, PCI and pnp all implement the same resource list management with different data structure. We need to transfer from one data structure into another when passing resources from one subsystem into another subsystem. So move struct resource_list_entry from ACPI into resource core and rename it as resource_entry, then it could be reused by different subystems and avoid the data structure conversion. Introduce dedicated header file resource_ext.h instead of embedding it into ioport.h to avoid header file inclusion order issues. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Acked-by: Vinod Koul <vinod.koul@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-02-05hrtimer: Fix incorrect tai offset calculation for non high-res timer systemsJohn Stultz
I noticed some CLOCK_TAI timer test failures on one of my less-frequently used configurations. And after digging in I found in 76f4108892d9 (Cleanup hrtimer accessors to the timekepeing state), the hrtimer_get_softirq_time tai offset calucation was incorrectly rewritten, as the tai offset we return shold be from CLOCK_MONOTONIC, and not CLOCK_REALTIME. This results in CLOCK_TAI timers expiring early on non-highres capable machines. This patch fixes the issue, calculating the tai time properly from the monotonic base. Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable <stable@vger.kernel.org> # 3.17+ Link: http://lkml.kernel.org/r/1423097126-10236-1-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04livepatch: rename config to CONFIG_LIVEPATCHJosh Poimboeuf
Rename CONFIG_LIVE_PATCHING to CONFIG_LIVEPATCH to make the naming of the config and the code more consistent. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-02-04perf: Decouple unthrottling and rotatingMark Rutland
Currently the adjusments made as part of perf_event_task_tick() use the percpu rotation lists to iterate over any active PMU contexts, but these are not used by the context rotation code, having been replaced by separate (per-context) hrtimer callbacks. However, some manipulation of the rotation lists (i.e. removal of contexts) has remained in perf_rotate_context(). This leads to the following issues: * Contexts are not always removed from the rotation lists. Removal of PMUs which have been placed in rotation lists, but have not been removed by a hrtimer callback can result in corruption of the rotation lists (when memory backing the context is freed). This has been observed to result in hangs when PMU drivers built as modules are inserted and removed around the creation of events for said PMUs. * Contexts which do not require rotation may be removed from the rotation lists as a result of a hrtimer, and will not be considered by the unthrottling code in perf_event_task_tick. This patch fixes the issue by updating the rotation ist when events are scheduled in/out, ensuring that each rotation list stays in sync with the HW state. As each event holds a refcount on the module of its PMU, this ensures that when a PMU module is unloaded none of its CPU contexts can be in a rotation list. By maintaining a list of perf_event_contexts rather than perf_event_cpu_contexts, we don't need separate paths to handle the cpu and task contexts, which also makes the code a little simpler. As the rotation_list variables are not used for rotation, these are renamed to active_ctx_list, which better matches their current function. perf_pmu_rotate_{start,stop} are renamed to perf_pmu_ctx_{activate,deactivate}. Reported-by: Johannes Jensen <johannes.jensen@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Will Deacon <Will.Deacon@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20150129134511.GR17721@leverpostej Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04perf: Drop module reference on event init failureMark Rutland
When initialising an event, perf_init_event will call try_module_get() to ensure that the PMU's module cannot be removed for the lifetime of the event, with __free_event() dropping the reference when the event is finally destroyed. If something fails after the event has been initialised, but before the event is installed, perf_event_alloc will drop the reference on the module. However, if we fail to initialise an event for some reason (e.g. we ask an uncore PMU to perform sampling, and it refuses to initialise the event), we do not drop the refcount. If we try to open such a bogus event without a precise IDR type, we will loop over each PMU in the pmus list, incrementing each of their refcounts without decrementing them. This patch adds a module_put when pmu->event_init(event) fails, ensuring that the refcounts are balanced in failure cases. As the innards of the precise and search based initialisation look very similar, this logic is hoisted out into a new helper function. While the early return for the failed try_module_get is removed from the search case, this is handled by the remaining return when ret is not -ENOENT. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1420642611-22667-1-git-send-email-mark.rutland@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04perf: Use POLLIN instead of POLL_IN for perf poll data in flagJiri Olsa
Currently we flag available data (via poll syscall) on perf fd with POLL_IN macro, which is normally used for SIGIO interface. We've been lucky, because POLLIN (0x1) is subset of POLL_IN (0x20001) and sys_poll (do_pollfd function) cut the extra bit out (0x20000). Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1422467678-22341-1-git-send-email-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04perf: Fix put_event() ctx lockPeter Zijlstra
So what I suspect; but I'm in zombie mode today it seems; is that while I initially thought that it was impossible for ctx to change when refcount dropped to 0, I now suspect its possible. Note that until perf_remove_from_context() the event is still active and visible on the lists. So a concurrent sys_perf_event_open() from another task into this task can race. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Stephane Eranian <eranian@gmail.com> Cc: mark.rutland@arm.com Cc: Jiri Olsa <jolsa@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20150129134434.GB26304@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-04perf: Fix move_group() orderPeter Zijlstra (Intel)
Jiri reported triggering the new WARN_ON_ONCE in event_sched_out over the weekend: event_sched_out.isra.79+0x2b9/0x2d0 group_sched_out+0x69/0xc0 ctx_sched_out+0x106/0x130 task_ctx_sched_out+0x37/0x70 __perf_install_in_context+0x70/0x1a0 remote_function+0x48/0x60 generic_exec_single+0x15b/0x1d0 smp_call_function_single+0x67/0xa0 task_function_call+0x53/0x80 perf_install_in_context+0x8b/0x110 I think the below should cure this; if we install a group leader it will iterate the (still intact) group list and find its siblings and try and install those too -- even though those still have the old event->ctx -- in the new ctx. Upon installing the first group sibling we'd try and schedule ou