summaryrefslogtreecommitdiffstats
path: root/drivers/acpi/apei/ghes.c
AgeCommit message (Collapse)Author
2014-06-16ACPICA: Restore error table definitions to reduce code differences between ↵Lv Zheng
Linux and ACPICA upstream. The following commit has changed ACPICA table header definitions: Commit: 88f074f4871a8c212b212b725e4dcdcdb09613c1 Subject: ACPI, CPER: Update cper info While such definitions are currently maintained in ACPICA. As the modifications applying to the table definitions affect other OSPMs' drivers, it is very difficult for ACPICA to initiate a process to complete the merge. Thus this commit finally only leaves us divergences. Revert such naming modifications to reduce the source code differecnes between Linux and ACPICA upstream. No functional changes. Signed-off-by: Lv Zheng <lv.zheng@intel.com> Cc: Bob Moore <robert.moore@intel.com> Cc: Chen, Gong <gong.chen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Borislav Petkov <bp@alien8.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-01-24Merge tag 'pm+acpi-3.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management updates from Rafael Wysocki: "As far as the number of commits goes, the top spot belongs to ACPI this time with cpufreq in the second position and a handful of PM core, PNP and cpuidle updates. They are fixes and cleanups mostly, as usual, with a couple of new features in the mix. The most visible change is probably that we will create struct acpi_device objects (visible in sysfs) for all devices represented in the ACPI tables regardless of their status and there will be a new sysfs attribute under those objects allowing user space to check that status via _STA. Consequently, ACPI device eject or generally hot-removal will not delete those objects, unless the table containing the corresponding namespace nodes is unloaded, which is extremely rare. Also ACPI container hotplug will be handled quite a bit differently and cpufreq will support CPU boost ("turbo") generically and not only in the acpi-cpufreq driver. Specifics: - ACPI core changes to make it create a struct acpi_device object for every device represented in the ACPI tables during all namespace scans regardless of the current status of that device. In accordance with this, ACPI hotplug operations will not delete those objects, unless the underlying ACPI tables go away. - On top of the above, new sysfs attribute for ACPI device objects allowing user space to check device status by triggering the execution of _STA for its ACPI object. From Srinivas Pandruvada. - ACPI core hotplug changes reducing code duplication, integrating the PCI root hotplug with the core and reworking container hotplug. - ACPI core simplifications making it use ACPI_COMPANION() in the code "glueing" ACPI device objects to "physical" devices. - ACPICA update to upstream version 20131218. This adds support for the DBG2 and PCCT tables to ACPICA, fixes some bugs and improves debug facilities. From Bob Moore, Lv Zheng and Betty Dall. - Init code change to carry out the early ACPI initialization earlier. That should allow us to use ACPI during the timekeeping initialization and possibly to simplify the EFI initialization too. From Chun-Yi Lee. - Clenups of the inclusions of ACPI headers in many places all over from Lv Zheng and Rashika Kheria (work in progress). - New helper for ACPI _DSM execution and rework of the code in drivers that uses _DSM to execute it via the new helper. From Jiang Liu. - New Win8 OSI blacklist entries from Takashi Iwai. - Assorted ACPI fixes and cleanups from Al Stone, Emil Goode, Hanjun Guo, Lan Tianyu, Masanari Iida, Oliver Neukum, Prarit Bhargava, Rashika Kheria, Tang Chen, Zhang Rui. - intel_pstate driver updates, including proper Baytrail support, from Dirk Brandewie and intel_pstate documentation from Ramkumar Ramachandra. - Generic CPU boost ("turbo") support for cpufreq from Lukasz Majewski. - powernow-k6 cpufreq driver fixes from Mikulas Patocka. - cpufreq core fixes and cleanups from Viresh Kumar, Jane Li, Mark Brown. - Assorted cpufreq drivers fixes and cleanups from Anson Huang, John Tobias, Paul Bolle, Paul Walmsley, Sachin Kamat, Shawn Guo, Viresh Kumar. - cpuidle cleanups from Bartlomiej Zolnierkiewicz. - Support for hibernation APM events from Bin Shi. - Hibernation fix to avoid bringing up nonboot CPUs with ACPI EC disabled during thaw transitions from Bjørn Mork. - PM core fixes and cleanups from Ben Dooks, Leonardo Potenza, Ulf Hansson. - PNP subsystem fixes and cleanups from Dmitry Torokhov, Levente Kurusa, Rashika Kheria. - New tool for profiling system suspend from Todd E Brandt and a cpupower tool cleanup from One Thousand Gnomes" * tag 'pm+acpi-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (153 commits) thermal: exynos: boost: Automatic enable/disable of BOOST feature (at Exynos4412) cpufreq: exynos4x12: Change L0 driver data to CPUFREQ_BOOST_FREQ Documentation: cpufreq / boost: Update BOOST documentation cpufreq: exynos: Extend Exynos cpufreq driver to support boost cpufreq / boost: Kconfig: Support for software-managed BOOST acpi-cpufreq: Adjust the code to use the common boost attribute cpufreq: Add boost frequency support in core intel_pstate: Add trace point to report internal state. cpufreq: introduce cpufreq_generic_get() routine ARM: SA1100: Create dummy clk_get_rate() to avoid build failures cpufreq: stats: create sysfs entries when cpufreq_stats is a module cpufreq: stats: free table and remove sysfs entry in a single routine cpufreq: stats: remove hotplug notifiers cpufreq: stats: handle cpufreq_unregister_driver() and suspend/resume properly cpufreq: speedstep: remove unused speedstep_get_state platform: introduce OF style 'modalias' support for platform bus PM / tools: new tool for suspend/resume performance optimization ACPI: fix module autoloading for ACPI enumerated devices ACPI: add module autoloading support for ACPI enumerated devices ACPI: fix create_modalias() return value handling ...
2013-12-21ACPI, APEI, GHES: Cleanup ghes memory error handlingChen, Gong
Cleanup the logic in ghes_handle_memory_failure(). While at it, add proper PFN validity check for UC error and cleanup the code logic to make it simpler and cleaner. Signed-off-by: Chen, Gong <gong.chen@linux.intel.com> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1385363701-12387-2-git-send-email-gong.chen@linux.intel.com [ Boris: massage commit message. ] Signed-off-by: Borislav Petkov <bp@suse.de>
2013-12-21ACPI, APEI, GHES: Do not report only correctable errors with SCIChen, Gong
Currently SCI is employed to handle corrected errors - memory corrected errors, more specifically but in fact SCI still can be used to handle any errors, e.g. uncorrected or even fatal ones if enabled by the BIOS. Enable logging for those kinds of errors too. Signed-off-by: Chen, Gong <gong.chen@linux.intel.com> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/1385363701-12387-1-git-send-email-gong.chen@linux.intel.com [ Boris: massage commit message, rename function arg. ] Signed-off-by: Borislav Petkov <bp@suse.de>
2013-12-07ACPI / i915: Fix incorrect <acpi/acpi.h> inclusions via <linux/acpi_io.h>Lv Zheng
To avoid build problems and breaking dependencies between ACPI header files, <acpi/acpi.h> should not be included directly by code outside of the ACPI core subsystem. However, that is possible if <linux/acpi_io.h> is included, because that file contains a direct inclusion of <acpi/acpi.h>. For this reason, remove the direct <acpi/acpi.h> inclusion from <linux/acpi_io.h>, move that file from include/linux/ to include/acpi/ and make <linux/acpi.h> include it for CONFIG_ACPI set along with the other ACPI header files. Accordingly, Remove the inclusions of <linux/acpi_io.h> from everywhere. Of course, that causes the contents of the new <acpi/acpi_io.h> file to be available for CONFIG_ACPI set only, so intel_opregion.o that depends on it should also depend on CONFIG_ACPI (and it really should not be compiled for CONFIG_ACPI unset anyway). References: https://01.org/linuxgraphics/sites/default/files/documentation/acpi_igd_opregion_spec.pdf Cc: Matthew Garrett <mjg59@srcf.ucam.org> Signed-off-by: Lv Zheng <lv.zheng@intel.com> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> [rjw: Subject and changelog] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-10-23ACPI, APEI, CPER: Add UEFI 2.4 support for memory errorChen, Gong
In latest UEFI spec(by now it is 2.4) memory error definition for CPER (UEFI 2.4 Appendix N Common Platform Error Record) adds some new fields. These fields help people to locate memory error to an actual DIMM location. Original-author: Tony Luck <tony.luck@intel.com> Signed-off-by: Chen, Gong <gong.chen@linux.intel.com> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <m.chehab@samsung.com> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-10-21ACPI, CPER: Update cper infoChen, Gong
We have a lot of confusing names of functions and data structures in amongs the the error reporting code. In particular the "apei" prefix has been applied to many objects that are not part of APEI. Since we will be using these routines for extended error log reporting it will be clearer if we fix up the names first. Signed-off-by: Chen, Gong <gong.chen@linux.intel.com> Acked-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <m.chehab@samsung.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-08-12Merge branch 'x86/mce' into x86/rasIngo Molnar
Pursue a single RAS/MCE topic branch on x86. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-07-10mce: acpi/apei: Soft-offline a page on firmware GHES notificationNaveen N. Rao
If the firmware indicates in GHES error data entry that the error threshold has exceeded for a corrected error event, then we try to soft-offline the page. This could be called in interrupt context, so we queue this up similar to how we handle memory failure scenarios. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-07-03Merge tag 'pci-v3.11-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI changes from Bjorn Helgaas: "PCI device hotplug - Add pci_alloc_dev() interface (Gu Zheng) - Add pci_bus_get()/put() for reference counting (Jiang Liu) - Fix SR-IOV reference count issues (Jiang Liu) - Remove unused acpi_pci_roots list (Jiang Liu) MSI - Conserve interrupt resources on x86 (Alexander Gordeev) AER - Force fatal severity when component has been reset (Betty Dall) - Reset link below Root Port as well as Downstream Port (Betty Dall) - Fix "Firmware first" flag setting (Bjorn Helgaas) - Don't parse HEST for non-PCIe devices (Bjorn Helgaas) ASPM - Warn when we can't disable ASPM as driver requests (Bjorn Helgaas) Miscellaneous - Add CircuitCo PCI IDs (Darren Hart) - Add AMD CZ SATA and SMBus PCI IDs (Shane Huang) - Work around Ivytown NTB BAR size issue (Jon Mason) - Detect invalid initial BAR values (Kevin Hao) - Add pcibios_release_device() (Sebastian Ott) - Fix powerpc & sparc PCI_UNKNOWN power state usage (Bjorn Helgaas)" * tag 'pci-v3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (51 commits) MAINTAINERS: Add ACPI folks for ACPI-related things under drivers/pci PCI: Add CircuitCo vendor ID and subsystem ID PCI: Use pdev->pm_cap instead of pci_find_capability(..,PCI_CAP_ID_PM) PCI: Return early on allocation failures to unindent mainline code PCI: Simplify IOV implementation and fix reference count races PCI: Drop redundant setting of bus->is_added in virtfn_add_bus() unicore32/PCI: Remove redundant call of pci_bus_add_devices() m68k/PCI: Remove redundant call of pci_bus_add_devices() PCI / ACPI / PM: Use correct power state strings in messages PCI: Fix comment typo for pcie_pme_remove() PCI: Rename pci_release_bus_bridge_dev() to pci_release_host_bridge_dev() PCI: Fix refcount issue in pci_create_root_bus() error recovery path ia64/PCI: Clean up pci_scan_root_bus() usage PCI/AER: Reset link for devices below Root Port or Downstream Port ACPI / APEI: Force fatal AER severity when component has been reset PCI/AER: Remove "extern" from function declarations PCI/AER: Move AER severity defines to aer.h PCI/AER: Set dev->__aer_firmware_first only for matching devices PCI/AER: Factor out HEST device type matching PCI/AER: Don't parse HEST table for non-PCIe devices ...
2013-06-07Merge branch 'acpi-fixes'Rafael J. Wysocki
* acpi-fixes: ACPI / PM: Do not execute _PS0 for devices without _PSC during initialization ACPI / scan: do not match drivers against objects having scan handlers ACPI / APEI: fix error return code in ghes_probe() ACPI / video: ignore BIOS initial backlight value for HP Pavilion g6 ACPI / video: ignore BIOS initial backlight value for HP m4 x86 / platform / hp_wmi: Fix bluetooth_rfkill misuse in hp_wmi_rfkill_setup()
2013-06-06ACPI / APEI: Force fatal AER severity when component has been resetBetty Dall
The CPER error record has a reset bit that indicates that the platform has reset the component. The reset bit can be set for any severity error including recoverable. From the AER code path's perspective, any error is fatal if the component has been reset. This patch upgrades the severity of the AER recovery to AER_FATAL whenever the CPER error record indicates that the component has been reset. [bhelgaas: s/bus has been reset/component has been reset/] Signed-off-by: Betty Dall <betty.dall@hp.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2013-06-05ACPI / APEI: fix error return code in ghes_probe()Wei Yongjun
Fix to return a negative error code in the acpi_gsi_to_irq() and request_irq() error handling case instead of 0, as done elsewhere in this function. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Reviewed-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-05-30aerdrv: Move cper_print_aer() call out of interrupt contextLance Ortiz
The following warning was seen on 3.9 when a corrected PCIe error was being handled by the AER subsystem. WARNING: at .../drivers/pci/search.c:214 pci_get_dev_by_id+0x8a/0x90() This occurred because a call to pci_get_domain_bus_and_slot() was added to cper_print_pcie() to setup for the call to cper_print_aer(). The warning showed up because cper_print_pcie() is called in an interrupt context and pci_get* functions are not supposed to be called in that context. The solution is to move the cper_print_aer() call out of the interrupt context and into aer_recover_work_func() to avoid any warnings when calling pci_get* functions. Signed-off-by: Lance Ortiz <lance.ortiz@hp.com> Acked-by: Borislav Petkov <bp@suse.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-02-25ghes: add the needed hooks for EDAC error reportMauro Carvalho Chehab
In order to allow reporting errors via EDAC, add hooks for: 1) register an EDAC driver; 2) unregister an EDAC driver; 3) report errors via EDAC. As the EDAC driver will need to access the ghes structure, adds it as one of the parameters for ghes_do_proc. Acked-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-21ghes: move structures/enum to a header fileMauro Carvalho Chehab
As a ghes_edac driver will need to access ghes structures, in order to properly handle the errors, move those structures to a separate header file. No functional changes. Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2012-12-11Merge tag 'driver-core-3.8-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg Kroah-Hartman: "Here's the large driver core updates for 3.8-rc1. The biggest thing here is the various __dev* marking removals. This is going to be a pain for the merge with different subsystem trees, I know, but all of the patches included here have been ACKed by their various subsystem maintainers, as they wanted them to go through here. If this is too much of a pain, I can pull all of them out of this tree and just send you one with the other fixes/updates and then, after 3.8-rc1 is out, do the rest of the removals to ensure we catch them all, it's up to you. The merges should all be trivial, and Stephen has been doing them all in linux-next for a few weeks now quite easily. Other than the __dev* marking removals, there's nothing major here, some firmware loading updates and other minor things in the driver core. All of these have (much to Stephen's annoyance), been in linux-next for a while. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" Fixed up trivial conflicts in drivers/gpio/gpio-{em,stmpe}.c due to gpio update. * tag 'driver-core-3.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (93 commits) modpost.c: Stop checking __dev* section mismatches init.h: Remove __dev* sections from the kernel acpi: remove use of __devinit PCI: Remove __dev* markings PCI: Always build setup-bus when PCI is enabled PCI: Move pci_uevent into pci-driver.c PCI: Remove CONFIG_HOTPLUG ifdefs unicore32/PCI: Remove CONFIG_HOTPLUG ifdefs sh/PCI: Remove CONFIG_HOTPLUG ifdefs powerpc/PCI: Remove CONFIG_HOTPLUG ifdefs mips/PCI: Remove CONFIG_HOTPLUG ifdefs microblaze/PCI: Remove CONFIG_HOTPLUG ifdefs dma: remove use of __devinit dma: remove use of __devexit_p firewire: remove use of __devinitdata firewire: remove use of __devinit leds: remove use of __devexit leds: remove use of __devinit leds: remove use of __devexit_p mmc: remove use of __devexit ...
2012-11-28acpi: remove use of __devinitBill Pemberton
CONFIG_HOTPLUG is going away as an option so __devinit is no longer needed. Signed-off-by: Bill Pemberton <wfp5p@virginia.edu> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-21ACPI: remove use of __devexitBill Pemberton
CONFIG_HOTPLUG is going away as an option so __devexit is no longer needed. Signed-off-by: Bill Pemberton <wfp5p@virginia.edu> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2012-06-12ACPI, APEI, Avoid too much error reporting in runtimeHuang Ying
This patch fixed the following bug. https://bugzilla.kernel.org/show_bug.cgi?id=43282 This is caused by a firmware bug checking (checking generic address register provided by firmware) in runtime. The checking should be done in address mapping time instead of runtime to avoid too much error reporting in runtime. Reported-by: Pawel Sikora <pluto@agmk.net> Signed-off-by: Huang Ying <ying.huang@intel.com> Tested-by: Jean Delvare <khali@linux-fr.org> Cc: stable@vger.kernel.org Signed-off-by: Len Brown <len.brown@intel.com>
2012-01-18Merge branch 'release' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux This includes initial support for the recently published ACPI 5.0 spec. In particular, support for the "hardware-reduced" bit that eliminates the dependency on legacy hardware. APEI has patches resulting from testing on real hardware. Plus other random fixes. * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: (52 commits) acpi/apei/einj: Add extensions to EINJ from rev 5.0 of acpi spec intel_idle: Split up and provide per CPU initialization func ACPI processor: Remove unneeded variable passed by acpi_processor_hotadd_init V2 ACPI processor: Remove unneeded cpuidle_unregister_driver call intel idle: Make idle driver more robust intel_idle: Fix a cast to pointer from integer of different size warning in intel_idle ACPI: kernel-parameters.txt : Add intel_idle.max_cstate intel_idle: remove redundant local_irq_disable() call ACPI processor: Fix error path, also remove sysdev link ACPI: processor: fix acpi_get_cpuid for UP processor intel_idle: fix API misuse ACPI APEI: Convert atomicio routines ACPI: Export interfaces for ioremapping/iounmapping ACPI registers ACPI: Fix possible alignment issues with GAS 'address' references ACPI, ia64: Use SRAT table rev to use 8bit or 16/32bit PXM fields (ia64) ACPI, x86: Use SRAT table rev to use 8bit or 32bit PXM fields (x86/x86-64) ACPI: Store SRAT table revision ACPI, APEI, Resolve false conflict between ACPI NVS and APEI ACPI, Record ACPI NVS regions ACPI, APEI, EINJ, Refine the fix of resource conflict ...
2012-01-17ACPI APEI: Convert atomicio routinesMyron Stowe
APEI needs memory access in interrupt context. The obvious choice is acpi_read(), but originally it couldn't be used in interrupt context because it makes temporary mappings with ioremap(). Therefore, we added drivers/acpi/atomicio.c, which provides: acpi_pre_map_gar() -- ioremap in process context acpi_atomic_read() -- memory access in interrupt context acpi_post_unmap_gar() -- iounmap Later we added acpi_os_map_generic_address() (2971852) and enhanced acpi_read() so it works in interrupt context as long as the address has been previously mapped (620242a). Now this sequence: acpi_os_map_generic_address() -- ioremap in process context acpi_read()/apei_read() -- now OK in interrupt context acpi_os_unmap_generic_address() is equivalent to what atomicio.c provides. This patch introduces apei_read() and apei_write(), which currently are functional equivalents of acpi_read() and acpi_write(). This is mainly proactive, to prevent APEI breakages if acpi_read() and acpi_write() are ever augmented to support the 'bit_offset' field of GAS, as APEI's __apei_exec_write_register() precludes splitting up functionality related to 'bit_offset' and APEI's 'mask' (see its APEI_EXEC_PRESERVE_REGISTER block). With apei_read() and apei_write() in place, usages of atomicio routines are converted to apei_read()/apei_write() and existing calls within osl.c and the CA, based on the re-factoring that was done in an earlier patch series - http://marc.info/?l=linux-acpi&m=128769263327206&w=2: acpi_pre_map_gar() --> acpi_os_map_generic_address() acpi_post_unmap_gar() --> acpi_os_unmap_generic_address() acpi_atomic_read() --> apei_read() acpi_atomic_write() --> apei_write() Note that acpi_read() and acpi_write() currently use 'bit_width' for accessing GARs which seems incorrect. 'bit_width' is the size of the register, while 'access_width' is the size of the access the processor must generate on the bus. The 'access_width' may be larger, for example, if the hardware only supports 32-bit or 64-bit reads. I wanted to minimize any possible impacts with this patch series so I did *not* change this behavior. Signed-off-by: Myron Stowe <myron.stowe@redhat.com> Signed-off-by: Len Brown <len.brown@intel.com>
2012-01-17ACPI, APEI, Printk queued error record before panicHuang Ying
Because printk is not safe inside NMI handler, the recoverable error records received in NMI handler will be queued to be printked in a delayed IRQ context via irq_work. If a fatal error occurs after the recoverable error and before the irq_work processed, we lost a error report. To solve the issue, the queued error records are printked in NMI handler if system will go panic. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2012-01-17ACPI, APEI, GHES, Distinguish interleaved error report in kernel logHuang Ying
In most cases, printk only guarantees messages from different printk calling will not be interleaved between each other. But, one APEI GHES hardware error report will involve multiple printk calling, normally each for one line. So it is possible that the hardware error report comes from different generic hardware error source will be interleaved. In this patch, a sequence number is prefixed to each line of error report. So that, even if they are interleaved, they still can be distinguished by the prefixed sequence number. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2012-01-17ACPI, APEI, GHES: Add PCIe AER recovery supportHuang Ying
aer_recover_queue() is called when recoverable PCIe AER errors are notified by firmware to do the recovery work. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2012-01-13module_param: make bool parameters really bool (drivers & misc)Rusty Russell
module_param(bool) used to counter-intuitively take an int. In fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy trick. It's time to remove the int/unsigned int option. For this version it'll simply give a warning, but it'll break next kernel version. Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2011-10-10x86, nmi: Wire up NMI handlers to new routinesDon Zickus
Just convert all the files that have an nmi handler to the new routines. Most of it is straight forward conversion. A couple of places needed some tweaking like kgdb which separates the debug notifier from the nmi handler and mce removes a call to notify_die. [Thanks to Ying for finding out the history behind that mce call https://lkml.org/lkml/2010/5/27/114 And Boris responding that he would like to remove that call because of it https://lkml.org/lkml/2011/9/21/163] The things that get converted are the registeration/unregistration routines and the nmi handler itself has its args changed along with code removal to check which list it is on (most are on one NMI list except for kgdb which has both an NMI routine and an NMI Unknown routine). Signed-off-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Corey Minyard <minyard@acm.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Corey Minyard <minyard@acm.org> Cc: Jack Steiner <steiner@sgi.com> Link: http://lkml.kernel.org/r/1317409584-23662-4-git-send-email-dzickus@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-03APEI GHES: 32-bit buildfixLen Brown
drivers/acpi/apei/ghes.c:542: warning: integer overflow in expression drivers/acpi/apei/ghes.c:619: warning: integer overflow in expression ghes.c:(.text+0x46289): undefined reference to `__udivdi3'   in function ghes_estatus_cache_add(). Reported-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Len Brown <len.brown@intel.com>
2011-08-03ACPI, APEI, GHES: Add hardware memory error recovery supportHuang Ying
memory_failure_queue() is called when recoverable memory errors are notified by firmware to do the recovery work. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2011-08-03ACPI, APEI, GHES, Error records content based throttleHuang Ying
printk is used by GHES to report hardware errors. Ratelimit is enforced on the printk to avoid too many hardware error reports in kernel log. Because there may be thousands or even millions of corrected hardware errors during system running. Currently, a simple scheme is used. That is, the total number of hardware error reporting is ratelimited. This may cause some issues in practice. For example, there are two kinds of hardware errors occurred in system. One is corrected memory error, because the fault memory address is accessed frequently, there may be hundreds error report per-second. The other is corrected PCIe AER error, it will be reported once per-second. Because they share one ratelimit control structure, it is highly possible that only memory error is reported. To avoid the above issue, an error record content based throttle algorithm is implemented in the patch. Where after the first successful reporting, all error records that are same are throttled for some time, to let other kinds of error records have the opportunity to be reported. In above example, the memory errors will be throttled for some time, after being printked. Then the PCIe AER error will be printked successfully. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2011-08-03ACPI, APEI, GHES, printk support for recoverable error via NMIHuang Ying
Some APEI GHES recoverable errors are reported via NMI, but printk is not safe in NMI context. To solve the issue, a lock-less memory allocator is used to allocate memory in NMI handler, save the error record into the allocated memory, put the error record into a lock-less list. On the other hand, an irq_work is used to delay the operation from NMI context to IRQ context. The irq_work IRQ handler will remove nodes from lock-less list, printk the error record and do some further processing include recovery operation, then free the memory. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2011-07-13ACPI, APEI, Add WHEA _OSC supportHuang Ying
APEI firmware first mode must be turned on explicitly on some machines, otherwise there may be no GHES hardware error record for hardware error notification. APEI bit in generic _OSC call can be used to do that, but on some machine, a special WHEA _OSC call must be used. This patch adds the support to that WHEA _OSC call. Signed-off-by: Huang Ying <ying.huang@intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Len Brown <len.brown@intel.com>
2011-07-13ACPI, APEI, GHES, Support disable GHES at boot timeHuang Ying
Some machine may have broken firmware so that GHES and firmware first mode should be disabled. This patch adds support to that. Signed-off-by: Huang Ying <ying.huang@intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Len Brown <len.brown@intel.com>
2011-07-13ACPI, APEI, GHES, Do not ratelimit fatal error printk before panicHuang Ying
printk is used by GHES to report hardware errors. Normally, the printk will be ratelimited to avoid too many hardware error reports in kernel log. Because there may be thousands or even millions of corrected hardware errors during system running. That is different for fatal hardware error, because system will go panic as soon as possible, there will be no more than several error records. And these error records are valuable for system fault diagnosis, so they should not be ratelimited. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2011-03-31Fix common misspellingsLucas De Marchi
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-01-12ACPI, APEI, Generic Hardware Error Source POLL/IRQ/NMI notification type supportHuang Ying
Generic Hardware Error Source provides a way to report platform hardware errors (such as that from chipset). It works in so called "Firmware First" mode, that is, hardware errors are reported to firmware firstly, then reported to Linux by firmware. This way, some non-standard hardware error registers or non-standard hardware link can be checked by firmware to produce more valuable hardware error information for Linux. This patch adds POLL/IRQ/NMI notification types support. Because the memory area used to transfer hardware error information from BIOS to Linux can be determined only in NMI, IRQ or timer handler, but general ioremap can not be used in atomic context, so a special version of atomic ioremap is implemented for that. Known issue: - Error information can not be printed for recoverable errors notified via NMI, because printk is not NMI-safe. Will fix this via delay printing to IRQ context via irq_work or make printk NMI-safe. v2: - adjust printk format per comments. Signed-off-by: Huang Ying <ying.huang@intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2010-12-13ACPI, APEI, Report GHES error information via printkHuang Ying
printk is one of the methods to report hardware errors to user space. This patch implements hardware error reporting for GHES via printk. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2010-09-29ACPI, APEI, HEST Fix the unsuitable usage of platform_dataJin Dongming
platform_data in hest_parse_ghes() is used for saving the address of entry information of erst_tab. When the device is failed to be added, platform_data will be freed by platform_device_put(). But the value saved in platform_data should not be freed here. If it is done, it will make system panic. So I think platform_data should save the address of allocated memory which saves entry information of erst_tab. This patch fixed it and I confirmed it on x86_64 next-tree. v2: Transport the pointer of hest_hdr to platform_data using platform_device_add_data() Signed-off-by: Jin Dongming <jin.dongming@np.css.fujitsu.com> Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2010-08-08ACPI, APEI, Manage GHES as platform devicesHuang Ying
Register GHES during HEST initialization as platform devices. And make GHES driver into platform device driver. So that the GHES driver module can be loaded automatically when there are GHES available. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2010-08-08ACPI, APEI, Rename CPER and GHES severity constantsHuang Ying
The abbreviation of severity should be SEV instead of SER, so the CPER severity constants are renamed accordingly. GHES severity constants are renamed in the same way too. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2010-05-19ACPI, APEI, Generic Hardware Error Source memory error supportHuang Ying
Generic Hardware Error Source provides a way to report platform hardware errors (such as that from chipset). It works in so called "Firmware First" mode, that is, hardware errors are reported to firmware firstly, then reported to Linux by firmware. This way, some non-standard hardware error registers or non-standard hardware link can be checked by firmware to produce more valuable hardware error information for Linux. Now, only SCI notification type and memory errors are supported. More notification type and hardware error type will be added later. These memory errors are reported to user space through /dev/mcelog via faking a corrected Machine Check, so that the error memory page can be offlined by /sbin/mcelog if the error count for one page is beyond the threshold. On some machines, Machine Check can not report physical address for some corrected memory errors, but GHES can do that. So this simplified GHES is implemented firstly. Signed-off-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>