summaryrefslogtreecommitdiffstats
path: root/drivers/edac/skx_common.c
Commit message (Collapse)AuthorAgeFilesLines
* EDAC/i10nm: Add Intel Sapphire Rapids server supportQiuxu Zhuo2020-11-191-5/+18
| | | | | | | | | | | | | | The Sapphire Rapids CPU model shares the same memory controller architecture with Ice Lake server. There are some configurations different from Ice Lake server as below: - The device ID for configuration agent. - The size for per channel memory-mapped I/O. - The DDR5 memory support. So add the above configurations and the Sapphire Rapids CPU model ID for EDAC support. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* EDAC/{i7core,sb,pnd2,skx}: Fix error event severityTony Luck2020-08-181-2/+2
| | | | | | | | | | | | | | | | | | IA32_MCG_STATUS.RIPV indicates whether the return RIP value pushed onto the stack as part of machine check delivery is valid or not. Various drivers copied a code fragment that uses the RIPV bit to determine the severity of the error as either HW_EVENT_ERR_UNCORRECTED or HW_EVENT_ERR_FATAL, but this check is reversed (marking errors where RIPV is set as "FATAL"). Reverse the tests so that the error is marked fatal when RIPV is not set. Reported-by: Gabriele Paoloni <gabriele.paoloni@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200707194324.14884-1-tony.luck@intel.com
* Merge branch 'x86/entry' into ras/coreThomas Gleixner2020-06-111-9/+8
|\ | | | | | | | | | | to fixup conflicts in arch/x86/kernel/cpu/mce/core.c so MCE specific follow up patches can be applied without creating a horrible merge conflict afterwards.
| *-. Merge branches 'edac-i10nm' and 'edac-misc' into edac-updates-for-5.8Borislav Petkov2020-06-011-9/+8
| |\ \ | | | | | | | | | | | | Signed-off-by: Borislav Petkov <bp@suse.de>
| | | * EDAC/skx: Use the mcmtr register to retrieve close_pg/bank_xor_enableQiuxu Zhuo2020-05-191-3/+3
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The skx_edac driver wrongly uses the mtr register to retrieve two fields close_pg and bank_xor_enable. Fix it by using the correct mcmtr register to get the two fields. Cc: <stable@vger.kernel.org> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reported-by: Matthew Riley <mattdr@google.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20200515210146.1337-1-tony.luck@intel.com
| | * EDAC, {skx,i10nm}: Make some configurations CPU model specificQiuxu Zhuo2020-04-271-6/+5
| |/ | | | | | | | | | | | | | | | | | | | | | | The device ID for configuration agent PCI device and the offset for bus number configuration register can be CPU model specific. So add a new structure res_config to make them configurable and pass res_config to {skx,i10nm}_init() and skx_get_all_bus_mappings() for use. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20200427083246.GB11036@zn.tnic
* | EDAC: Drop the EDAC report status checksTony Luck2020-04-141-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When acpi_extlog was added, we were worried that the same error would be reported more than once by different subsystems. But in the ensuing years I've seen complaints that people could not find an error log (because this mechanism suppressed the log they were looking for). Rip it all out. People are smart enough to notice the same address from different reporting mechanisms. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200214222720.13168-8-tony.luck@intel.com
* | x86/mce: Fix all mce notifiers to update the mce->kflags bitmaskTony Luck2020-04-141-0/+4
|/ | | | | | | | | | | | | | If the handler took any action to log or deal with the error, set a bit in mce->kflags so that the default handler on the end of the machine check chain can see what has been done. Get rid of NOTIFY_STOP returns. Make the EDAC and dev-mcelog handlers skip over errors already processed by CEC. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200214222720.13168-5-tony.luck@intel.com
* EDAC: skx_common: downgrade message importance on missing PCI deviceAristeu Rozanski2019-12-101-1/+1
| | | | | | | | | | | | | | | Both skx_edac and i10nm_edac drivers are loaded based on the matching CPU being available which leads the module to be automatically loaded in virtual machines as well. That will fail due the missing PCI devices. In both drivers the first function to make use of the PCI devices is skx_get_hi_lo() will simply print EDAC skx: Can't get tolm/tohm for each CPU core, which is noisy. This patch makes it a debug message. Signed-off-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20191204212325.c4k47p5hrnn3vpb5@redhat.com
* EDAC, skx: Retrieve and print retry_rd_err_log registersTony Luck2019-10-181-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Skylake logs some additional useful information in per-channel registers in addition the the architectural status/addr/misc logged in the machine check bank. Pick up this information and add it to the EDAC log: retry_rd_err_[five 32-bit register values] Sorry, no definitions for these registers. OEMs and DIMM vendors will be able to use them to isolate which cells in the DIMM are causing problems. correrrcnt[per rank corrected error counts] Note that if additional errors are logged while these registers are being read, you may see a jumble of values some from earlier errors, others from later errors (since the registers report the most recent logged error). The correrrcnt registers provide error counts per possible rank. If these counts only change by one since the previous error logged for this channel, then it is safe to assume that the registers logged provide a coherent view of one error. With this change EDAC logs look like this: EDAC MC4: 1 CE memory read error on CPU_SrcID#2_MC#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8f26018 offset:0x0 grain:32 syndrome:0x0 - err_code:0x0101:0x0091 socket:2 imc:0 rank:0 bg:0 ba:0 row:0x1f880 col:0x200 retry_rd_err_log[0001a209 00000000 00000001 04800001 0001f880] correrrcnt[0001 0000 0000 0000 0000 0000 0000 0000]) Acked-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* EDAC, skx_common: Refactor so that we initialize "dev" in result of adxl decode.Tony Luck2019-10-181-25/+23
| | | | | | | Simplifies the code a little. Acked-by: Aristeu Rozanski <aris@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* EDAC: skx_common: get rid of unused type varMauro Carvalho Chehab2019-09-301-4/+1
| | | | | | | | | | | drivers/edac/skx_common.c: In function ‘skx_mce_output_error’: drivers/edac/skx_common.c:478:8: warning: variable ‘type’ set but not used [-Wunused-but-set-variable] 478 | char *type, *optype; | ^~~~ Acked-by: Borislav Petkov <bp@alien8.de> Acked-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
* EDAC, skx, i10nm: Fix source ID register offsetQiuxu Zhuo2019-06-261-2/+2
| | | | | | | | The source ID register offset for Skylake server is 0xf0, while for Icelake server is 0xf8. Pass the correct offset to get the source ID. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* EDAC, skx, i10nm: Make skx_common.c a pure libraryQiuxu Zhuo2019-03-231-47/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following Kconfig constellations fail randconfig builds: CONFIG_ACPI_NFIT=y CONFIG_EDAC_DEBUG=y CONFIG_EDAC_SKX=m CONFIG_EDAC_I10NM=y or CONFIG_ACPI_NFIT=y CONFIG_EDAC_DEBUG=y CONFIG_EDAC_SKX=y CONFIG_EDAC_I10NM=m with: ... CC [M] drivers/edac/skx_common.o ... .../skx_common.o:.../skx_common.c:672: undefined reference to `__this_module' That is because if one of the two drivers - skx_edac or i10nm_edac - is built-in and the other one is a module, the shared file skx_common.c gets linked into a module object by kbuild. Therefore, when linking that same file into vmlinux, the '__this_module' symbol used in debugfs isn't defined, leading to the above error. Fix it by moving all debugfs code from skx_common.c to both skx_base.c and i10nm_base.c respectively. Thus, skx_common.c doesn't refer to the '__this_module' symbol anymore. Clarify skx_common.c's purpose at the top of the file for future reference, while at it. [ bp: Make text more readable. ] Fixes: d4dc89d069aa ("EDAC, i10nm: Add a driver for Intel 10nm server processors") Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: linux-edac <linux-edac@vger.kernel.org> Link: https://lkml.kernel.org/r/20190321221339.GA32323@agluck-desk
* EDAC, skx_common: Add code to recognise new compound error codeTony Luck2019-02-061-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | A new error code for systems that use DRAM as an extra level of cache looks like: 000F 0010 1MMM CCCC where the MMM and CCCC bits are used for the same purpose as the original code. For this new class of errors the ADXL translation will provide details of both the DIMM used as cache for the error location and the component that is being cached. Note: This new error code is first supported in Skylake. Older EDAC drivers do not need to be updated. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Aristeu Rozanski <aris@redhat.com> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: https://lkml.kernel.org/r/20190205182109.27828-1-tony.luck@intel.com
* EDAC, skx_common: Separate common code out from skx_edacQiuxu Zhuo2019-02-021-0/+689
Parts of skx_edac can be shared with the Intel 10nm server EDAC driver. Carve out the common parts from skx_edac in preparation to support both skx_edac driver and i10nm_edac drivers. Co-developed-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: linux-edac <linux-edac@vger.kernel.org> Link: https://lkml.kernel.org/r/20190130191519.15393-3-tony.luck@intel.com