Hi,
we have a server (Supermicro X7DBE+ board with 16 GB ECC RAM) that is running
Linux (Debian Etch). Since we upgraded to a kernel newer than 2.6.18, we see
this machine check exception:
HARDWARE ERROR
CPU 1: Machine Check Exception: 0 Bank 5: 1000001004000e0f
TSC 0
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Machine check
Decoded with mcelog:
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 BANK 5 MCG status:
MCi status:
Invalid log
BQ_DCU_READ_TYPE BQ_ERR_AERR2_TYPE BQ_ERR_AERR2_TYPE response parity error
STATUS 1000001004000e0f MCGSTATUS 0
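
For reference, the STATUS word can also be split into its fields by hand. This is only a minimal Python sketch, assuming the architectural IA32_MCi_STATUS layout as I understand it from the Intel manuals (flag bits 63..57, model-specific code in bits 31..16, MCA error code in bits 15..0). The names decode_mci_status and FLAG_BITS are made up for illustration and are not part of mcelog:

```python
# Assumed architectural flag bits of IA32_MCi_STATUS (not taken from mcelog).
FLAG_BITS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MISC register valid
    58: "ADDRV",  # ADDR register valid
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error codes."""
    flags = {name: bool((status >> bit) & 1) for bit, name in FLAG_BITS.items()}
    mca_code = status & 0xFFFF
    return {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": mca_code,
        # Assumption: compound codes of the form 0000 1PPT RRRR IILL
        # denote bus/interconnect errors.
        "is_bus_interconnect": (mca_code & 0xF800) == 0x0800,
    }


if __name__ == "__main__":
    result = decode_mci_status(0x1000001004000E0F)
    for name, is_set in result["flags"].items():
        print(f"{name:5s} = {int(is_set)}")
    print(f"model-specific code = {result['model_specific_code']:#06x}")
    print(f"MCA error code      = {result['mca_error_code']:#06x}")
    print(f"bus/interconnect    = {result['is_bus_interconnect']}")
```

For the value above this gives VAL=0 (which matches the "Invalid log" line from mcelog), EN=1, and an MCA error code of 0x0e0f, which under the stated assumption falls into the bus/interconnect class.
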
memtest runs without problems for 24+ hours. We updated the BIOS and swapped
the DIMMs between banks 1<->2 and 3<->4. Still the same MCE.
I found out that the BIOS setting for PCI-e I/O performance can sometimes lead
to MCEs, so I changed the corresponding setting:
PCI Configuration -> PCI-e I/O performance -> Coalesce
This didn't help either.
Any idea which part of the server (RAM, CPU, Bus) might be the cause of the
problem?
Ralf