mcelog: CPU 6 on socket 1 received Bus and Interconnect Errors in Other-transaction

Michael Hirmke
Hi *,

on my main server after a cold boot I see the following messages in my
journal:

...
kernel: mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 20: c8012a4000200e0f
kernel: mce: [Hardware Error]: TSC 0 mce: MISC 800000 mce:
kernel: mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1508767394 SOCKET 1 APIC 10 microcode 36
mcelog[2635]: CPU 6 on socket 1 received Bus and Interconnect Errors in Other-transaction
mcelog[2636]: Location: CPU 6 on socket 1
...
systemd[1]: Starting Machine Check Exception Logging Daemon...
systemd[1]: Started Machine Check Exception Logging Daemon.
mcelog[2628]: Hardware event. This is not a software error.
mcelog[2628]: MCE 0
mcelog[2628]: CPU 6 BANK 20
mcelog[2628]: MISC 800000
mcelog[2628]: TIME 1508767394 Mon Oct 23 16:03:14 2017
mcelog[2628]: MCG status:
mcelog[2628]: MCi status:
mcelog[2628]: Error overflow
mcelog[2628]: Corrected error
mcelog[2628]: MCi_MISC register valid
mcelog[2628]: MCA: BUS error: 1 6 Level-3 Generic Generic Other-transaction Request-did-not-timeout
mcelog[2628]: Running trigger `bus-error-trigger'
mcelog[2628]: QPI:
mcelog[2628]: Intel QPI physical layer detected a QPI in-band reset but aborted initialization
mcelog[2628]: STATUS c8012a4000200e0f MCGSTATUS 0
mcelog[2628]: MCGCAP 7000c16 APICID 10 SOCKETID 1
mcelog[2628]: CPUID Vendor Intel Family 6 Model 63
mcelog[2628]: <27>Oct 23 16:04:47 mcelog: CPU 6 on socket 1 received Bus and Interconnect Errors in Other-transaction
mcelog[2628]: <27>Oct 23 16:04:47 mcelog: Location: CPU 6 on socket 1
...
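
For anyone who wants to check the STATUS value c8012a4000200e0f by hand, the following is a minimal sketch in Python. It assumes the architectural IA32_MCi_STATUS layout and the "bus and interconnect" error-code format (0000 1PPT RRRR IILL) described in the machine-check chapter of the Intel SDM. The helper names decode_status and decode_bus_error are made up for this example, and it does not attempt the model-specific QPI decoding that mcelog performs.

```python
# Sketch only: field layout as described in the Intel SDM machine-check chapter.
# decode_status() and decode_bus_error() are hypothetical helper names.

FLAG_BITS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
             59: "MISCV", 58: "ADDRV", 57: "PCC"}

PARTICIPATION = ["SRC", "RES", "OBS", "generic"]
REQUEST = {0: "generic error", 1: "generic read", 2: "generic write",
           3: "data read", 4: "data write", 5: "instruction fetch",
           6: "prefetch", 7: "eviction", 8: "snoop"}
MEM_IO = ["memory", "reserved", "I/O", "other transaction"]
LEVEL = ["L0", "L1", "L2", "generic"]


def decode_bus_error(code):
    """Decode a bus/interconnect code of the form 0000 1PPT RRRR IILL."""
    if (code >> 11) != 0b00001:
        return None  # some other class of MCA error code
    return {
        "participation": PARTICIPATION[(code >> 9) & 0x3],
        "timeout": bool((code >> 8) & 0x1),
        "request": REQUEST.get((code >> 4) & 0xF, "reserved"),
        "memory_or_io": MEM_IO[(code >> 2) & 0x3],
        "level": LEVEL[code & 0x3],
    }


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error codes."""
    flags = [name for bit, name in sorted(FLAG_BITS.items(), reverse=True)
             if (status >> bit) & 1]
    mca_code = status & 0xFFFF  # bits 15:0, architectural error code
    return {
        "flags": flags,
        "corrected": "UC" not in flags,  # UC clear: hardware corrected it
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_code": mca_code,
        "bus_error": decode_bus_error(mca_code),
    }


if __name__ == "__main__":
    for key, value in decode_status(0xC8012A4000200E0F).items():
        print(key, value)
```

For this value the set flags are VAL, OVER and MISCV with UC clear, which agrees with the "Error overflow", "Corrected error" and "MCi_MISC register valid" lines in the log. The low 16 bits decode to a bus error with "other transaction" and no timeout. The level field is 3, which mcelog prints as "Level-3" and which this sketch labels "generic".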

This is with kernel 4.4.90-28 on openSUSE Leap 42.3, but after checking
older journal entries I saw that it also happened with 4.4.87-25.
Machine specs:

- Supermicro X10DRi/X10DRi, BIOS 2.0 12/28/2015
- 2 x 6 core CPU Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
- 128 GB RAM
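
The "PROCESSOR 0:306f2" field in the kernel line can be cross-checked against these specs. As far as I understand the format, the part after the colon is the CPUID signature in hex. The sketch below unpacks it with the usual family/model/stepping rules from the CPUID documentation; the function name decode_cpuid_signature is only an illustration.

```python
# Sketch: unpack a CPUID signature (leaf 1, EAX) into family, model, stepping.
# decode_cpuid_signature() is a hypothetical helper name.

def decode_cpuid_signature(sig):
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only used for base family 0x6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


if __name__ == "__main__":
    family, model, stepping = decode_cpuid_signature(0x306F2)
    print(f"Family {family} Model {model} Stepping {stepping}")
```

For 0x306f2 this gives family 6, model 63, stepping 2, which matches the "CPUID Vendor Intel Family 6 Model 63" line that mcelog prints.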

The error does not happen when restarting the OS, only after a cold boot
of the machine.

I couldn't find appropriate information on the net.
Is CPU 1 damaged?
Can I do anything to correct the problem - or just ignore it?

--
Michael Hirmke

--
To unsubscribe, e-mail: [hidden email]
To contact the owner, e-mail: [hidden email]

Re: mcelog: CPU 6 on socket 1 received Bus and Interconnect Errors in Other-transaction

Chuck Davis
Hi Michael:

I got a Dell Precision 7720 laptop and got these mce messages.  Dell
support told me it was because the folks doing mce have not included
the machine chip in their recognized chips and that there is nothing
to worry about.  It just means mce does not recognize the chip and
when mce gets updated the messages will go away.

The laptop I got also has a Xeon chip.

HTH.

Re: mcelog: CPU 6 on socket 1 received Bus and Interconnect Errors in Other-transaction

Michael Hirmke
In reply to this post by Michael Hirmke
push

--
Michael Hirmke

Re: mcelog: CPU 6 on socket 1 received Bus and Interconnect Errors in Other-transaction

Anton Aylward-2
On 01/11/17 07:22 AM, Michael Hirmke wrote:
> push

a) dust
b) oxidation
c) bad caps

(In my case I'd have to add 'cat hair' and clean the fans, remember to use a
paper-clip.)

2/3 solution is 'pull'.

I suggest investing in one of those cans of compressed air with a nozzle.

While I sometimes have to pull memory and connectors, wipe the gold fingers with
an antistatic cloth, blow air into the connectors and replace all same, I rarely
have to replace capacitors.  I have more than enough low-end mobos from the
"Closet of Anxieties", but obviously that isn't the case with you.


PLEASE NOTE:

I'm not saying that this is an ultimate solution, and I'd be VERY reluctant to
'pull and polish' the CPUs, but this is a first line of wolf fencing of the
problem.

NEXT UP: quality of the PSU from a cold start.

Heck, it's getting cold and I turn the computer on before the forced air heating
has warmed the house ...

Time was I had a project in a portacabin.  We'd turn the heating on in the
portacabin and go get breakfast; come back an hour later.  Any earlier and the
electronics wouldn't turn on.

--
         A: Yes.
     >   Q: Are you sure?
     >>  A: Because it reverses the logical flow of conversation.
     >>> Q: Why is top posting frowned upon?


Re: mcelog: CPU 6 on socket 1 received Bus and Interconnect Errors in Other-transaction

Michael Hirmke
Hi Anton,

thx for your answer.

>On 01/11/17 07:22 AM, Michael Hirmke wrote:
>> push

>a) dust

I already cleaned everything as carefully as possible.

>b) oxidation

Phew, I'd have to disassemble fan and CPU to check that.
I'd prefer to do this as a last resort.
But of course you're right - this may be one reason.

>c) bad caps

Oops, changing caps is beyond my skills 8-<

[...]
>While I sometimes have to pull memory and connectors, wipe the gold fingers
>with an antistatic cloth, blow air into the connectors and replace all same,
>I rarely have to replace capacitors.  I have more than enough low-end mobos
>from the "Closet of Anxieties", but obviously that isn't the case with you.

Indeed - this Supermicro mobo is a high-end mobo.

>PLEASE NOTE:

>I'm not saying that this is an ultimate solution, and I'd be VERY reluctant
>to 'pull and polish' the CPUs, but this is a first line of wolf fencing of
>the problem.

You are right, but I'm not very eager to do that 8-<

>NEXT UP: quality of the PSU from a cold start.

>Heck, it's getting cold and I turn the computer on before the forced air
>heating has warmed the house ...

I don't think this is a problem here, because the machines run the
whole day, so everything is warm when one of them is switched off for a
short while and then switched back on.

[...]

Bye.
Michael.
--
Michael Hirmke

Re: mcelog: CPU 6 on socket 1 received Bus and Interconnect Errors in Other-transaction

Anton Aylward-2
On 01/11/17 08:03 AM, Michael Hirmke wrote:

> Hi Anton,
>
> thx for your answer.
>
>> On 01/11/17 07:22 AM, Michael Hirmke wrote:
>>> push
>
>> a) dust
>
> I already cleaned everything as carefully as possible.

+1

>
>> b) oxidation
>
> Phew, I'd have to disassemble fan and CPU to check that.

Yes/No/Maybe
Sometimes it's just a matter of unplugging what there is to be unplugged,
including the power leads to the mobo, and what contacts are accessible.  Wipe.
Blow air.

> I'd prefer to do this as a last resort.

I dread the thought of pulling the CPUs! But you can dust-off their fans and the
power leads to those fans.
The only reasons I can imagine for pulling them are
a) the CPU really, really dies
b) you decide to upgrade to 8-core or 16-core


> But of course you're right - this may be one reason.
>
>> c) bad caps
>
> Oops, changing caps is beyond my skills 8-<

On a multi-layer board like a mobo, this is beyond mine too, though I've fixed
up a flat-screen that died of this.


> [...]
>> While I sometimes have to pull memory and connectors, wipe the gold fingers
>> with an antistatic cloth, blow air into the connectors and replace all same,
>> I rarely have to replace capacitors.  I have more than enough low-end mobos
>> from the "Closet of Anxieties", but obviously that isn't the case with you.
>
> Indeed - this Supermicro mobo is a high-end mobo.

Right!  Oh, what is it, how much did it set you back?
Obviously this is not something I'd expect to find in the Closet of Anxieties!

But nevertheless, clean what contacts you can clean.

>
>> NEXT UP: quality of the PSU from a cold start.
>
>> Heck, it's getting cold and I turn the computer on before the forced air
>> heating has warmed the house ...
>
> I don't think this is a problem here, because the machines run the
> whole day, so everything is warm when one of them is switched off for a
> short while and then switched back on.

... your electricity bill, Bro, not mine!

Still, 'pull and polish'.


--
         A: Yes.
     >   Q: Are you sure?
     >>  A: Because it reverses the logical flow of conversation.
     >>> Q: Why is top posting frowned upon?


Re: mcelog: CPU 6 on socket 1 received Bus and Interconnect Errors in Other-transaction

Michael Hirmke
Hi Anton,

[...]
>> Indeed - this Supermicro mobo is a high-end mobo.

>Right!  Oh, what is it, how much did it set you back?

What do you mean by that?

>Obviously this is not something I'd expect to find in the Closet of
>Anxieties!

>But nevertheless, clean what contacts you can clean.

Yep.

[...]
>> I don't think this is a problem here, because the machines run the
>> whole day, so everything is warm when one of them is switched off for a
>> short while and then switched back on.

>... your electricity bill, Bro, not mine!

Solar power is one of my closest friends :))

>Still, 'pull and polish'.

Yep.

Bye.
Michael.
--
Michael Hirmke
