
Ultra 5 - RED State Exception

Discussion in 'Sun Hardware' started by Jeff Wieland, Apr 8, 2005.

  1. Jeff Wieland

    Jeff Wieland Guest

    Two days ago my Ultra 5 started locking up. The console would become
    entirely unresponsive -- the screen would be blank, and Stop-A would do
    nothing. I would have to power-cycle it to get it back.

    My Ultra 5 is running Solaris 8 6/00, with MU7 and current patches
    installed. It has the 400 MHz processor, and 512 MB of memory.

    When it locked up at one point yesterday afternoon, it actually
    panicked, and my wife wrote down the error messages (this is my personal
    workstation, BTW). She wrote down:

    WARNING: [AFT1] Uncorrectable Memory Error on CPU0
    Instruction access at TL=0, errID 0x00003956.17d8e88d

    AFSR 0x00000001<ME>.80300000<PRIV,UE,CE> AFAR 0x00000000.1f94ef00
    AFSR.PSYND 0x000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1014ef00
    UDBH 0x261<UE> UDBH.ESYND 0x61 UDBL 0x0000 UDBL.ESYND 0x00
    UDBH Syndrome 0x61 Memory Module DIMM3

    panic[cpu0]/thread=2a10007dd20: [AFT1] errID 0x00003956.17d8e88d UE Error(s)
    See previous messages for details
    ...
    bunch of numbers
    ...

    Syncing file systems ... 2 done
    dumping to /dev/dsk/c1t0d0s1, offset 214827008
    61% done


    It had hung during the dump. I opened the machine up last night,
    reseated the processor, memory, and PCI cards, and cleaned out the
    dust. I also checked that the fans are working. It ran about 15
    hours, until about 12:30 pm today. It was hung again completely, so
    this time I hooked ttya to ttya on my wife's Ultra 10. It hung again
    this afternoon, but this time I've got console messages:

    RED State Exception

    TL=0000.0000.0000.0005 TT=0000.0000.0000.0068
    TPC=0000.0000.f000.3048 TnPC=0000.0000.f000.304c TSTATE=0000.0044.0004.1401
    TL=0000.0000.0000.0004 TT=0000.0000.0000.0068
    TPC=0000.0000.f000.302c TnPC=0000.0000.f000.3030 TSTATE=0000.0044.0004.1401
    TL=0000.0000.0000.0003 TT=0000.0000.0000.0068
    TPC=0000.0000.f000.3014 TnPC=0000.0000.f000.3018 TSTATE=0000.0044.0004.1401
    TL=0000.0000.0000.0002 TT=0000.0000.0000.0068
    TPC=0000.0000.1000.a1ac TnPC=0000.0000.1000.a1b0 TSTATE=0000.0000.0000.1501
    TL=0000.0000.0000.0001 TT=0000.0000.0000.004e
    TPC=0000.0000.1003.524c TnPC=0000.0000.1003.5250 TSTATE=0000.0099.0000.1601


    Watchdog Reset
    Externally Initiated Reset


    The first error sounded like a bad memory module, but the second one
    would seem to point to a bad CPU?
     
    Jeff Wieland, Apr 8, 2005
    #1
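The angle-bracket annotations in the panic message are just named bits of the raw register values, so they can be checked by hand. Below is a minimal Python sketch of that decoding. The bit positions are assumptions: only the flags the posted log itself annotates are listed (ME, PRIV, UE, CE for the AFSR; UE, CE and a low syndrome byte for the data buffer register), and the helper names `parse_split_hex` and `decode_flags` are made up for illustration.

```python
# Assumed bit positions, restricted to flags that the posted log annotates.
# Other AFSR/UDB bits exist but are deliberately not decoded here.
AFSR_FLAGS = {32: "ME", 31: "PRIV", 21: "UE", 20: "CE"}
UDB_FLAGS = {9: "UE", 8: "CE"}


def parse_split_hex(text):
    """Turn a value printed as 'HIGH.LOW' (two 32-bit halves) into one integer."""
    high, low = text.split(".")
    return (int(high, 16) << 32) | int(low, 16)


def decode_flags(value, flag_bits):
    """Return the names of the flag bits set in value, highest bit first."""
    return [
        name
        for bit, name in sorted(flag_bits.items(), reverse=True)
        if (value >> bit) & 1
    ]


if __name__ == "__main__":
    afsr = parse_split_hex("0x00000001.80300000")
    udb = 0x261
    print("AFSR flags:", decode_flags(afsr, AFSR_FLAGS))
    print("UDB flags:", decode_flags(udb, UDB_FLAGS))
    print("UDB syndrome:", hex(udb & 0xFF))
```

With the values from the post, this reproduces the flags printed in the log and the 0x61 syndrome, which at least suggests the hand-copied numbers are internally consistent.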

  2. Bad memory board most likely. I'm sure others will walk you through
    figuring out which board is bad. Don't recall if the U5 requires a pair
    to be replaced at a time or not.
     
    Michael Vilain, Apr 8, 2005
    #2

  3. Scott Howard

    Scott Howard Guest

    Hard to tell, but most likely memory. If you've got enough memory in
    the machine, remove the DIMM and see if the problem occurs again. If it
    does, it's the CPU or the motherboard.

    Scott
     
    Scott Howard, Apr 8, 2005
    #3
  4. Jeff Wieland

    Jeff Wieland Guest

    I was thinking that it might be bad cache memory. Weren't there
    problems with cache on the faster Ultra-5/10 processors? Anyway,
    I swapped in a spare 270 MHz processor to see what happens. So
    far, it's been running since last night without problems (but running
    ssslllooowwwlllyyy :) ). If it runs through the weekend, I'm
    thinking that it's probably the processor.

    That first error message that I posted did make a reference to
    DIMM3, though...
     
    Jeff Wieland, Apr 8, 2005
    #4
  5. Ben

    Ben Guest

    Yeah, E$ on the USII 450MHz's had a problem a ways back, not sure how
    widespread it is/was. That could be it, but the only way to know is
    from the p/n.

    You can tweak the E$ scrubber rate from 100 to 1000 (lessen the
    frequency to reduce the chances of associated panics) or get a CPU with
    mirrored E$. For the system tweaks, try:

    set ecache_scrub_enable=1
    set ecache_scan_rate=1000
    set ecache_calls_a_sec=100

    These are dynamic IIRC so you could adb/mdb them into your running
    kernel. Your call, of course. And, again, the system tweaks just
    reduce the chances of those sort of panics; might be worth a try to save
    you some money.
     
    Ben, Apr 8, 2005
    #5
  6. Hans Surst

    Hans Surst Guest


    Take a look at the row of electrolytic capacitors beside the CPU module.
    There is a good chance they have burst or are bulging. If they have,
    your 270 MHz module will start throwing panics after a while too.
     
    Hans Surst, Apr 8, 2005
    #6
  7. Sunny

    Sunny Guest

    Indeed, I have fixed about a dozen Ultra 5/10 motherboards by replacing
    bulging filter caps. All filter caps should be replaced even if only one
    or two show visible signs of failure as the rest have tried to
    compensate and will soon follow.

    I have not seen this problem on SPARCstations or Ultra 1/2s despite
    their even more advanced age - possibly better-quality components were used.

    Sunny
     
    Sunny, Apr 9, 2005
    #7
  8. Hi Jeff,

    Have a look at the first error that occurred on the system:

    WARNING: [AFT1] Uncorrectable Memory Error on CPU0

    It looks like an uncorrectable memory error, reported by CPU 0.

    My recommendation is to replace the DIMM it names (DIMM3).

    If all four memory slots are populated, replace two modules (two slots
    must be populated with DIMMs).

    Test the system. If no more errors arise, move the two DIMMs into the
    other slots and test again.

    If the system is still stable, the DIMM you replaced is faulty.

    regards

    thomas

     
    Thomas Weidner, Apr 9, 2005
    #8
  9. Ben

    Ben Guest


    He said he swapped the CPU and it continues to run, which
    circumstantially suggests that the original CPU is the culprit. Anyone
    know what syndrome 0x61 is, btw? Knowing that would help.

    In lieu of that, it would be good to trace it: swap the current DIMM3
    with the module in another slot and see if the error follows it; run
    OBP diagnostics at the max level and stress it with VTS (CPU & Mem) to
    more quickly scare out anything hiding under the covers.
     
    Ben, Apr 9, 2005
    #9
  10. Jeff Wieland

    Jeff Wieland Guest

    It hung again this morning at 1:50 am:

    RED State Exception

    TL=0000.0000.0000.0005 TT=0000.0000.0000.0010
    TPC=0000.0000.1000.4200 TnPC=0000.0000.1000.4204 TSTATE=0000.0091.0000.1501
    TL=0000.0000.0000.0004 TT=0000.0000.0000.0010
    TPC=0000.0000.1000.4200 TnPC=0000.0000.1000.4204 TSTATE=0000.0091.0000.1501
    TL=0000.0000.0000.0003 TT=0000.0000.0000.0010
    TPC=0000.0000.1000.44c0 TnPC=0000.0000.1000.44c4 TSTATE=0000.0091.0000.1501
    TL=0000.0000.0000.0002 TT=0000.0000.0000.0024
    TPC=0000.0000.1000.716c TnPC=0000.0000.1000.70d0 TSTATE=0000.0091.0000.1500
    TL=0000.0000.0000.0001 TT=0000.0000.0000.0010
    TPC=0000.0000.1000.7170 TnPC=0000.0000.1000.7174 TSTATE=0000.0000.0000.1400

    So, it's not the processor. I swapped the 400 MHz processor back in,
    and pulled the memory out of slots 3 & 4. I haven't run VTS on it yet
    -- I probably should -- maybe tonight. I had been looking forward to
    getting a 440 MHz module, and getting 10% more speed :).

    I have been stressing it by compiling Mozilla. With the 400 MHz
    processor, it takes about 6 hours. It took nearly 16 hours with the
    270 MHz module! I'm compiling with Forte 6 Update 2, BTW. It has hung
    when doing essentially nothing, though. My 1.5 GHz Sun Blade 1500 at
    work will build Mozilla in 90 minutes.

    The capacitors *look* OK to me. If they did need replacing, that's way
    beyond my soldering skill level. There are a couple of people around
    Purdue here that probably could do it for me, if I ask them nicely.
     
    Jeff Wieland, Apr 9, 2005
    #10
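Each pair of lines in a RED State Exception dump is one level of the trap stack, and the TT field says which trap was taken at that level. The sketch below pulls the TL/TT pairs out of a pasted dump and labels them. The trap names are my reading of the SPARC V9 / UltraSPARC trap table and cover only the codes seen in this thread, so treat them as assumptions and check them against the processor manual; `TRAP_NAMES`, `dotted_hex` and `decode_trap_stack` are illustrative names.

```python
import re

# Assumed names for the trap types that appear in this thread only.
TRAP_NAMES = {
    0x010: "illegal_instruction",
    0x024: "clean_window",
    0x04E: "interrupt_level_14",
    0x068: "fast_data_access_MMU_miss",
}

# Matches the "TL=... TT=..." line of each trap-stack entry.
LINE_RE = re.compile(r"TL=([0-9a-f.]+)\s+TT=([0-9a-f.]+)", re.IGNORECASE)


def dotted_hex(text):
    """Convert OBP-style dotted hex ('0000.0000.0000.0068') to an integer."""
    return int(text.replace(".", ""), 16)


def decode_trap_stack(dump):
    """Return (trap_level, trap_type, name) tuples in the order they appear."""
    frames = []
    for match in LINE_RE.finditer(dump):
        level = dotted_hex(match.group(1))
        trap_type = dotted_hex(match.group(2))
        frames.append((level, trap_type, TRAP_NAMES.get(trap_type, "unknown")))
    return frames


if __name__ == "__main__":
    # TL/TT lines copied from the first dump in the thread.
    sample = """
    TL=0000.0000.0000.0005 TT=0000.0000.0000.0068
    TL=0000.0000.0000.0004 TT=0000.0000.0000.0068
    TL=0000.0000.0000.0003 TT=0000.0000.0000.0068
    TL=0000.0000.0000.0002 TT=0000.0000.0000.0068
    TL=0000.0000.0000.0001 TT=0000.0000.0000.004e
    """
    for level, trap_type, name in decode_trap_stack(sample):
        print(f"TL={level} TT={trap_type:#05x} {name}")
```

Read this way, both dumps show the same trap repeating at each level until the trap stack is exhausted, which is consistent with the handler itself faulting. The trap names alone do not say whether memory, CPU, or motherboard is to blame.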
  11. Hi Jeff,

    If the error occurs again (e.g. a RED State Exception) with the 400 MHz
    CPU module installed and without the DIMMs in slots 3 & 4, it is possible
    that the mainboard is faulty.

    Can you tell me the mainboard's part number? It has the following
    format:

    375-xxxx
    e.g.
    375-0115

    regards

    Thomas
     
    Thomas Weidner, Apr 10, 2005
    #11
  12. Jeff Wieland

    Jeff Wieland Guest

    Thomas,
    According to prtconf -vp, it's a 375-0066. So far it's been running
    about 31 hours without locking up, with DIMMs 3 & 4 removed. It does
    page a bit more now, though :).
     
    Jeff Wieland, Apr 10, 2005
    #12
  13. Hi Jeff,

    That looks very good. I think one of the two DIMMs (DIMM3) is faulty.

    regards

    Thomas
     
    Thomas Weidner, Apr 11, 2005
    #13
  14. Jeff Wieland

    Jeff Wieland Guest

    It ran for 2 1/2 days without modules 3 & 4, so I pulled out 1 & 2,
    and put 3 & 4 in the first two slots. It's been running like this
    for about 7 hours now -- we'll see what happens. I'm thinking that
    I have either a bad DIMM or a bad motherboard.
     
    Jeff Wieland, Apr 12, 2005
    #14
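The swap test described above separates a bad module from a bad slot with two observations: whether the machine is stable with the suspect DIMMs removed, and whether it stays stable when those same DIMMs are moved into slots known to work. A minimal sketch of that reasoning follows. The function name and return labels are hypothetical, and it assumes each run is long enough to catch an intermittent fault.

```python
def interpret_swap_test(stable_without_suspects, stable_with_suspects_in_good_slots):
    """Say what the two-step DIMM swap test points at.

    stable_without_suspects: the system ran cleanly with the suspect DIMMs
        pulled and only the other pair installed.
    stable_with_suspects_in_good_slots: the system ran cleanly with the
        suspect DIMMs moved into the slots that worked in the first step.
    """
    if not stable_without_suspects:
        # Removing the suspects did not help, so they are not the only problem.
        return "not isolated"
    if stable_with_suspects_in_good_slots:
        # The modules work elsewhere, so their original slots are suspect.
        return "slots or motherboard"
    # The fault followed the modules into the good slots.
    return "dimm"


if __name__ == "__main__":
    print(interpret_swap_test(True, True))
    print(interpret_swap_test(True, False))
    print(interpret_swap_test(False, True))
```

A single stable run is weak evidence when the fault shows up only every day or two, so each step is worth running for at least as long as the longest gap seen between hangs.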
  15. Jeff Wieland

    Jeff Wieland Guest

    It failed hard today -- with any memory in slots 3 & 4, all I get is a
    constant stream of "red state exceptions". I'm going to get another
    motherboard to try. At least it's working fine with 256 MB :).
     
    Jeff Wieland, Apr 16, 2005
    #15
