
scsi failure: bus or disk?

Discussion in 'Sun Hardware' started by Paul Douglas, Dec 27, 2003.

  1. Paul Douglas

    Paul Douglas Guest

    I woke up on Xmas morning to find my Ultra 30 had crashed. I tried to
    reboot but it got a little way into loading the O/S and hung. Another try
    produced the following messages very early in the boot process:

    WARNING:/[email protected],4000/[email protected] (glm0):
    SCSI bus DATA IN phase parity error
    WARNING:/[email protected],4000/[email protected] (glm0):
    Target 0 reducing sync. transfer rate

    and the boot halted soon after. After a couple more tries it did boot.

    I'm wondering whether the error is in the SCSI system or just the disk. The
    actual crash seems to have happened when the system tried to mount another
    disk in order to perform a backup. I'm still getting these problems with
    that disk removed (and they first occurred, according to the log, at a time
    when the 2nd disk wasn't involved). I have tried the 2nd disk with another
    machine with no problem.

    I ran test-all at the ok prompt and got no errors. However, whatever the error
    is it seems to be intermittent, since the machine did boot once.

    I'd be grateful if someone could tell me which component is at fault (or at
    least most likely to be at fault). I just don't know from looking at the
    log entries. These follow for info.

    Many thanks,

    Paul


    /var/adm/messages:

    Dec 24 07:53:16 avon glm: [ID 655122 kern.warning] WARNING: ID[SUNWpd.check_intcode.6006]
    Dec 24 07:53:16 avon scsi: [ID 107833 kern.warning] WARNING: /[email protected],4000/[email protected] (glm0):
    Dec 24 07:53:16 avon Resetting scsi bus, data overrun: got too much data from target from (0,0)
    Dec 24 07:53:16 avon genunix: [ID 408822 kern.info] NOTICE: glm0: fault detected in device; service still available
    Dec 24 07:53:16 avon genunix: [ID 611667 kern.info] NOTICE: glm0: Resetting scsi bus, data overrun: got too much data from target from (0,0)
    Dec 24 07:53:16 avon scsi: [ID 107833 kern.warning] WARNING: /[email protected],4000/[email protected] (glm0):
    Dec 24 07:53:16 avon Target 0 reducing sync. transfer rate
    Dec 24 07:53:16 avon glm: [ID 923092 kern.warning] WARNING: ID[SUNWpd.glm.sync_wide_backoff.6014]
    Dec 24 07:53:16 avon scsi: [ID 107833 kern.warning] WARNING: /[email protected],4000/[email protected] (glm0):
    Dec 24 07:53:16 avon got SCSI bus reset
    Dec 24 07:53:16 avon genunix: [ID 408822 kern.info] NOTICE: glm0: fault detected in device; service still available
    Dec 24 07:53:16 avon genunix: [ID 611667 kern.info] NOTICE: glm0: got SCSI bus reset
    Dec 24 07:53:16 avon scsi: [ID 107833 kern.warning] WARNING: /[email protected],4000/[email protected]/[email protected],0 (sd0):
    Dec 24 07:53:16 avon SCSI transport failed: reason 'reset': retrying command

    -----
    at this point, everything was still apparently running OK and I was unaware
    of any problem
    -----

    Dec 25 01:10:00 avon unix: [ID 836849 kern.notice]
    Dec 25 01:10:00 avon ^Mpanic[cpu0]/thread=300024eda80:
    Dec 25 01:10:00 avon unix: [ID 340138 kern.notice] BAD TRAP: type=31 rp=2a100664a10 addr=30 mmu_fsr=0 occurred in module "sd" due to a NULL pointer dereference
    Dec 25 01:10:00 avon unix: [ID 100000 kern.notice]
    Dec 25 01:10:00 avon unix: [ID 839527 kern.notice] mount:
    Dec 25 01:10:00 avon unix: [ID 520581 kern.notice] trap type = 0x31
    Dec 25 01:10:00 avon unix: [ID 381800 kern.notice] addr=0x30
    Dec 25 01:10:00 avon unix: [ID 101969 kern.notice] pid=1926, pc=0x11ad438, sp=0x2a1006642b1, tstate=0x4480001600, context=0xaff
    Dec 25 01:10:00 avon unix: [ID 743441 kern.notice] g1-g7: 1487c00, 0, 10000, 30003039508, 30002bdc508, 16, 300024eda80
    Dec 25 01:10:00 avon unix: [ID 100000 kern.notice]
    Dec 25 01:10:00 avon genunix: [ID 723222 kern.notice] 000002a100664740 unix:die+80 (31, 2a100664a10, 30, 0, 30001400b20, 30001400b38)
    Dec 25 01:10:00 avon genunix: [ID 179002 kern.notice] %l0-3: 0000000000000000 0000000001413460 000002a100664a10 000002a100664908
    Dec 25 01:10:00 avon %l4-7: 0000000000000031 00000300001bcd18 00000300001bcd40 0000030007dddf98

    -----
    the mount is the attempt to mount the backup disk

    after this, there's another 60 lines much like the one above, then the
    system goes down

    finally, here's one of the boot attempt messages:
    -----

    Dec 26 11:30:52 avon scsi: [ID 365881 kern.info] /[email protected],4000/[email protected] (glm0):
    Dec 26 11:30:52 avon Rev. 3 Symbios 53c875 found.
    Dec 26 11:30:52 avon pcipsy: [ID 370704 kern.info] PCI-device: [email protected], glm0
    Dec 26 11:30:52 avon genunix: [ID 936769 kern.info] glm0 is /[email protected],4000/[email protected]
    Dec 26 11:30:52 avon scsi: [ID 107833 kern.warning] WARNING: /[email protected],4000/[email protected] (glm0):
    Dec 26 11:30:52 avon SCSI bus DATA IN phase parity error
    Dec 26 11:30:52 avon glm: [ID 663555 kern.warning] WARNING: ID[SUNWpd.glm.parity_check.6010]
    Dec 26 11:30:52 avon scsi: [ID 107833 kern.warning] WARNING: /[email protected],4000/[email protected] (glm0):
    Dec 26 11:30:52 avon Target 0 reducing sync. transfer rate
    Dec 26 11:30:52 avon glm: [ID 923092 kern.warning] WARNING: ID[SUNWpd.glm.sync_wide_backoff.6014]
    Dec 26 11:30:52 avon scsi: [ID 193665 kern.info] sd0 at glm0: target 0 lun 0
    Dec 26 11:30:52 avon genunix: [ID 936769 kern.info] sd0 is /[email protected],4000/[email protected]/[email protected],0
     
    Paul Douglas, Dec 27, 2003
    #1

  2. CJT

    CJT Guest

    The first thing I would do is check that the SCSI bus is properly
    terminated and all the connections are tight.
     
    CJT, Dec 27, 2003
    #2

  3. Paul Douglas

    Paul Douglas Guest

    Thanks, but I have already tried that (and both disks are locked into
    the internal drive cage anyway, so there's no loose wiring involved).
     
    Paul Douglas, Dec 28, 2003
    #3
  4. Scott Howard

    Scott Howard Guest

    The internal drives in an E450 still have cables going to them - from
    the motherboard or PCI card to the backplane itself. At least some of
    these cables have a habit of getting pinched when the case is put back
    onto the E450 if you're not careful...

    Scott.
     
    Scott Howard, Dec 28, 2003
    #4
  5. Chris Newport

    Chris Newport Guest

    Sounds like it might be a known E450 problem.
    The disks plug into backplanes which can flex, leaving the disks
    only partially connected. The backplane mounting arrangement has
    been modified several times to fix this; early models are the
    worst.

    Shut down and remove the RH side cover.
    Unlatch all disks and slide out a few inches.
    For each disk in the central stack, support the rear of the
    backplane with your hand and insert the disk whilst watching
    through the side of the cage to ensure that the connector seats fully.
    Some fiddling and/or moderate force on the rear of the backplane
    may be needed to make sure that everything aligns correctly.
    Repeat for the outer stack.
     
    Chris Newport, Dec 28, 2003
    #5
  6. Paul Douglas

    Paul Douglas Guest

    Thanks everyone, it does seem to have been a loose connection to the
    backplane, and that has solved the problem.

    Paul
     
    Paul Douglas, Dec 29, 2003
    #6
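
For anyone triaging similar glm warnings, here is a minimal Python sketch (not something posted in the thread) that tallies the `ID[SUNWpd...]` tags and the "reducing sync. transfer rate" messages per target from raw log text. The function name `summarize` and the `SAMPLE` excerpt are illustrative assumptions; the sample lines are copied from the log above with the device paths left out. A tally like this only shows which target the driver is complaining about. It cannot by itself separate a bad disk from a bad bus, and in this thread the cause turned out to be a loose backplane connection.

```python
import re
from collections import Counter

# Tags of the form ID[SUNWpd.<name>.<number>], as seen in the glm warnings above.
TAG_RE = re.compile(r"ID\[(SUNWpd\.[A-Za-z0-9_.]+)\]")
# "Target N reducing sync. transfer rate" names the SCSI target being slowed down.
BACKOFF_RE = re.compile(r"Target (\d+) reducing sync\. transfer rate")

# Illustrative excerpt; one entry is deliberately wrapped across two lines.
SAMPLE = """\
Dec 26 11:30:52 avon SCSI bus DATA IN phase parity error
Dec 26 11:30:52 avon glm: [ID 663555 kern.warning] WARNING: ID[SUNWpd.glm.parity_check.6010]
Dec 26 11:30:52 avon Target 0 reducing sync.
    transfer rate
Dec 26 11:30:52 avon glm: [ID 923092 kern.warning] WARNING: ID[SUNWpd.glm.sync_wide_backoff.6014]
"""


def summarize(log_text):
    """Return (tag counts, sync-rate backoff counts per target) for raw log text."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    tags = Counter(TAG_RE.findall(flat))
    backoffs = Counter(int(target) for target in BACKOFF_RE.findall(flat))
    return tags, backoffs


if __name__ == "__main__":
    tags, backoffs = summarize(SAMPLE)
    for tag, count in tags.most_common():
        print(f"{count:4d}  {tag}")
    for target, count in sorted(backoffs.items()):
        print(f"target {target}: {count} sync-rate backoff(s)")
```

To run it against a real log, read the file and pass its contents in, for example `summarize(open("/var/adm/messages").read())`.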
