Paul wrote:
> Mark Jeynes wrote:
>> I run a system with this mobo and Ubuntu 8.04 server 64-bit. Recently
>> the system has been hanging after about 20-30 mins, sometimes
>> reporting ata disk errors.
>>
>> If I reboot, the board boots as far as "Mouse Initialized" then hangs
>> for 3 mins or so. After this it reports an error with disk 1.
>>
>> Powering down for 3-4 mins and rebooting, the system will boot
>> successfully again. Only it hangs 20 mins or so later and we're back
>> into the same loop. The hang seems to occur when there is substantial
>> SATA activity. After boot my 3 SATA disks start a RAID resync which
>> says it will take 240 mins to complete - all disk activity stops 15
>> mins or so later and the system is frozen. Same result on a variety
>> of recent kernels.
>>
>> The same condition occurs if I do not start the RAID software and run
>> concurrent but independent disk integrity checking software on the
>> SATA disks (i.e. I'm using the 'badblocks' utility under Linux).
>>
>> Right now I'm unsure if my issue lies with the disks or the mobo; it's
>> hard to isolate the problem when you are dependent on the mobo to test
>> the disks and vice versa!
>>
>> Any words of wisdom from you clever people will be gratefully received!
Firstly I'd like to say a hearty thank you, Paul. Just seeing a reply
this morning made me feel I'm not alone on this planet. Cheers, mate.
>
> One thing I've noticed here, as a home user, is that if a SATA disk
> has a problem, there doesn't appear to be a mechanism to reset the
> disk interface from the motherboard. When something similar happened
> to me, I had to power cycle, before the hard drive was reset and
> could be seen again.
>
> The fact that a reboot after a failure in your case, results in an
> "error with disk1", which is cleared by powering down, suggests the
> disk is the part that is hung up, rather than the motherboard.
> The chipset should be resettable, on the reboot, so I wouldn't expect
> it to stay in a stuck state.
>
> Have you tried downloading the disk diagnostic from the disk
> manufacturer website ?
I did today ... on your advice (thank you! I'd not considered they would
offer such a thing). The tool says my disks are fine :-) (phew)
>
> Is there a chance the disk(s) are overheating ?
Possibly ... and I know they have in the past (smartctl told me). I did
have them stacked in one of those 5-in-3 backplane caddies. Probably
not a good idea when 3 neighbouring RAID disks decide to do a total
resync. There's not much room in there for airflow, so that kind of
workload means things will get steamy - even though it's backed with a
fan that could suck a golfball through six feet of hose.
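For anyone wanting to keep an eye on this during a resync, here is a
rough Python sketch of how the temperature could be pulled out of
`smartctl -A` output. It assumes smartmontools is installed, that the
drive reports the common Temperature_Celsius attribute (not all drives
do), and that the raw value starts with the reading in degrees C. The
function names and the /dev/sda device path are just my placeholders.

```python
import re
import subprocess


def parse_temperature(smartctl_output):
    """Return the temperature in Celsius from `smartctl -A` text, or None.

    Assumes the usual attribute table layout, where the raw value is the
    10th whitespace-separated column of the Temperature_Celsius row.
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == "Temperature_Celsius":
            # Raw value may look like "38" or "38 (Min/Max 20/52)".
            match = re.match(r"\d+", fields[9])
            if match:
                return int(match.group())
    return None


def read_temperature(device):
    """Run `smartctl -A` on a device (usually needs root) and parse it."""
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True,
        text=True,
    )
    return parse_temperature(result.stdout)
```

Calling read_temperature("/dev/sda") in a loop for each disk while the
resync runs should show whether the drives are creeping up in
temperature before a hang. It returns None if the attribute is missing.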
>
> Does the power supply have enough 12V amps for all
> the loads you have connected ?
I should say so ... it's a nice 750W supply from Silverstone.
Hopelessly overdone but you know how gadget-lust takes over when
shopping for machine parts.
>
> You could also try testing the disks as simple data disks on
> another computer. You could use something like the free version
> of HDTune for Windows, as a test stimulus for the drives (i.e. no need
> for the OS to see a file system on the drive, to test it). HDTune has
> a read benchmark, that reads the disk surface, and also has an error scan.
> It also reads drive temperature via SMART (that is, as long as
> the port the disk is connected to, can issue SMART commands).
>
> http://www.hdtune.com/download.html
Wow. I'm humbled by your knowledge of this topic and very, very
grateful (there you go, I said I would be).
I believe my main problem is my desire to silence the machine as far as
practical, so after turning off most case fans it was getting a bit hot
in there. Though I can't back this with hardcore science, I've
discovered that re-enabling a couple of case fans today gave me several
hours of uptime, enough to complete the resync. This brought my recovery
task to critical mass - resync done means disk activity mainly stops and
the cause of the problem, the heat, subsides. That's my theory - but the
beautiful truth is it's still going now. Now I can turn to a
preventative course of action rather than a desperate recovery task.
Thanks again.
>
> Paul