On 2009-07-04, glennklockwood wrote:
> Hi again.
>
> I've been having a peculiar problem with my Sun Blade 1000, which I
> just upgraded to dual 900MHz US-III Cu processors. The system worked
> fine for a few days, but today I found that when I pressed the power
> button, it powered on for two to three seconds (fans start, front
> lights up), then it shut off. There isn't enough time for anything to
> come across the serial console, and subsequent attempts to power on
> result in the same thing happening. I was finally able to get the
> system to power on and boot, but almost immediately I got a thermal
> shutdown notice (citing a temperature of 127--both fans were working
> though).
Hmm ... that is rather low for a shutdown -- depending on which
temperature scale is in use. In my Sun Fire 280R (which has the RSC
(Remote System Control) card for remote monitoring) the temperatures
are:
            Actual        Warn          Fail
            F     C       F     C       F     C
RSC card    91    33      212   100     230   110
CPU0        131   55      199   93      203   95
CPU1        127   53      199   93      203   95
This is with two 900 MHz Cu CPUs.
The actual CPU temperatures in the spare Sun Fire 280R (for experimentation)
which has 750 MHz (non-Cu) CPUs are somewhat hotter.
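To make the point about scales concrete, here is a minimal Python sketch that reads the reported 127 both ways and checks it against the CPU limits in the table above. The limits are only the ones my 280R's RSC reports; whether the Blade 1000 uses the same ones is an assumption, and the function names are made up for illustration.

```python
# Limits in degrees C, copied from the RSC readings in the table above
# (Sun Fire 280R). Assumption: other systems may use different limits.
THRESHOLDS_C = {
    "RSC card": {"warn": 100, "fail": 110},
    "CPU0": {"warn": 93, "fail": 95},
    "CPU1": {"warn": 93, "fail": 95},
}


def f_to_c(deg_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32) * 5 / 9


def c_to_f(deg_c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return deg_c * 9 / 5 + 32


def classify(sensor, deg_c):
    """Return 'ok', 'warn' or 'fail' for a reading in degrees C."""
    limits = THRESHOLDS_C[sensor]
    if deg_c >= limits["fail"]:
        return "fail"
    if deg_c >= limits["warn"]:
        return "warn"
    return "ok"


reported = 127  # figure from the shutdown notice; its units are unknown

# Read as Fahrenheit, it is an ordinary running temperature.
print(round(f_to_c(reported), 1), classify("CPU0", f_to_c(reported)))

# Read as Celsius, it is far past the fail limit.
print(classify("CPU0", reported))
```

So a 127 in Fahrenheit should never trigger a shutdown, while a 127 in Celsius is well beyond anything a working CPU with spinning fans should reach.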
> I should point out that my other Blade 1000 showed very similar power-
> on difficulties with its dual 750MHz processors. I finally got fed up
> with having to hit the power button several times before it would stay
> on, and I swapped the HDs and memory from that machine to this current
> one which also started with dual 750's. This power-on problem only
> started happening again after the upgrade to 900's. With the older
> machine, though, I never got thermal warnings after finally powering
> on; in fact, if I could actually get it to boot, it would stay up for
> days without any problems.
Note that for the Sun Fire 280R (same system board and CPUs) the
fan tray has three fans -- one for the PCI and UPA cards, one for the
CPUs, and one for the memory DIMMs. And when upgrading from the non-Cu
CPUs to Cu types you are warned to replace the fan tray with a later
model. I've compared both types and find the only visible difference is
that the later model (for the Cu CPUs) has a 14W fan in the center (CPU)
position instead of a 7W fan. Note that the fans in a SB-2000 do not
seem to be any higher power than those in the SB-1000.
> I've seen this sort of behavior (machine will power everything for a
> few seconds before shutting down) in an x86 machine that had a faulty
> motherboard, but this is now happening in two separate motherboards so
> I am skeptical that this is the case here. I found that unplugging
> the DVD and disks did not help anyway, so the problem must lie with
> the CPUs, motherboard, memory, or power supply.
Or -- possibly the CPU to system board connection. It is
possible that some dust is obscuring contacts which allow reading the
temperatures on the CPUs, and as a result the CPUs are *sensed* as
running hot while they are not *really* hot. I would suggest removing
each CPU, spraying the connectors on both the CPUs and the system board
with a *good* contact cleaner, and re-seating the CPUs. Try one CPU at a
time in slot 0 and see if one has the problem and the other does not.
If one has the problem and the other does not, is it possible
that someone removed the heat sinks from that CPU module and then
re-attached them -- perhaps reducing the thermal conductivity while
doing so? I've seen one eBay vendor who seems to remove the heatsinks
to photograph the actual CPU chip -- something which *I* would not do,
and I would not buy from that vendor. If the photos show the CPU chip
instead of the barcode label, skip that vendor.
Also -- which style of torque wrench does your system have?
There are two -- one (the older style) is the wire bent into a circle
and you tell the torque limit by the ends of the circle touching, and
the other (newer style) is a torque limiting screwdriver with a dayglo
green handle which slips with a click when the proper torque is reached.
I have seen Sun documents (in PDF format) which suggest that the
later design is far to be preferred. The green torque limiting
screwdriver fits in a clip in the cage where the DVD drive, the smart
card and the floppy drives are all mounted. Look for a dayglo green,
the same as the ring around the Robertson (square drive) sockets in the
CPU modules.
The old style torque driver lives in a green plastic carrier
which slides in between the two internal disk drives. Older SB-1000s
have that style. Newer ones and older SB-2000s have the torque
screwdriver style. Newer SB-2000s come without a torque wrench at all,
with the assumption that when you buy new CPUs from Sun, you will
receive a new torque driver with each CPU. If you don't have *any*
torque limiting screwdriver, Utica makes some very nice adjustable ones
(quite expensive), and you want one which will reach 5 inch-pounds IIRC.
The screwdrivers from Utica come in both inch-pounds range and in
inch-ounces range. The inch-pounds ones go down to 6 inch-pounds, but
you can fudge it to one step lower to get the five you need. If you use
an inch-ounces model instead, multiply the five inch-pounds by 16 to get
80 inch-ounces.
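The arithmetic is trivial, but for anyone working with a driver marked in other units, a small Python sketch of the conversion follows. The 5 inch-pound target is the from-memory figure above, not a number checked against Sun's documents, and the function names are just illustrative.

```python
OUNCES_PER_POUND = 16
# Standard conversion factor: 1 inch-pound is about 0.1129848 newton-metres.
NEWTON_METRES_PER_INCH_POUND = 0.1129848


def inch_pounds_to_inch_ounces(in_lb):
    """Convert a torque in inch-pounds to inch-ounces."""
    return in_lb * OUNCES_PER_POUND


def inch_pounds_to_newton_metres(in_lb):
    """Convert a torque in inch-pounds to newton-metres."""
    return in_lb * NEWTON_METRES_PER_INCH_POUND


target_in_lb = 5  # IIRC value from the text; verify before relying on it

print(inch_pounds_to_inch_ounces(target_in_lb))              # 80
print(round(inch_pounds_to_newton_metres(target_in_lb), 3))  # 0.565
```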
It is quite important to follow the instructions for removing
and replacing the CPU modules. If you don't, you can damage the
connectors on either the system board or the CPU modules. Also, trying
to remove them by turning too many turns at one end before going to the
other can cause the circlip to pop off the jack screw and get lost
inside the system along with the thrust washers.
> Does anyone have any ideas? I was hoping that the power-on issues I
> was having with my first Blade 1000 would be fixed by moving the ram/
> disks to this new one, but now they've cropped up again and I don't
> know what component would be at fault here.
CPUs and their mounting (connectors, torque, and proper mounting
of heat sinks to CPUs) are most likely the cause if you are quickly
getting overtemperature warnings and the fans are running.
Also -- for other problems (assuming that you have eight DIMMs
in the system) pull out four and see what happens. If that still gives
the same problems, pull the remaining four and replace them with the
four which you first pulled.
Once you have a group of four identified as problematical, you
can swap out one at a time with others (assuming that both sets of four
are the same size) to identify the actual bad DIMMs. I found that two
512 MB DIMMs were flakey out of eight -- and one flakey one was in each
set, so I had to do a lot of swapping to get a known good set before
testing the others to truly identify the bad ones.
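The swapping procedure can be sketched in Python as below. This is only an illustration of the logic: `passes` is a hypothetical stand-in for "install these four DIMMs, power on, and see whether the system behaves", and the sketch assumes that at least four DIMMs are good and that a bad DIMM fails every time (a truly flaky one may need repeated tests). It also searches all sets of four for a good one, starting with the first four, which is more exhaustive than what you would want to do by hand.

```python
from itertools import combinations


def find_bad_dimms(dimms, passes):
    """Return the DIMMs that cause a four-DIMM set to fail.

    `passes` is a hypothetical callback taking a collection of four
    DIMM labels and returning True if the system runs cleanly with them.
    """
    # Step 1: find any set of four that passes; this is the known-good set.
    good = next((set(c) for c in combinations(dimms, 4) if passes(c)), None)
    if good is None:
        raise RuntimeError("no set of four passes; suspect something else")

    # Step 2: swap each untested DIMM in for one known-good DIMM.
    spare = sorted(good)[0]
    bad = []
    for dimm in dimms:
        if dimm in good:
            continue
        trial = (good - {spare}) | {dimm}
        if not passes(trial):
            bad.append(dimm)
    return bad


# Simulated case like the one described: one bad DIMM in each set of four.
dimms = ["A1", "A2", "A3", "A4", "B1", "B2", "B3", "B4"]
actually_bad = {"A3", "B2"}


def simulated_passes(group):
    return not (set(group) & actually_bad)


print(find_bad_dimms(dimms, simulated_passes))  # ['A3', 'B2']
```

With one bad DIMM in each original set, neither set of four passes on its own, which is why a known-good set has to be assembled first.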
You have my thoughts above.
Good Luck,
DoN.
--
Email: <> | Voice (all times): (703) 938-4564
(too) near Washington D.C. |
http://www.d-and-d.com/dnichols/DoN.html
--- Black Holes are where God is dividing by zero ---