Motherboard Forums



clock stops, U5/10 Solaris 8

Jim Prescott

 
      07-11-2003, 07:28 PM


[crossposted comp.unix.solaris,comp.sys.sun.hardware]

A couple times a month the clock on one of our systems stops. Actually
it gets stuck in a 3 second loop. Eg:
Thu Jul 10 08:21:10 EDT 2003
Thu Jul 10 08:21:08 EDT 2003
Thu Jul 10 08:21:09 EDT 2003
If we reset the time it then runs fine for a few weeks. Other than
having the wrong time, the system seems to run fine. We normally keep
time synchronized with a fairly old version of xntpd but as a test we
turned it off for a few weeks and the problem still occurred. The system
didn't lose track of time or its settings after being unplugged for ~20
minutes so I suspect the motherboard battery is fine.

Since the problem initially occurred in one of our key servers we swapped
disks, PCI cards, RAM and IDPROM with an otherwise identical machine; the
problem followed the motherboard & chassis.

There doesn't seem to be anything odd running. Our users don't have
root so I don't think they are causing it (unless it is some kind of
residue from an attack/intrusion).

We were all set to just write it off as a broken machine when suddenly
another system started having the exact same symptoms. One is a U5
360MHz, the other a U10 440MHz, both purchased around 8/2000. Both
are running Solaris 8 with the recommended patch set as of 108528-19
(a couple months out of date now but the problem had existed prior to
those patches too).

Has anyone else seen this? Any thoughts on how to proceed? I'll be
installing the latest recommended patchset but since the problem occurs
rarely it'll be a while before we'll know if that actually helped.
--
Jim Prescott - Computing and Networking Group
School of Engineering and Applied Sciences, University of Rochester, NY
 
 
Dr. David Kirkby

 
      07-14-2003, 12:58 AM
Jim Prescott wrote:
>
> [crossposted comp.unix.solaris,comp.sys.sun.hardware]
>
> A couple times a month the clock on one of our systems stops. Actually
> it gets stuck in a 3 second loop. Eg:
> Thu Jul 10 08:21:10 EDT 2003
> Thu Jul 10 08:21:08 EDT 2003
> Thu Jul 10 08:21:09 EDT 2003


> We were all set to just write it off as a broken machine when suddenly


If a patch does not cure it (as others have suggested it might), you
may think it worth the time/effort to replace the NVRAM chip. It
depends on how much you value your time (excuse the pun), but that is
the hardware device that keeps the time. They cost little and are in
sockets that make them easy to remove and re-program. If you do this,
take a note of the MAC address and hostid before doing the swap.
There's a good FAQ on the web on the NVRAM chip.

I've nothing to confirm that chip would solve the problem, but it is
by far the most likely cause IF it's a hardware problem.
--
Dr. David Kirkby,
Senior Research Fellow,
Department of Medical Physics,
University College London,
11-20 Capper St, London, WC1E 6JA.
Tel: 020 7679 6408 Fax: 020 7679 6269
Internal telephone: ext 46408
e-mail
 
 
Andy Lennard

 
      07-14-2003, 07:38 AM
Dr. David Kirkby writes:
>Jim Prescott wrote:
>>
>> [crossposted comp.unix.solaris,comp.sys.sun.hardware]
>>
>> A couple times a month the clock on one of our systems stops. Actually
>> it gets stuck in a 3 second loop. Eg:
>> Thu Jul 10 08:21:10 EDT 2003
>> Thu Jul 10 08:21:08 EDT 2003
>> Thu Jul 10 08:21:09 EDT 2003

>
>> We were all set to just write it off as a broken machine when suddenly

>
>If a patch does not cure it (as others have suggested it might), you
>may think it worth the time/effort to replace the NVRAM chip. It
>depends on how much you value your time (excuse the pun), but that is
>the hardware device that keeps the time. They cost little and are in
>sockets that make them easy to remove and re-program. If you do this,
>take a note of the mac address and hostid before doing the swap.
>There's a good FAQ on the web on the nvram chip.
>
>I've nothing to confirm that chip would solve the problem, but it is
>by far the most likely cause IF it's a hardware problem.


That's interesting. I'd always assumed that current time was held within
'the kernel' somewhere, and that the NVRAM was only read at boot time,
and written to maintain consistency. Now oddly interested, does anyone
have any pointers to a description of how the OS interacts with the
NVRAM clock timer? Thanks.

--
Andrew Lennard
 
 
Brian Utterback

 
      07-14-2003, 04:13 PM
It sounds like the hardware TOD clock is stuck. The next time this happens try running
the following script:

adb -k /dev/ksyms /dev/mem <<EOF
clock_adj_hist/4E
adj_hist_entry/D
lbolt/D
lbolt64/E
EOF

If the values in the array "clock_adj_hist" are relatively close together (less than 5000
apart), then the problem is very likely a stuck TOD clock.
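
As a rough way to apply that rule of thumb, the check can be sketched in Python. This is only an illustration: the function name is made up, the four values are assumed to be copied by hand from the adb output, "apart" is read as the gap between neighbouring values once sorted, and entries of 0 are assumed to be unused slots.

```python
def looks_like_stuck_tod(adj_hist, max_gap=5000):
    """Apply the 'less than 5000 apart' rule of thumb to clock_adj_hist.

    adj_hist: the values printed for clock_adj_hist, typed in by hand.
    Zero entries are skipped (assumed here to be unused slots).
    """
    values = sorted(v for v in adj_hist if v != 0)
    if len(values) < 2:
        return False  # not enough adjustments recorded to judge
    gaps = [later - earlier for earlier, later in zip(values, values[1:])]
    return all(gap < max_gap for gap in gaps)


# Made-up sample values, not taken from a real system.
print(looks_like_stuck_tod([812345, 814900, 817120, 819990]))  # True
```

A True result only says the adjustments were bunched together; it is a hint to look at the TOD hardware, not proof.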

You can set the parameter "tod_validate_enable" to enable hardware TOD validation. Just
add this line to /etc/system:

set tod_validate_enable = 1

You asked for a description of the interaction of the hardware TOD clock and the in-kernel
software clock. Okay, here goes:

As you surmised, the hardware clock is used at boot time to set the initial value of the
software clock. However, this is not the last time it is used. It is also used to double
check the value of the software clock. To understand this, you need to understand the
way the clock is implemented in the kernel.

The software clock is simply a counter that is incremented each "tick". The tick comes from
a programmable oscillator, generally programmed to raise an interrupt every 100th of
a second. In the interrupt handling routine, all of the periodic system maintenance
occurs, including incrementing the clock tick.

There is a problem with this setup, however. The tick is not the highest level interrupt.
If a higher priority interrupt is being serviced when the tick fires, the tick is masked.
Unlike other interrupts, the tick does not queue, it is just dropped. Thus the kernel clock
will run slow, but not in a uniform way.

To counteract this, the kernel compares the software clock and the hardware clock to see if
they match. If they don't, then the software clock is reset to match the hardware clock,
since the hardware clock does not have the same problem of losing ticks.
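
A toy model of the lost-tick problem, in Python rather than kernel code, may help. The names and the numbers are invented for illustration; the only things taken from the description above are the rate of 100 ticks per second and the fact that a masked tick is dropped rather than queued.

```python
HZ = 100  # tick interrupts per second, as described above


def software_seconds(elapsed_ticks, dropped_ticks):
    """Seconds counted by a tick-driven clock that misses some ticks.

    dropped_ticks: indices of ticks that fired while a higher priority
    interrupt was being serviced and were therefore never counted.
    """
    lost = len({t for t in dropped_ticks if 0 <= t < elapsed_ticks})
    return (elapsed_ticks - lost) / HZ


# One hour of ticks with 250 of them dropped in two bursts.
dropped = set(range(1000, 1100)) | set(range(200000, 200150))
print(software_seconds(3600 * HZ, dropped))  # 3597.5
```

In this made-up hour the hardware clock would read 3600 seconds while the software clock has only counted 3597.5, which is the kind of gap the comparison is there to catch.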

Whenever the software clock is "set" by an external means such as the settimeofday call, the
value of the hardware clock is reset to match the software clock.

This whole process is further complicated by the fact that the hardware clock has a one second
resolution and the software clock has a nanosecond resolution. This simple fact is at the heart
of many of the bugs encountered by this setup. Since the minimum possible difference is one
second, we would like to fix the software clock when the two clocks are apart by one second.
But because of the discrete nature of the hardware clock, it is possible to detect a one second
difference when they really are only one nanosecond apart, but not detect any difference if they
differ by as much as 1.999999 seconds. So, to be sure that they are at least one second apart,
we look for a numeric difference of at least 2 seconds. Thus, any Solaris system, just sitting
there, idle, might experience a 2 second clock jump forwards or backwards at any time. This is
the symptom of the bug that Paul mentioned earlier. However, it is not the actual jump that is
the bug, but the frequency of the jumps that is the real bug.
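
The resolution argument can be made concrete with a few lines of Python. This is a sketch of the arithmetic only, not the kernel's code: the function names are hypothetical, and truncating both readings to whole seconds is an assumption standing in for the one second hardware resolution.

```python
NS = 1_000_000_000  # nanoseconds per second


def observed_diff(sw_ns, hw_ns):
    """Difference in whole seconds, which is all the comparison can see."""
    return sw_ns // NS - hw_ns // NS


def needs_correction(sw_ns, hw_ns):
    """Only act when the whole-second readings differ by 2 or more."""
    return abs(observed_diff(sw_ns, hw_ns)) >= 2


# Really 1 ns apart, but on opposite sides of a second boundary.
print(observed_diff(100 * NS, 100 * NS - 1))               # 1
# Really 1.999999 s apart.
print(observed_diff(101 * NS + 999_999_000, 100 * NS))     # 1
# A reading of 2 guarantees the clocks are more than 1 s apart.
print(needs_correction(102 * NS, 100 * NS + 999_999_999))  # True
```

The first two cases give the same reading of 1 for very different real gaps, which is why a reading of 1 cannot be trusted and the threshold ends up at 2.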

This is all made even worse with the advent of the SunFire line of systems. These systems only maintain
a single hardware clock for all of the domains, with an "offset" stored for each domain. Keeping
all of the possible discrete transitions and reading and writing straight led to a number of bugs
in this code.

However, there is hope in sight. As of Solaris 8, the tick handling routines were changed to
use kernel cyclic timers instead of an interrupt handler. Kernel cyclic timers are not subject
to losing "ticks". So, starting in Solaris 8, the kernel clock should no longer run slow. It
may or may not keep better time than the hardware clock, but they should be on a par now, with
no reason to prefer one over the other.


--
blu

Brian's 12th rule of support: Supporting any technology
that has something called an "oid", will hurt.
--------------------------------------------------------------------------------
Brian Utterback - Solaris Sustaining (NFS/Naming) - Sun Microsystems Inc.,
Ph/VM: 781-442-1343, Em:brian.utterback-at-ess-you-enn-dot-kom

 
 
Ivan Richwalski

 
      07-15-2003, 09:17 AM

Jim Prescott wrote:
>
> A couple times a month the clock on one of our systems stops. Actually
> it gets stuck in a 3 second loop. Eg:
> Thu Jul 10 08:21:10 EDT 2003
> Thu Jul 10 08:21:08 EDT 2003
> Thu Jul 10 08:21:09 EDT 2003
> If we reset the time it then runs fine for a few weeks. Other than
> having the wrong time, the system seems to run fine. We normally keep
> time synchronized with a fairly old version of xntpd but as a test we
> turned it off for a few weeks and the problem still occurred. The system
> didn't lose track of time or its settings after being unplugged for ~20
> minutes so I suspect the motherboard battery is fine.


I had this same problem last month with a U10/440MHz, purchased 7/2000.
At the time, the system had an uptime of just over 400 days, so there
hadn't been any recent system changes. The clock did the same thing,
where the clock would get stuck in a 3 second loop. Also, while
watching the clock with a simple "date;sleep 1" loop, I would see the
clock make huge jumps in time (days, months and even years - both
backwards and forwards) right before getting stuck. Most of the time
the huge jump would only last for 1 second before jumping back to
regular time and getting stuck, but sometimes it would get stuck at the
time it jumped to. Resetting the system date would only fix the problem
for a short while (from an hour to only a couple of minutes).

Turning off xntpd didn't change anything, neither did rebooting the
system. On reboot, the "IDPROM contents invalid" message came up, so I
figured it was just a bad IDPROM. I made sure everything inside the
machine was physically good (RAM seated, drive and power connectors
tight, etc.). Just to make sure, I even tried swapping out the 1Gig
Kingston RAM in the machine with the original 128Meg Sun RAM that the
machine came with. Still the problem would occur within a few minutes
of booting up. Once or twice when I rebooted it, the system wouldn't
come up. The fans and hard drives would spin up, but the front panel
light wouldn't come on, even after waiting as long as 10+ minutes. At
the time, I just thought it was related to the bad IDPROM.

I ordered a new IDPROM chip from Mouser, which took a couple days to
deliver. As a temporary measure, I had a script that checked the system
time(), sleeping 5 seconds between checks. If the difference between
checks wasn't 5, I'd force an ntpdate against another system on the
local network. Not the prettiest solution, but it was very early Sunday
morning and I knew the new IDPROM wouldn't arrive until Tuesday.
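
A stopgap of that kind could look roughly like the Python sketch below. It is a reconstruction for illustration, not the original script: the server name, the function names and the one second tolerance are all assumptions, and ntpdate has to be run as root.

```python
import subprocess
import time

NTP_SERVER = "ntp.example.internal"  # hypothetical local time server


def clock_misbehaved(previous, current, interval=5, tolerance=1):
    """True if the clock did not advance by roughly `interval` seconds."""
    return abs((current - previous) - interval) > tolerance


def force_resync(server=NTP_SERVER):
    """Step the clock from another machine using ntpdate."""
    subprocess.run(["ntpdate", server], check=False)


def watch(interval=5, max_checks=None, now=time.time, sleep=time.sleep,
          resync=force_resync):
    """Poll the clock and resync on bad readings; return the resync count.

    now, sleep and resync are parameters so the loop can be exercised
    without touching the real clock. max_checks=None means run forever.
    """
    resyncs = 0
    checks = 0
    previous = now()
    while max_checks is None or checks < max_checks:
        sleep(interval)
        current = now()
        if clock_misbehaved(previous, current, interval):
            resync()
            resyncs += 1
            current = now()  # compare the next reading to the corrected time
        previous = current
        checks += 1
    return resyncs
```

Calling watch() with no arguments runs until it is killed. The tolerance is there because a loop like this never wakes up at exactly the same point in each second.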

When the new IDPROM arrived, I installed it and reprogrammed the hostid.
The clock seemed stable, and the system ran fine all that day. The next
morning the problem returned again, doing the same wild jumps and
getting stuck in a loop. I started going through the same things I had
tried before, but now after rebooting the system it wouldn't come up
again (fans and drives turning, but no light or activity). It would
take several attempts before finally booting. After another couple
system reboots, it refused to come up no matter how many times I tried
power cycling the system. I even tried it with the hard drives and
CD-ROM unplugged, just in case it might be the power supply.

That afternoon, we were able to get in touch with a company that
services Sun equipment, and we brought the machine down to them where
they replaced the motherboard. After swapping boards, the system booted
right up every time and the clock was stable. Nothing else was changed,
and no new patches were installed, and still with the original power
supply.

For the past month now, the system has been back up and running. I had
my script watching the system time, but there wasn't anything other than
the usual little bit of drift. I just reenabled xntpd a couple days ago
since everything is back to normal now.

As a footnote, the day after getting the system back, the tech who did
the replacement called to make sure everything was fine. He said he had
tried installing Solaris on the old motherboard (with RAM and drives
that he had there). It powered on at first, but locked up during the
install, and then wouldn't boot anymore after that.

I've only found one reference to this type of problem happening to
anyone else:
http://google.com/groups?threadm=b6s....kreonet.re.kr

Just to toss out an idea, but I know a couple years ago some PC
motherboard makers had some problems with bad capacitors that led to
systems crashing and not booting. When I had my system open I looked
the board over pretty closely and didn't see anything that looked
obvious (like warped or leaking caps). But maybe it's something
similar? It's pretty strange for 5 machines (your two, mine, and two in
the above thread) all running into the same odd clock looping problem,
especially within a couple months of each other.

Ivan Richwalski
 