LinuxSA Mailing list archives
From: Alan Kennington <akenning@topology.org>
To : LinuxSA <linuxsa@linuxsa.org.au>
Date: Sun, 22 Jul 2001 10:54:25 +0930
Re: BIND 9.1.0 fails badly
On Sun, Jul 22, 2001 at 09:42:52AM +0930, michael wrote:
>
> Actually, check this out, I think it might suit your purpose nicely:
>
> http://open.digicomp.ch/gpl/sawdog/
>
> What is sawdog?
>
> Sawdog is a script which informs the sysops of mission critical servers in
> the case of a failure, like a sort of watchdog. The script executes a
> given set of Expect scripts, and if one of the Expect scripts fails, it
> sends an email or an SMS, or executes a command. You can probe for more
> than just reachability because the Expect scripts can check if the
> responses on the ports are correct.
>
Michael,
That sounds good. The documentation indicates that it is indeed
a simple script for checking status and sending e-mail notifications.
It's a little disappointing, though, to have to continually
be setting up watchdog things to get software to work correctly.
I spent a really disproportionate amount of time working out
how to create scripts to just keep my PPP link going.
The hard work there was writing a routine (in Perl) that used
some ad hoc pattern recognition to determine whether the PPP link
was in a bad state or not.
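To give an idea of the kind of check involved, here is a minimal sketch
(in Python rather than the Perl I actually used). The log messages in the
pattern lists and the function name link_state are assumptions chosen for
illustration; they are not the patterns from my real script, and a real
link would probably need more of them.

```python
import re

# Hypothetical patterns: illustrative pppd-style log messages only,
# not the ones matched by the original Perl script.
BAD_PATTERNS = [
    re.compile(r"LCP: timeout sending Config-Requests"),
    re.compile(r"Modem hangup"),
    re.compile(r"Connection terminated"),
]
GOOD_PATTERNS = [
    re.compile(r"local\s+IP address"),
]


def link_state(log_lines):
    """Classify the link from the most recent log line that matches a pattern.

    Returns "down", "up", or "unknown" (no line matched any pattern).
    """
    # Scan newest-first so that the latest recognisable event decides.
    for line in reversed(log_lines):
        if any(p.search(line) for p in BAD_PATTERNS):
            return "down"
        if any(p.search(line) for p in GOOD_PATTERNS):
            return "up"
    return "unknown"
```

A cron job could feed the tail of the PPP log to link_state and redial
when it returns "down". The "unknown" case is the awkward one, since it
covers exactly the failures nobody has seen before.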
I was kind of hoping that I wouldn't have to put watchdogs
on everything that can go wrong in the system.
After all, then I might need another watchdog to make sure
that the watchdog is running.
The difficult thing is identifying _repeatable_ error conditions
that I can write scripts for.
Right now, since total program failure (as in BIND just dying)
is quite rare, I intend to just get the latest version and
install that, in the hope that the bug is fixed.
More problematic is when bugs occur in a program which
is still running, and when that bug is not like anything
that I have ever seen before.
Writing an "expect" script for each kind of bug that is seen is
unlikely to be profitable in the long run, because most bugs
are intermittent, and tend to be fixed in some later release before
I see the bug more than once.
Total program failure is a relatively rare form of failure
in my experience.
Ultimately, the best solution for software bugs is for authors
not to include them in their source in the first place.
But current methodology seems to be along the lines of
"if it works for a week, release it - and if someone reports
a bug, fix it some time". This would not get a space probe
to Neptune, that's for sure.
It doesn't even let me take a rest in NZ for 2 weeks, because
I had to take a notebook PC with me to check on my system every
1-2 days. And the hardware and software bugs in the notebook,
running SuSE 7.1 with kernel 2.4.3, were a big drama too.
The PCMCIA slots were only recognized by cardmgr software and
pcmcia kernel modules half the time, and if I moved a modem
card from a slot and reinserted it, the whole machine hung and
had to be re-booted. This bug didn't happen with kernel 2.4.0,
but 2.4.0 has big ReiserFS bugs. So I couldn't win.
It shouldn't take 2 months to get one machine working correctly.
On my 10 machines so far, I think that the median time to achieve
acceptable service from a machine with Linux has been 2 months.
(With MS and the Mac, I just gave up trying before I achieved
acceptable service!)
About 30 years ago, probes were launched into a harsh environment
with no hope of applying any hands-on servicing or manual re-boot,
and they are still out there working beyond the solar system.
What has happened to the art of software development since then?!
Just thoughts in passing on a Sunday morning....
Cheers,
Alan Kennington.
--
LinuxSA WWW: http://www.linuxsa.org.au/ IRC: #linuxsa on irc.linux.org.au
To unsubscribe from the LinuxSA list:
mail linuxsa-request@linuxsa.org.au with "unsubscribe" as the subject