Subject: Re: unscheduled outage
Newsgroups: lugnet.admin.general
Date: Wed, 6 Dec 2000 23:53:49 GMT
Viewed: 436 times

In lugnet.admin.general, Todd Lehman writes:
> lugnet.com experienced an unscheduled service outage today which, according
> to the logs, lasted for approximately 68 minutes from 15:02 EST to 16:10 EST.
> I'll have to scour the logs later to hopefully get to the bottom of what
> caused the problem. I have a hunch that the system ran out of memory and
> will need an upgrade. It's due for that soon anyway. More info when/if I
> know more.
The culprit seems to be a database table that got horribly fragmented on disk
over the natural course of activities in the past year. However, that in and
of itself wasn't the problem. The problem was the way the table was being
periodically cleaned of obsolete entries. It was being checked way too
often and way too comprehensively. The cleaning is done by the snapshot script
that runs every 5 minutes for building the news traffic page. Whenever the snapshot
script runs, it adds a new entry and deletes the oldest entry (it should be
exactly one week old). But the script also looks to see if there is an
even older entry before that (i.e., exactly one week and 5 minutes old). If
everything is running OK, it shouldn't find one. But if it does find one,
then it knows that something strange happened (say, the system was down or
overloaded for 10 minutes or whatever) and it proceeds to clean the table of
all old entries via a table sweep. A table sweep on this particular data is
extremely inefficient when the data is highly fragmented -- it was taking
more than 5 minutes, which is so long that it resulted in another gap in the
snapshot, which then caused another sweep, and so on and so forth. The
system was hitting the disk constantly and causing all of its work to run
very slowly. Processes built up and eventually there was a meltdown.
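The feedback loop described above can be illustrated with a toy model. This is only a sketch under assumptions, not the actual script: the table is modelled as a Python set of 5-minute slot numbers, a sweep is assumed to block the script for `sweep_minutes`, and the names `simulate`, `sweep_minutes`, and `runs` are hypothetical.

```python
INTERVAL_MIN = 5                               # snapshot period, from the post
SLOTS_PER_WEEK = 7 * 24 * 60 // INTERVAL_MIN   # 2016 five-minute slots per week


def simulate(sweep_minutes, runs=12):
    """Count the sweeps triggered over `runs` scheduled slots after one missed run.

    Toy model (assumption): a sweep keeps the script busy for `sweep_minutes`,
    and any snapshot scheduled while it is still busy is skipped.
    """
    week = SLOTS_PER_WEEK
    # The run at slot `week` was missed, so entry 0 was never deleted.
    entries = set(range(0, week))
    busy_until = 0.0
    sweeps = 0
    for slot in range(week + 1, week + 1 + runs):
        if slot < busy_until:
            continue                      # previous sweep still running: gap in the snapshots
        entries.add(slot)                 # add the new entry
        entries.discard(slot - week)      # delete the entry exactly one week old
        if (slot - week - 1) in entries:  # anything older still there? then something went wrong
            # Full table sweep: drop everything that is a week old or older.
            entries = {e for e in entries if e > slot - week}
            sweeps += 1
            busy_until = slot + sweep_minutes / INTERVAL_MIN
    return sweeps


if __name__ == "__main__":
    print("2-minute sweep:", simulate(2), "sweep(s)")
    print("6-minute sweep:", simulate(6), "sweep(s)")
```

In this model a 2-minute sweep happens once and the system recovers, whereas a 6-minute sweep causes the next snapshot to be skipped, which leaves another stale entry behind and triggers a sweep again on every other slot.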
I've changed the clean-up algorithm so that it doesn't go into anal-clean-
sweep mode when it finds anomalies. (It still does some very simple and
constrained removals -- it has to do that -- but no table sweeps.) I'll
need to add a low-priority process that runs once per day or once per week
to purge the table of obsolete gunk. Something that automatically sends
email to my pager would be a good idea too.
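One way the revised scheme might look, again as a hedged sketch rather than the real code: the 5-minute job removes only a small, fixed number of specific old entries, and a separate low-priority job does the full purge. The function names, the set-of-slots table model, and the `max_removals` limit are all assumptions made for illustration.

```python
SLOTS_PER_WEEK = 7 * 24 * 60 // 5   # 2016 five-minute slots per week


def snapshot_cleanup(entries, slot, week=SLOTS_PER_WEEK, max_removals=3):
    """5-minute job: add the new entry and remove a few specific old ones.

    It looks up at most `max_removals` known keys (hypothetical limit) and
    never scans the whole table, so its cost stays bounded even after a gap.
    """
    entries.add(slot)
    removed = 0
    for old in range(slot - week, slot - week - max_removals, -1):
        if old in entries:
            entries.discard(old)
            removed += 1
    return removed


def periodic_purge(entries, slot, week=SLOTS_PER_WEEK):
    """Low-priority daily or weekly job: the only place a full pass happens."""
    stale = {e for e in entries if e <= slot - week}
    entries.difference_update(stale)
    return len(stale)


if __name__ == "__main__":
    table = set(range(0, SLOTS_PER_WEEK))   # a week of history
    # Pretend the system was down for a while: the next run is at slot 2030.
    print("removed by snapshot job:", snapshot_cleanup(table, 2030))
    print("left for the purge job:", periodic_purge(table, 2030))
```

The point of the split is that a gap in the snapshots can no longer make the 5-minute job slower; whatever it leaves behind simply waits for the next purge.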
I haven't seen the system purring along so happily and quietly (as it is now)
in quite some time.
--Todd
p.s. Thanks Tim & Matt