Subject: Re: unscheduled outage
Newsgroups: lugnet.admin.general
Date: Wed, 6 Dec 2000 23:53:49 GMT
In lugnet.admin.general, Todd Lehman writes:
> lugnet.com experienced an unscheduled service outage today which, according
> to the logs, lasted for approximately 68 minutes from 15:02 EST to 16:10 EST.
> I'll have to scour the logs later to hopefully get to the bottom of what
> caused the problem.  I have a hunch that the system ran out of memory and
> will need an upgrade.  It's due for that soon anyway.  More info when/if I
> know more.

The culprit seems to be a database table that got horribly fragmented on disk
over the natural course of activities in the past year.  However, that in and
of itself wasn't the problem.  The problem was the way the table was being
periodically cleaned of obsolete entries.  It was being checked way too
often and way too comprehensively.  It was the snapshot script that runs
every 5 minutes for building the news traffic page.  Whenever the snapshot
script runs, it adds a new entry and deletes the oldest entry (it should be
exactly one week old).  But the script also looks to see if there is any
previous entry before that (i.e., exactly one week and 5 minutes old).  If
everything is running OK, it shouldn't find one.  But if it does find one,
then it knows that something strange happened (say, the system was down or
overloaded for 10 minutes or whatever) and it proceeds to clean the table of
all old entries via a table sweep.  A table sweep on this particular data is
extremely inefficient when the data is highly fragmented -- it was taking
more than 5 minutes, which is so long that it resulted in another gap in the
snapshot, which then caused another sweep, and so on and so forth.  The
system was hitting the disk constantly and causing all of its work to run
very slowly.  Processes built up and eventually there was a meltdown.
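
To make the feedback loop described above concrete, here is a minimal
sketch in Python.  It is not the actual snapshot script; the function name
`simulate` and its parameters are hypothetical, and the only figure taken
from the post is the 5-minute snapshot interval.  The model assumes that a
sweep which takes longer than one interval always causes the next snapshot
to be missed.

```python
SNAPSHOT_INTERVAL = 5  # minutes between snapshot runs (from the post)


def simulate(sweep_minutes, initial_gap=True, runs=6):
    """Return, for each run, whether the script fell back to a table sweep.

    sweep_minutes: how long one full table sweep takes (hypothetical input).
    initial_gap:   whether an anomaly (a missed snapshot) exists at the start.
    """
    swept = []
    gap = initial_gap
    for _ in range(runs):
        if gap:
            # An entry older than expected was found, so sweep the table.
            swept.append(True)
            # A sweep that outlasts the interval makes the next snapshot
            # late, which leaves a new gap for the following run to find.
            gap = sweep_minutes > SNAPSHOT_INTERVAL
        else:
            # Normal case: add one entry, delete one entry, no sweep.
            swept.append(False)
    return swept


if __name__ == "__main__":
    print(simulate(sweep_minutes=2))  # one sweep, then back to normal
    print(simulate(sweep_minutes=7))  # every run sweeps: the loop never ends
```

With a fast sweep the system recovers after one run; once the sweep takes
longer than the interval, every run triggers another sweep.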

I've changed the clean-up algorithm so that it doesn't go into anal-clean-
sweep mode when it finds anomalies.  (It still does some very simple and
constrained removals -- it has to do that -- but no table sweeps.)  I'll
need to add a low-priority process that runs once per day or once per week
to purge the table of obsolete gunk.  Something that automatically sends
email to my pager would be a good idea too.
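
As an illustration of the revised approach, the sketch below separates a
small, bounded removal on every snapshot run from a full purge meant to run
rarely.  Everything specific here is an assumption rather than the real
LUGNET code: the use of SQLite, the table name `snapshots`, the column
`ts`, the function names, and the `max_removals` limit of 2.  The pager
email is not shown.

```python
import sqlite3

WEEK_SECONDS = 7 * 24 * 60 * 60


def make_table():
    """Create a hypothetical in-memory snapshot table keyed by timestamp."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE snapshots (ts INTEGER PRIMARY KEY)")
    return conn


def record_snapshot(conn, now, max_removals=2):
    """Add the new entry, then delete at most max_removals expired entries.

    The bound keeps the cost of each run small even after an anomaly;
    no full table sweep happens here.  Returns the number of rows deleted.
    """
    conn.execute("INSERT INTO snapshots (ts) VALUES (?)", (now,))
    cur = conn.execute(
        "DELETE FROM snapshots WHERE ts IN ("
        " SELECT ts FROM snapshots WHERE ts <= ? ORDER BY ts LIMIT ?)",
        (now - WEEK_SECONDS, max_removals),
    )
    conn.commit()
    return cur.rowcount


def purge_expired(conn, now):
    """Full clean-up, intended for a low-priority daily or weekly job."""
    cur = conn.execute(
        "DELETE FROM snapshots WHERE ts <= ?", (now - WEEK_SECONDS,)
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = make_table()
    now = 10 * WEEK_SECONDS
    # Five leftover entries, all far older than one week.
    conn.executemany("INSERT INTO snapshots (ts) VALUES (?)",
                     [(t,) for t in range(1, 6)])
    print(record_snapshot(conn, now))  # bounded removal on a normal run
    print(purge_expired(conn, now))    # the rest goes in the periodic purge
```

The per-run cost stays bounded no matter how many stale entries pile up,
so a backlog can no longer delay the next snapshot.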

I haven't seen the system purring along so happily and quietly (as it is now)
in quite some time.

--Todd

p.s.  Thanks Tim & Matt


