Subject: Re: Server freakage today
Newsgroups: lugnet.admin.general, lugnet.announce
Followup-To: lugnet.admin.general
Date: Sat, 11 Nov 2000 01:40:44 GMT
  
In lugnet.admin.general, Todd Lehman writes:
> [...]
> I apologize for the inconvenience this has caused and may continue to cause
> until it is isolated and fixed.

Good news.  It turned out to be a bug in some of my own code.  It was hard to
find but easy to fix.  Technical details follow.

I was able to reproduce the symptoms of the problem by asking the webserver
for a bogus URL that confused it, for example:

   http://news.lugnet.com/loc/us/ca/sf//

When it starts displaying a news page, the code which draws the light blue
bands at the top examines the requested URL and splits it into component
links, for example:

   Local / United States / California / San Francisco /

The loop that does the splitting was a while() loop with a faulty exit
condition -- a regex which unwittingly assumed that the URL was always a
canonical URL (i.e., didn't have any funky stuff going on like multiple
slashes in a row).  The chopper went like this:

   $url =~ s{[^/]+/$}{};

That's fine 99.99999% of the time, but it has no effect if the string ends in
two or more slashes, so the URL never gets any shorter and the loop never
exits -- thus, it's a bug.  I corrected the substitution...

   $url =~ s{[^/]+/+$}{};

...and then retested to verify that it no longer hung the process.  Then, I
added a redirector so that goofy URLs like the one shown above would get
bounced, so that the underlying code never even sees the bad URL in the first
place.  So that's an extra layer of protection on top of a clean bug fix.
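
To make the difference between the two substitutions concrete, here is a minimal sketch in Python rather than Perl (the regex syntax is the same for these patterns). The names chop and canonicalize are made up for illustration, the sketch works on the path part of the URL only, and the redirector's actual rule isn't given above, so canonicalize is only a guess at the kind of normalization it might perform.

```python
import re

# The two patterns from the post, transliterated from Perl's s{...}{}.
BUGGY = re.compile(r"[^/]+/$")    # original: exactly one trailing slash
FIXED = re.compile(r"[^/]+/+$")   # corrected: one or more trailing slashes


def chop(pattern, path):
    """Remove the last path component, like  $url =~ s{...}{};  in Perl."""
    return pattern.sub("", path, count=1)


def canonicalize(path):
    """Hypothetical redirector rule: collapse runs of slashes into one.

    Assumption: applied to the path only, not to the "http://" prefix.
    """
    return re.sub(r"/{2,}", "/", path)


if __name__ == "__main__":
    for path in ("/loc/us/ca/sf/", "/loc/us/ca/sf//"):
        print(repr(path))
        print("  buggy chop ->", repr(chop(BUGGY, path)))
        print("  fixed chop ->", repr(chop(FIXED, path)))
        print("  canonical  ->", repr(canonicalize(path)))
```

With the canonical path, both patterns strip "sf/".  With the doubled slash, the original pattern matches nothing and hands the string back unchanged, while the corrected one strips "sf//".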

Now, why it took the system down:  Because the while() loop sat there running
and running and running, chewing up CPU cycles like crazy and starving other
processes, and in the meantime it was accumulating URL fragments in a list
(since it expected to then output these in reverse order once it had collected
them all).  That chewed up a few megabytes of memory per minute, and
eventually the webserver ran out of memory and the process was killed -- but
not before other processes had been starved (put on the waiting list and kept
from running) for long enough to cause problems.  The server slowly
decelerated to a halt.  It would be kind of like your car deciding to press
its own accelerator to the floor very gradually while downshifting to lower
and lower gears.
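
The runaway behavior can be reproduced safely with an iteration cap. The sketch below is a hypothetical reconstruction in Python, not the actual LUGNET code: the loop shape, the "/" exit test, and the names collect_fragments and max_iterations are all assumptions. The only parts taken from the description above are the two regexes and the fact that fragments pile up in a list to be output in reverse order.

```python
import re

BUGGY = re.compile(r"[^/]+/$")    # original chopper
FIXED = re.compile(r"[^/]+/+$")   # corrected chopper


def collect_fragments(path, pattern, max_iterations=1000):
    """Chop `path` down to "/" and collect each intermediate fragment.

    Returns (fragments, finished).  The cap stands in for the memory limit
    that eventually killed the real process; without it, a pattern that
    never matches would make this loop spin and grow the list forever.
    """
    fragments = []
    iterations = 0
    while path != "/":
        if iterations >= max_iterations:
            return fragments, False          # gave up: no progress
        fragments.append(path)               # list grows on every pass
        path = pattern.sub("", path, count=1)
        iterations += 1
    fragments.reverse()                      # shortest prefix first
    return fragments, True


if __name__ == "__main__":
    bad = "/loc/us/ca/sf//"
    frags, done = collect_fragments(bad, BUGGY)
    print("buggy:", len(frags), "fragments, finished =", done)
    frags, done = collect_fragments(bad, FIXED)
    print("fixed:", frags, "finished =", done)
```

With the original pattern, every pass appends the same unchanged string, so the list grows until the cap is hit.  With the corrected pattern, the loop finishes after four passes.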

According to my process logs, this explains quite well the two spikes that
happened earlier today as well as the one that happened on November 6 and
the one that happened on July 19.  In all cases, the server's load climbed
slowly to an almost unbelievably high level, leaving very few clues.  But
with 20-20 hindsight, all the clues make complete sense.  The strangest thing,
and the one I couldn't figure out, was why the httpd logs didn't contain a
record of any strange URLs.  It's because a request is logged only once it
completes, and the runaway process never completed, so it never wrote out its
final results (bytes sent and result code) to the logfile.

Well, there it is.  If you've gotten this far, I'm curious -- did anyone
notice today if they accidentally typed in a URL by hand and got it wrong?
Something like

   http://[anything].lugnet.com/[anything]//

That is, basically, a .lugnet.com URL with two or more slashes on the end...?
If so, approximately what time did you do that?  And did you hit 'Esc' or
'Stop' and try it again, or did you notice the error and remove the trailing
slash?  Did you try it again with the slash two hours later?  The first spike
today happened around 2:30pm EST, the second one around 4:30pm EST.  I'm
curious because I found three runaway processes all going at once around 4:35pm
EST today -- as if someone had invoked the same broken URL three times.  It
certainly slowed things down more quickly.  :-)

--Todd


