[Haskell-cafe] haskell wiki indexing

Robin Green greenrd at greenrd.org
Tue May 22 10:30:15 EDT 2007


On Tue, 22 May 2007 15:05:48 +0100
Duncan Coutts <duncan.coutts at worc.ox.ac.uk> wrote:

> On Tue, 2007-05-22 at 14:40 +0100, Claus Reinke wrote:
> 
> > So the situation for mailing lists and online docs seems to have
> > improved, but there is still the wiki indexing/rogue-bot issue,
> > plus lots of fine-tuning (together with watching the logs to spot
> > any issues arising from relaxing those restrictions). Perhaps
> > someone on this list would be willing to volunteer to look into
> > those robots/indexing issues on haskell.org?-)
> 
> The main problem, and the reason for the original (temporary!) measure
> was bots indexing all possible diffs between old versions of wiki
> pages. URLs like:
> 
> http://haskell.org/haskellwiki/?title=Quicksort&diff=9608&oldid=9607
> 
> For pages with long histories this O(n^2) number of requests gets
> quite large, and the wiki engine does not seem well optimised for
> generating arbitrary diffs. So we ended up with bots holding open
> many HTTP server connections. They were not actually causing much
> server CPU load or generating much traffic, but once the number of
> nearly hung connections reached the HTTP child-process limit we were
> effectively in a DoS situation.
> 
> So if we can ban bots from the page histories, or turn the history
> pages off for bot user agents, then we might have a cure. Perhaps we
> just need to upgrade our MediaWiki software, or find out how other
> sites using it deal with the same issue of bots reading page
> histories.

http://en.wikipedia.org/robots.txt

Wikipedia uses URLs starting with /w/ for "dynamic" pages (well, all
pages are dynamic in a sense, but you know what I mean, I hope) and
then puts /w/ in robots.txt.
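
(Editorial sketch, not from the original mail: combining the two suggestions, a hypothetical robots.txt for haskell.org might look like the following. The path prefix is assumed from the diff URL quoted above, not checked against the live site; robots.txt rules are prefix matches, so blocking the query-string entry point catches every diff/oldid URL while leaving the plain /haskellwiki/PageName URLs crawlable.)

```
# Hypothetical robots.txt sketch for haskell.org (paths assumed,
# not confirmed against the actual server layout).
User-agent: *
# Block the script entry point that serves diffs and old revisions;
# prefix matching means this catches all ?title=...&diff=... URLs.
Disallow: /haskellwiki/?
```

Wikipedia's variant of the same idea is simply `Disallow: /w/`, which works because all of its dynamic index.php requests are routed under that prefix while readable article URLs live under /wiki/.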
-- 
Robin

