[pmwiki-users] Robustness of PmWiki

Patrick R. Michaud pmichaud at pobox.com
Sat Jun 24 11:58:09 CDT 2006


On Sat, Jun 24, 2006 at 04:49:16PM +0200, Martin Bayer wrote:
> On Saturday, 24 June 2006 at 05:54, Patrick R. Michaud wrote:
> > Thus, I can completely understand your admin's position on Slurp.
> > I can't think of any reason why a search engine's spiders need to be
> > requesting the same url several times per day.  In this respect it
> > does have a similar impact to a ddos attack.
> 
> I see what the problem is, but I cannot agree with the way that was done,
> i.e. without any announcement. [...]

I totally agree.  Keeping clients informed about potential service 
changes is paramount. 

> > Because of the heavy
> > toll of search engine spiders, PmWiki has a fair amount of robot control
> > built-in.  First, it doesn't return any "non-browse" pages to robots,
> > thus if a robot follows a link to things like "?action=edit" or
> > "?action=diff", PmWiki quickly returns a "403 Forbidden" to the robot, to
> > avoid any cost of generating the responses to those requests.
> 
> In MoinMoin you have a line saying
> <meta name="robots" content="index,nofollow">
> in all pages but FrontPage, RecentChanges, and TitleIndex, which, as you can
> imagine, doesn't really speed up indexing by search engines (not even by
> the "good" ones). 

The robots meta tag isn't designed to speed up indexing; it simply
indicates what should be indexed and what should not.  Even "nofollow"
doesn't do much good here -- *any* link (on any site) without a
corresponding nofollow means the robot is going to visit that 
particular link.  For large spiders such as Yahoo!'s Slurp and Googlebot,
a link w/o nofollow simply means that the link is added to the 
pool of links to be periodically scanned by the robots.  It may be
hours or days before it's actually retrieved, and once a link is
in the pool it may be retrieved at any time.  In other words, there's
not any particular order in which links are retrieved.

> Plus, it still causes useless load on TitleIndex,
> RecentChanges, and so on, because you will find many "action" links on
> these pages (diffs to versions, attachments, and so on).

That's why it's important for the engine to quickly recognize
such inappropriate actions and block them asap.  Even better is 
for the webserver to block them before passing control to the engine.
On many of my sites I have the following mod_rewrite rules in place:

    # block ?action= requests for these spiders
    RewriteCond %{QUERY_STRING} action=[^rb]
    RewriteCond %{HTTP_USER_AGENT} Googlebot [OR]
    RewriteCond %{HTTP_USER_AGENT} Slurp [OR]
    RewriteCond %{HTTP_USER_AGENT} msnbot [OR]
    RewriteCond %{HTTP_USER_AGENT} Teoma [OR]
    RewriteCond %{HTTP_USER_AGENT} ia_archive
    RewriteRule .* - [F,L]

This looks for any robot request containing "action=" in the query
string (except for "action=b[rowse]" and "action=r[ss|df]") and denies
such requests at the webserver level, which avoids the cost of even
initializing the wiki engine.
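
If changing the webserver configuration isn't an option, roughly the
same check can be done at the very top of the wiki script itself.
The following is only a sketch of the idea (hypothetical code, not
PmWiki's actual implementation); it just mirrors the user-agent list
and the b/r exception from the rewrite rules above:

    <?php
    // Hypothetical sketch only, not PmWiki's actual code: refuse
    // "non-browse" actions from known spiders before the engine
    // does any real work.
    $ua = isset($_SERVER['HTTP_USER_AGENT'])
          ? $_SERVER['HTTP_USER_AGENT'] : '';
    $action = isset($_GET['action']) ? $_GET['action'] : 'browse';

    // same user-agent list, same b[rowse]/r[ss|df] exception as above
    if (preg_match('/Googlebot|Slurp|msnbot|Teoma|ia_archive/i', $ua)
        && !preg_match('/^[br]/', $action)) {
      header('HTTP/1.1 403 Forbidden');
      exit('Robots may not perform this action.');
    }
    // ... normal engine startup continues here ...

Blocking at the webserver level is still cheaper, of course, since
this version has to start the PHP interpreter just to say no.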

> > Personally, I figure that if Yahoo! won't be a good Internet citizen 
> > and have some respect for the costs its spiders are incurring
> > on my site (and bandwidth that I have to pay for), I don't really 
> > care if my sites' content appears in its search engine.
> 
> Good point. On the other hand, this amounts to promoting a search engine
> monoculture.

Not at all -- there are lots of other good search engines available.
Yahoo! is just one player of many.

> > My response to Slurp has been to set robots.txt to give Slurp a
> > Crawl-delay of 60 seconds, simply to reduce the sheer volume of
> > requests.  
> 
> We had set a 'Crawl-delay' of 120 for 'Slurp'. Yet even with this it seemed
> that the wiki engine could not handle the load caused by that bot. Now
> 'Slurp' is set to 'Disallow: /', which I still believe is not a
> solution.

I have two suggestions here -- first, if it's at all possible,
block extraneous requests at the webserver level, similar to what
I did with mod_rewrite above.  That will reduce the load a lot.

Second, I'd go ahead and set a Crawl-delay of 600.  As I already
mentioned, Slurp tends to hit pages far more often than it needs to
(in some cases as much as 20 or 30 times per day).  A Crawl-delay of
600 might mean that pages are indexed once per week instead of once
per hour, but that's still frequent enough to keep things reasonably
up-to-date in Yahoo!'s search engine (and it prevents the site from
disappearing entirely from Yahoo!).
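
For reference, the corresponding robots.txt entry is just:

    User-agent: Slurp
    Crawl-delay: 600

Slurp reads the value as the number of seconds to wait between
successive requests, so 600 works out to at most 144 requests
per day.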

> > In short, given what I've seen of Slurp's behavior on the sites I run, I
> > think that webserver administrators are entirely justified in
> > severely restricting or denying Slurp's access to the webserver.
> > Perhaps if enough web administrators start complaining about
> > Slurp then Yahoo! will fix their broken bot.
> 
> In the meantime, the project could die by unpopularity. 

Somehow I don't think that being absent from a single search engine
(with the possible exception of Google) is sufficient to cause
a project to die by unpopularity.  Thus far in June 2006, pmwiki.org
has received 1,575 search requests from Yahoo! and 9,825 search
requests from Google.  So, far more people use Google to find
pmwiki.org than Yahoo!  I don't think that a 20% drop in search
requests from a single source would be sufficient to cause an
otherwise good project to completely fail -- there are likely other
factors at work.

[One can reasonably claim a circular reasoning flaw in my search
statistics above, since pmwiki.org is using Crawl-delay to reduce 
Slurp's access to the site but not Googlebot's, and therefore pmwiki.org
pages may be less likely to appear in Yahoo! search results and
there would be fewer incoming requests.  However, countering this
argument is the fact that even with the higher Crawl-delay value
Slurp generates more requests to pmwiki.org than Googlebot.  Also, 
other (non-wiki) sites that I maintain seem to show a 5-to-1
ratio of search requests coming from Google versus Yahoo!.]

> That already
> happened to our sister project linuxwiki.de, now replaced by a competing
> project, linwiki.de (which is using MediaWiki, BTW). I don't think this
> should ever happen, because it goes against what I believe the wiki idea
> means. Thus, I'm thinking about migrating to another wiki engine
> (and that's why I started this thread).

I think you'd be very happy with PmWiki.  Of course, I'm somewhat
biased.  :-)  I know quite a few people who started with MoinMoin 
(especially because they prefer Python) but ultimately switched 
to PmWiki because it is a lot easier to work with.

Pm



