[pmwiki-users] Yahoo! Slurp is broken (was: pmwiki.org performance)

Patrick R. Michaud pmichaud at pobox.com
Wed Mar 15 14:16:13 CST 2006

On Wed, Mar 15, 2006 at 01:45:51PM -0600, Patrick R. Michaud wrote:
> - The site is being hammered by search engine spiders.
>   Here are stats for March 10th to the present:
>   Total number of hits to pmwiki.org :  457,339  (100.0%)
>   Hits coming from Slurp (Yahoo!)    :   84,008  ( 18.4%)

Some more details:

I decided to analyze the statistics for Yahoo!'s spider (called
"Slurp") a bit further, and I've come to an important conclusion:

    Yahoo! Slurp is STUPID.
    Yahoo! Slurp is STUPID.
    Yahoo! Slurp is STUPID.

I reanalyzed my logs for the past 15 days.  Over that period,
pmwiki.org received 1.2 million hits, and Yahoo! Slurp
accounts for 207,510 of them.

Beyond that, let's see what Slurp requested most.  First, it
asked for /robots.txt 5,682 times.  Okay, I can deal
with that.  But after that, the most frequently requested
URLs are:

    380     /
    230     /wiki/PITS/PITS
    182     /wiki/Cookbook/HomePage
    178     /wiki/Cookbook/Cookbook?from=Cookbook.HomePage
    166     /wiki/PmWiki/PmWiki
    119     /wiki/PmWiki/SuccessStories

Does Slurp *really* need to be grabbing a copy of each of
these pages an average of ten times per day?!?  

In fact, if we just ask the question "how many pages were
downloaded by Yahoo! Slurp 15 times or more over the
past 15 days?", it turns out the answer is over 1,600 pages.
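For anyone who wants to run the same tally on their own logs, a
count like this only takes a few lines.  Here's a rough Python sketch;
it assumes an Apache "combined"-format access log, and the filename and
the exact Slurp User-Agent substring are my assumptions, not anything
specific to pmwiki.org's setup:

```python
import re
from collections import Counter

# Matches the request line inside a combined-format log entry,
# capturing the requested path.
LOG_LINE = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*"')

def slurp_counts(lines):
    """Count how many times Yahoo! Slurp fetched each URL."""
    counts = Counter()
    for line in lines:
        # Slurp identifies itself with "Yahoo! Slurp" in its User-Agent.
        if "Yahoo! Slurp" not in line:
            continue
        m = LOG_LINE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Pages Slurp grabbed 15 times or more (log filename is hypothetical):
# hot = [(url, n) for url, n in
#        slurp_counts(open("access.log")).items() if n >= 15]
```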

I'm guessing (or rather, hoping) that Yahoo!'s spiders are 
extremely sensitive to the existence of "Expires:" tags on
pages.  I'm going to try adding Expires: tags to pages
whenever Yahoo! requests them, just to see if I can reduce
the onslaught.  If that doesn't work, then I may just
start filtering Yahoo! Slurp at the webserver level.
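The header trick itself is simple.  As a minimal sketch (this is not
PmWiki's actual code, and the one-week lifetime is just an arbitrary
choice for illustration): when the requesting User-Agent looks like
Slurp, emit an "Expires:" header dated some days in the future, so a
well-behaved spider knows the page won't change before then.

```python
import time
from wsgiref.handlers import format_date_time  # RFC 1123 date formatter

def cache_headers(user_agent, days=7):
    """Return extra response headers for the given User-Agent.

    If the client is Yahoo! Slurp, add an Expires header `days` days
    in the future; otherwise add nothing.
    """
    headers = []
    if "Yahoo! Slurp" in (user_agent or ""):
        expires = time.time() + days * 86400
        headers.append(("Expires", format_date_time(expires)))
    return headers
```

Ordinary browsers get no Expires header, so regular visitors still see
fresh pages; only the spider is told to back off.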

If anyone else has other insights, ideas, or suggestions,
I'd be very interested in hearing them.
