[pmwiki-users] Yahoo! Slurp is broken

Sebastian Pipping webmaster at hartwork.org
Thu Mar 16 05:26:08 CST 2006


Patrick R. Michaud wrote:
> I reanalyzed my logs for the past 15 days.  In the past
> 15 days, pmwiki.org has had 1.2 million hits.  Yahoo! Slurp
> accounts for 207,510 of them.
> 
> Beyond that, let's see what Slurp requested most.  First, it
> asked for /robots.txt 5,682 times.  Okay, I can deal
> with that.  But after that, the most frequently requested
> urls are:
> 
>     380     /
>     230     /wiki/PITS/PITS
>     182     /wiki/Cookbook/HomePage
>     178     /wiki/Cookbook/Cookbook?from=Cookbook.HomePage
>     166     /wiki/PmWiki/PmWiki
>     119     /wiki/PmWiki/SuccessStories
> 
> Does Slurp *really* need to be grabbing a copy of each of
> these pages an average of ten times per day?!?  
> 
> In fact, if we just ask the question "how many pages were
> downloaded by Yahoo! Slurp more than 15 times over the
> past 15 days?", it turns out that over 1600 pages were
> downloaded 15 times or more over the last 15 days.
> 
> I'm guessing (or rather, hoping) that Yahoo!'s spiders are 
> extremely sensitive to the existence of "Expires:" tags on
> pages.  I'm going to try adding Expires: tags to pages
> whenever Yahoo! requests them, just to see if I can reduce
> the onslaught.  If that doesn't work, then I may just
> start filtering Yahoo Slurp at the webserver level.
> 
> If anyone else has other insights, ideas, or suggestions,
> I'd be very interested in hearing them.

----------------------------------------------------------------
If Slurp does honor Expires headers, you could turn this
into a recipe and make the delay depend on the page.
Each page would store a "last Slurp visit" time and a
"minimum seconds between Slurp visits" value.
Once the page-specific delay has passed, the header
says "update me"; otherwise it says "I'm still the old page".
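
To make the per-page delay concrete, here is a minimal sketch of the
decision only. It is written in Python for illustration and is not
PmWiki code; the function name slurp_response, its parameters, and the
choice of "304 Not Modified" for the "still the old page" answer are my
assumptions. Whether Slurp actually respects any of this is untested.

```python
from email.utils import formatdate


def slurp_response(last_visit, min_interval, now):
    """Decide what to tell the crawler for one page.

    last_visit   -- Unix time of the last full fetch, or None if never fetched
    min_interval -- minimum seconds between full fetches for this page
    now          -- current Unix time

    Returns (status, headers, new_last_visit). All names are hypothetical.
    """
    if last_visit is not None and now - last_visit < min_interval:
        # Delay not yet exceeded: "I'm still the old page".
        next_allowed = last_visit + min_interval
        headers = {"Expires": formatdate(next_allowed, usegmt=True)}
        return 304, headers, last_visit
    # Delay exceeded (or first visit): "update me", and restart the clock.
    headers = {"Expires": formatdate(now + min_interval, usegmt=True)}
    return 200, headers, now
```

A page with min_interval=3600 that was last fetched 1000 seconds ago
would answer 304; the same page fetched 3600 or more seconds ago would
answer 200 and record the new visit time.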

Such a script would be most effective when called as
early as possible - maybe not too far after
"define('PmWiki',1);". This reminds me of another idea
closely related to this:

config.php is included quite late: a lot of code runs
before the config is read. For a Slurp script - and quite
likely for other scripts too - it might be useful to have an
additional early entry point for really urgent scripts,
perhaps a file "local/config_urgent.php" that is included
as early as possible.

Later this solution could be generalized into a function
"init( $stage )", with $stage being an int that is
incremented on each call. In the config you can then add
your code at the earliest point where the resources it needs
are available - under the right stage value in a switch statement.
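
A rough sketch of how such a staged hook might be called, again in
Python for illustration only. The stage numbers, their meanings, and
the run_engine helper are made up; nothing like this exists in PmWiki
as far as this proposal goes.

```python
def init(stage, log):
    """Site-specific hook: the engine calls it once per stage, in order."""
    if stage == 0:
        # Hypothetical stage 0: only the request itself is available.
        log.append("stage 0: crawler throttling")
    elif stage == 1:
        # Hypothetical stage 1: core variables are set up.
        log.append("stage 1: ordinary configuration")
    # Any other stage: nothing to do for this site.


def run_engine(stage_count=3):
    """Stand-in for the engine's startup sequence."""
    log = []
    for stage in range(stage_count):
        # (the engine would make this stage's resources available here)
        init(stage, log)
    return log
```

The point of the design is that urgent code runs under stage 0 and
everything else waits for a later stage, all from one config file.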


Sebastian


-- 
Sebastian Pipping
http://www.hartwork.org/



