[pmwiki-users] Yahoo! Slurp is broken
Sebastian Pipping
webmaster at hartwork.org
Thu Mar 16 05:26:08 CST 2006
Patrick R. Michaud wrote:
> I reanalyzed my logs for the past 15 days. In the past
> 15 days, pmwiki.org has had 1.2 million hits. Yahoo! Slurp
> accounts for 207,510 of them.
>
> Beyond that, let's see what Slurp requested most. First, it
> asked for /robots.txt 5,682 times. Okay, I can deal
> with that. But after that, the most frequently requested
> urls are:
>
> 380 /
> 230 /wiki/PITS/PITS
> 182 /wiki/Cookbook/HomePage
> 178 /wiki/Cookbook/Cookbook?from=Cookbook.HomePage
> 166 /wiki/PmWiki/PmWiki
> 119 /wiki/PmWiki/SuccessStories
>
> Does Slurp *really* need to be grabbing a copy of each of
> these pages an average of ten times per day?!?
>
> In fact, if we just ask the question "how many pages were
> downloaded by Yahoo! Slurp more than 15 times over the
> past 15 days?", it turns out that over 1600 pages were
> downloaded 15 times or more over the last 15 days.
>
> I'm guessing (or rather, hoping) that Yahoo!'s spiders are
> extremely sensitive to the existence of "Expires:" tags on
> pages. I'm going to try adding Expires: tags to pages
> whenever Yahoo! requests them, just to see if I can reduce
> the onslaught. If that doesn't work, then I may just
> start filtering Yahoo Slurp at the webserver level.
>
> If anyone else has other insights, ideas, or suggestions,
> I'd be very interested in hearing them.
----------------------------------------------------------------
In case Slurp does learn from Expires headers, you could
make it a recipe and make the delay depend on the page.
Each page would have a "last Slurp visit" and a
"minimum seconds between Slurp visits" value.
When the page-specific delay is exceeded, the header
says "update me"; otherwise it says "I'm still the old page".
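To make the idea concrete, here is a rough sketch of the per-page decision, written in Python only to show the logic (a real recipe would be PHP inside PmWiki). The function name, its parameters, and the choice of an Expires value are my assumptions, and whether Slurp actually honors the header is still the open question.

```python
from email.utils import formatdate

def slurp_headers(last_visit, min_interval, now):
    """Decide what to tell Slurp for one page (hypothetical helper).

    last_visit   -- Unix time of the previous Slurp request for this page
    min_interval -- minimum seconds between Slurp visits for this page
    now          -- Unix time of the current request
    Returns (refresh, headers): refresh is True for "update me" and
    False for "I'm still the old page".
    """
    next_allowed = last_visit + min_interval
    if now >= next_allowed:
        # Delay exceeded: hand out the page and ask the crawler to stay
        # away for another full interval.
        return True, {"Expires": formatdate(now + min_interval, usegmt=True)}
    # Too early: the copy Slurp already has stays valid until next_allowed.
    return False, {"Expires": formatdate(next_allowed, usegmt=True)}
```

A busy page such as the home page could get a short interval and a rarely edited one a long interval; on a "refresh" answer the recipe would also store `now` as the new last visit.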
Such a script would be most effective when called as
soon as possible - maybe not too far after
"define('PmWiki',1);". This reminds me of another thing
I thought about that is closely related to this:
config.php is included quite late: there is a lot of code
before the config is read. For a Slurp script - and I'm quite
sure for other scripts too - it might be useful to have an
additional early jump-in point for really urgent scripts,
maybe together with a file "local/config_urgent.php" which
is included as early as possible.
Later this solution could be changed to a function
"init( $stage )", with $stage being an int which is
incremented on each call. In the config you can then add
your code as soon as the resources it needs are available -
at the right stage value in a switch statement.
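Roughly what I have in mind, again sketched in Python rather than PHP just to show the control flow. The stage names, the number of stages, and the boot() driver are all made up for illustration; nothing like this exists in PmWiki today.

```python
# Hypothetical stage numbers; the core would define the real ones.
STAGE_EARLY, STAGE_CONFIG, STAGE_READY = range(3)

def init(stage, log):
    """User-supplied hook, called by the core once per stage."""
    if stage == STAGE_EARLY:
        # Almost nothing is loaded yet: good place for urgent scripts.
        log.append("slurp throttle")
    elif stage == STAGE_READY:
        # Everything is set up: code needing all resources goes here.
        log.append("recipes needing full setup")
    # Stages without an entry are simply skipped.

def boot():
    """Stand-in for the core startup: call init() with a rising stage."""
    log = []
    for stage in (STAGE_EARLY, STAGE_CONFIG, STAGE_READY):
        init(stage, log)
    return log
```

Calling boot() runs the urgent code first and the rest once everything is available, all from one config function.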
Sebastian
--
Sebastian Pipping
http://www.hartwork.org/