[pmwiki-users] Google local site search

Wed Dec 28 10:00:15 CST 2005

Patrick R. Michaud wrote:
> On Wed, Dec 28, 2005 at 12:51:21AM +0100, Joachim Durchholz wrote:
> 
>>>Newer versions of PmWiki (since 2.1.beta8) automatically return 
>>>"403 Forbidden" errors to robots for any action other than 
>>>?action=browse, ?action=rss, or ?action=dc.
>>
>>Um... AFAIK Google punishes sites that are "polymorphic" when crawled by 
>>Google. (Dunno how they find out - maybe they send a crawler that looks 
>>just like a normal browser and samples some of the pages.
>
> I've done a bit of research on this, and according to several experts
> only sites that present egregiously different content are punished for
> polymorphism.  Supposedly minor changes to link targets to strip off
> things like query parameters aren't supposed to be punished.

OK, then that's settled.

>>I'm generally shy of doing pages differently depending
>>on who visits it - what if there's a bug in the code that does the
>>polymorphism? I'll never find out.)

I'd still like to hear your view on this point, because I think it matters. 
With various parties writing code that transforms different aspects of 
PmWiki, it could be difficult to test reliably whether Google & co. 
really see the pages we think they see. Imagine a bug that mangles 
the > in <a href=...> only when the page is served to Google: it would 
go unnoticed for a long time. Worse, few people have the tools and 
expertise to spot it.
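
One rough way to check would be to fetch the same URL twice, once with a 
browser-like User-Agent and once with a robot-like one, and diff the two 
results. A minimal sketch, assuming Python is at hand; the function names 
and both User-Agent strings are made up for illustration and are not taken 
from PmWiki:

```python
import difflib
import urllib.request

# Illustrative User-Agent strings (assumptions, not what any real client sends).
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"
ROBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def fetch(url, user_agent):
    """Fetch a page while sending the given User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def view_diff(browser_html, robot_html):
    """Return unified-diff lines between the two views ([] if identical)."""
    return list(difflib.unified_diff(
        browser_html.splitlines(),
        robot_html.splitlines(),
        fromfile="browser",
        tofile="robot",
        lineterm="",
    ))
```

Calling view_diff(fetch(url, BROWSER_UA), fetch(url, ROBOT_UA)) on a wiki 
page would then list every line that differs. It only shows what a spoofed 
User-Agent triggers, so it says nothing about what a real crawler gets if 
detection ever uses something other than that header.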

>>It might be a better idea to mark the ?action=edit etc. links as "don't 
>>follow by spiders". I.e.
>>  <a href="...?action=edit" rel="nofollow">...</a>
> 
> 1.  PmWiki determines what to strip based on $ScriptUrl -- in many cases
>     it doesn't have the full <a href='...'>...</a> tag immediately
>     available in order to add rel="nofollow" to it.

That's "just work".
More than worth the effort, maybe.

> And some <a> tags already have a rel= attribute.

The nofollow keyword can simply be appended. rel= takes a space-separated 
list of keywords, and browsers are advised to ignore entries that they 
don't know how to interpret. (That's probably why Google chose rel= as 
the place to put it.)
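
To illustrate the appending, here is a small sketch of the string handling 
only. The helper name add_rel_keyword is hypothetical, and it says nothing 
about where in PmWiki such a step would have to be hooked in:

```python
def add_rel_keyword(rel_value, keyword="nofollow"):
    """Append keyword to a space-separated rel value unless already present."""
    tokens = (rel_value or "").split()
    # Link types are case-insensitive, so compare in lower case
    # but leave the existing tokens untouched.
    if keyword.lower() not in (token.lower() for token in tokens):
        tokens.append(keyword)
    return " ".join(tokens)
```

With this, an existing rel="external" becomes rel="external nofollow", and 
a tag without any rel= value ends up with just "nofollow".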

> 2.  I'm not convinced that adding rel="nofollow" means that the
>     robot won't follow the link.  According to 
>     http://googleblog.blogspot.com/2005/01/preventing-comment-spam.html
>     and http://microformats.org/wiki/relnofollow, the rel="nofollow"
>     attribute simply means that the search engine shouldn't give the
>     link any credit when ranking sites in search results.  It doesn't
>     mean that the robot doesn't follow the link.
> 
>     Google does say at http://www.google.com/webmasters/bot.html
>     that placing rel="nofollow" will cause Googlebot to not follow
>     the link, but that doesn't mean that other robots have to follow suit.

Sure - but as you said below, it need not be a 100% sure thing.

I'd expect most search engines to honor "nofollow" by not following the 
link anyway.

First, because it's the easiest to program. Read literally, "nofollow" says 
"search engines: don't follow this link!", so they can simply ignore the 
<a...> tag. Otherwise, they'd have to add the link to the list of known 
pages but take care to exclude it from the ranking. (That would be a small 
complication, but I'm pretty sure search engine programs are complicated 
enough already that every little simplification gets applied where possible.)

Second, because most webmasters will interpret it that way, which means 
that most web search engines will follow suit :-)

Third, because Google is already doing it that way, which is yet 
another incentive for other search engines to do likewise.
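
The "simply ignore the tag" variant could look roughly like this on the 
crawler side. This is only my guess at the simplest possible approach, 
not a description of how any real search engine works; the class and 
function names are invented:

```python
from html.parser import HTMLParser


class FollowableLinks(HTMLParser):
    """Collect href values of <a> tags that do not carry rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # rel is a space-separated, case-insensitive keyword list.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if href and "nofollow" not in rel_tokens:
            self.links.append(href)


def followable_links(html):
    """Return the links a crawler that skips nofollow would queue."""
    parser = FollowableLinks()
    parser.feed(html)
    parser.close()
    return parser.links
```

A nofollow link never enters the list of known pages here, so there is 
nothing to exclude from the ranking later.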

>>How does PmWiki find out it's being accessed by a robot?
> 
> It just does a simple pattern match against the User-Agent HTTP header.  
> The point of robots.php isn't to absolutely detect and control every 
> possible robot, it's just to detect and manage the most popular ones
> and reduce the load on the site.

Um... which pattern? (I'd be hard pressed to come up with something 
useful - robots identify themselves with a bewildering array of keywords.)
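
For the sake of discussion, this is the kind of match I would guess at, 
combined with the 403 rule quoted at the top. The keyword list is purely 
an assumption on my part and is not taken from robots.php; only the three 
allowed actions come from this thread:

```python
import re

# Hypothetical keyword list; the real pattern in robots.php may differ.
ROBOT_PATTERN = re.compile(r"googlebot|slurp|msnbot|crawler|spider", re.IGNORECASE)

# Actions robots may still request, as described earlier in the thread.
ROBOT_ALLOWED_ACTIONS = {"browse", "rss", "dc"}


def is_robot(user_agent):
    """Return True if the User-Agent header matches a known robot keyword."""
    return bool(ROBOT_PATTERN.search(user_agent or ""))


def status_for(user_agent, action="browse"):
    """Return the HTTP status such a check would produce for a request."""
    if is_robot(user_agent) and action not in ROBOT_ALLOWED_ACTIONS:
        return 403
    return 200
```

Anything not matching the list is treated as a normal browser, so an 
unknown robot simply slips through, which fits the "doesn't have to be 
100%" goal.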

Regards,
Jo