[pmwiki-users] "nofollow" links missing on skins
Patrick R. Michaud
pmichaud at pobox.com
Fri Apr 7 10:23:34 CDT 2006
(This is a long post -- I decided to go ahead and explain in detail
here, and maybe we can see about transferring this post into an
appropriate page on pmwiki.org somewhere.)
On Fri, Apr 07, 2006 at 09:32:26AM -0400, Neil Herber wrote:
> At 2006-04-07 08:00 AM -0500, Patrick R. Michaud is rumored to have said:
> >It's worth noting that the anti-spam convention that first introduced
> >"nofollow" doesn't say anything about robots not following the links,
> >only that links with rel="nofollow" shouldn't be weighted in search
> My brain is a bit fuzzy this morning. Does PmWiki not deflect robot
> requests for these links? Or do I still need to have a robots.txt
> file? I know I am (at least) conflating nofollow and disallow behavior here.
Yes, there are a couple of different aspects to the problem that
have to be considered independently (and it's easy to confuse them).
The first aspect is advising robots what not to request,
and this is handled by things such as "nofollow" and robots.txt.
PmWiki, or any other web application for that matter, can never
absolutely control what a robot chooses to ''request''. All that
can be done is to ''advise'' robots that the site prefers that
certain urls not be requested (robots.txt) or that some links
not be followed ("nofollow").
Some robots are better behaved than others at following the
advice a site gives. More details about this are below.
The second aspect is deciding how to respond to a request that
a robot makes. Once a robot has made a request, advice such
as "nofollow" and robots.txt no longer matters for the purposes
of that request -- the robot has made the request whether we
advised it or not. Now the application (e.g., PmWiki) has to
decide how to respond.
When PmWiki detects that a request has come from a robot
(an interesting problem in its own right), PmWiki checks
the $RobotActions array to see if the requested action is
allowed to robots. By default the allowed robot actions
are "?action=browse", "?action=rss", and "?action=dc".
All other actions result in an immediate "403 Forbidden"
response back to the robot.
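The gate above can be sketched in a few lines. This is a hypothetical illustration (PmWiki itself is written in PHP and the real check lives in its request handling); the function name and structure here are invented, but the whitelist mirrors PmWiki's default $RobotActions:

```python
# Hypothetical sketch of PmWiki's robot-action gate (invented names;
# the real implementation is PHP). A request already identified as
# coming from a robot is allowed only for whitelisted actions.

ROBOT_ACTIONS = {"browse", "rss", "dc"}   # mirrors PmWiki's defaults

def respond_to_robot(action: str) -> int:
    """Return the HTTP status code sent to a robot requesting `action`."""
    if action in ROBOT_ACTIONS:
        return 200   # serve the page normally
    return 403       # "403 Forbidden": deflect the request immediately

print(respond_to_robot("browse"))  # 200
print(respond_to_robot("edit"))    # 403
```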
So, to quickly answer your question for this second aspect:
yes, PmWiki deflects robot requests for disallowed actions
by returning "403 Forbidden" responses to them. In fact,
if you check your webserver logs you may see a lot of
requests coming from search engine spiders that resulted
in "403" responses from PmWiki (a normal response would
be "200" or "304").
Of course, it would be better for all concerned (resource
and bandwidth utilization) if the robot simply didn't make
forbidden requests in the first place, and this is
where things like robots.txt and "nofollow" come into play.
These mechanisms provide a way to tell robots that they shouldn't
make certain requests.
There are three main standards dealing with search engine spider
behavior: the "Robots Exclusion Protocol" (robots.txt) ,
the "Robots META Tag" (<meta name='robots' .../>), and
Google Sitemaps. Since Google Sitemaps is specific to Google
(and Googlebot is very well behaved anyway), I'll skip it for now.
Essentially, these protocols provide a mechanism of advising
robots as to allowed/disallowed urls on either a site-wide
or per-page basis. Nearly all of the major robots and search-engine
spiders are very well behaved with respect to these standards.
Unfortunately, the robot standards are very "coarse" --
they can only be used to cover broad groups of urls or links.
(The rel="nofollow" attribute in links is not part of either
of the above standards, and I cover it separately below.)
At best we can use robots.txt to advise robots not to
request pages beginning with various prefix strings (but no
wildcards), or we can use <meta name='robots' content='nofollow' />
to tell a robot not to follow any links in the page it just
retrieved. But any finer resolution than that isn't possible
under the present standards.
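For concreteness, a robots.txt under the original standard looks like this; the paths are purely illustrative, and note that every Disallow line is a plain prefix match -- there is no way to write something like "Disallow: *?action=edit":

```
# Example robots.txt -- prefix matching only, no wildcards.
# (Paths here are illustrative; adjust to your own URL layout.)
User-agent: *
Disallow: /pmwiki/uploads/
Disallow: /local/
```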
For PmWiki, this means we cannot use the robots standards to
advise a robot "don't request pages with ?action=edit" or
"don't request any url that ends with 'RecentChanges'".
Thus, robots that are "well-behaved" with respect to the above
standards are still going to make requests that we would prefer
they avoid. So the best we can do is intercept the requests
as quickly as possible when they do occur and deny them before
they consume more resources than they already have. This
is what PmWiki does by default.
Now then, advising a robot what it may request, and advising a
robot what to do with the contents of a response we send back
are two different things. The best example of this is
<meta name='robots' content='noindex' />, which tells
search engine spiders that they should not index the contents
of the current page, but it's okay to follow the links on
the page to other resources (unless excluded by robots.txt).
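The distinction shows up directly in the tag's content attribute; for example (directives can also be combined):

```
<!-- don't index this page, but following its links is fine -->
<meta name='robots' content='noindex' />

<!-- neither index this page nor follow its links -->
<meta name='robots' content='noindex,nofollow' />
```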
And now we get to the major point of confusion surrounding
the use of "rel='nofollow'" attributes in links: Despite its
name, rel='nofollow' is ''not'' defined as a mechanism to tell
robots not to follow links.
The rel='nofollow' attribute was first introduced in January 2005
as a way to combat comment spam and wikispam that attempts to
increase search engine rankings. As defined by this protocol,
a link containing rel='nofollow' is an indication to a search
engine that the link should not get any credit when ranking
websites in search results.
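In markup, such a link looks like this (example.com is just a placeholder); per the definition, it only tells the search engine not to give the target any ranking credit -- it says nothing about whether a spider fetches the URL:

```
<a href="http://example.com/" rel="nofollow">some link</a>
```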
Notably, the definition doesn't mention anything about a
spider choosing to follow a rel='nofollow' link -- it just
says it shouldn't count for ranking purposes.
Admittedly, a search engine spider might choose to not follow
links marked with rel='nofollow', and this could be an easy
way of preventing such links from getting additional weight.
But at the moment I haven't seen anything "official" that
requires robots or search engine spiders not to follow
links with rel='nofollow'.
In fact, as of April 2006 the only robot/spider that I'm aware
of that doesn't follow rel='nofollow' links is Googlebot, and
that has been true only since July 2005. (Prior to then, Googlebot
followed such links.) From what I've seen in my logs, both msnbot
and Yahoo! Slurp continue to follow links that have been
marked with rel='nofollow'. Of course, this doesn't in any way
indicate that msnbot or Slurp are violating any protocols or
standards, because the standard doesn't say that rel='nofollow'
links shouldn't be followed -- the standard just says not
to use such links when weighting search results. So, msnbot
and Slurp are still technically "well-behaved" according to
the standards, even if they follow links marked with rel='nofollow'.
All of which gets back to how PmWiki handles things, which is
to deny robots' requests for inappropriate actions, and try as
much as possible to advise robots not to bother with such requests
in the first place. But, with the possible exception of
Googlebot, webserver log entries showing robot requests for
"?action=edit" and "?action=diff" probably don't indicate a
configuration problem or bug -- they just mean that the robot
exclusion standards aren't nearly as robust as we'd like them to be.
Hope this helps. :-)