search engine beware! Was: Re: [pmwiki-users] notice of current edit

Joachim Durchholz jo at durchholz.org
Fri Apr 15 09:03:43 CDT 2005


Radu wrote:
> Me, I find that file more of a pain than a solution. Mischievous search 
> engines or slurp machines (a la webzip) totally ignore these files, and 
> some may even use them to get at content that's deemed a bit 'private'

robots.txt isn't a privacy mechanism, it's a mechanism for directing web 
crawlers.

Say you have the same page in two forms (one with all bells and
whistles, another as a text-only, search-engine-friendly version): then
robots.txt is helpful. If you have private content, robots.txt won't
help you (only passwords will).
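For that duplicate-content case, a robots.txt along these lines does the
job (the paths here are made-up examples, not actual PmWiki paths):

```
User-agent: *
Disallow: /fancy/
```

Well-behaved crawlers then skip the bells-and-whistles copy and index
only the plain one; it's a hint to crawlers, not an access control.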

> Since no sane individual can see two different pages in the same second,

Then count me among the insane. (Well, that might be accurate actually 
*ggg*)

For example, sometimes I right-click large bundles of links for "open 
in new tab"; the click rate is about 2 or 3 per second.

> not to mention edit them, there is a way to differentiate between search 
> engines and actual wiki authors: log the timestamp of the previous 
> access from each IP. If it's smaller than a settable interval (default 
> 2s), then do not honor requests for edit.

Web crawlers have already countered that measure. wget, for example, 
has options to insert arbitrary delays between requests when in 
"suck-the-site" mode.
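Just to make the proposal concrete, the interval check could be sketched
like this (a minimal illustration, not actual PmWiki code; the in-memory
dictionary and the function name `allow_edit` are my own inventions):

```python
import time

# Hypothetical in-memory store of the last access time per IP.
# A real wiki would persist this somewhere (a small file, a DB table).
last_access = {}

MIN_INTERVAL = 2.0  # seconds; the "settable interval" from the proposal


def allow_edit(ip, now=None):
    """Return True if this IP waited at least MIN_INTERVAL since its last hit."""
    now = time.time() if now is None else now
    previous = last_access.get(ip)
    last_access[ip] = now
    # First-ever request from this IP is always allowed.
    return previous is None or (now - previous) >= MIN_INTERVAL
```

The catch, as above: a crawler invoked with something like
`wget --wait=3 --recursive ...` spaces its requests out past any such
threshold and sails right through.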

> For an even stronger
> deterrent, to save processor time when the wiki is supposed to be 
> hidden, we could also add an $Enable switch to keep from honoring ANY 
> request to fast-moving IPs.

I don't think the problem is serious enough to warrant exclusion of 
legitimate requests.

If you really want private areas, use passwords. If you want to exclude 
robots from saving pages, password-protect the edit link.

To prevent robots from accidentally saving pages, make sure the edit 
pages all have noindex,nofollow set in the robots meta tag. That's 
enough to keep the good guys from doing anything harmful; as for the 
bad guys - well, that's wiki spam, and it can be fought with passwords.
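Concretely, that's this tag in the head of every edit (and preview) page:

```
<meta name="robots" content="noindex,nofollow">
```

Compliant crawlers then neither index the page nor follow the links on
it, which keeps edit forms out of search results.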

Here's a feature request: if a user turns out to be a spammer, have a 
function that undoes all his edits and elides them from the page 
history, too. Also, give the history pages noindex,nofollow (otherwise 
a wiki spammer wouldn't mind having his spam removed - as long as it's 
reachable via the page history function, it's still a link farm and 
useful to him).

Regards,
Jo
