[pmwiki-users] Crawling through

Patrick R. Michaud pmichaud at pobox.com
Thu Dec 14 21:26:05 CST 2006


On Thu, Dec 14, 2006 at 06:48:13PM -0800, pmwiki at 911networks.com wrote:
> I am trying to make a sitemap to submit to google & yahoo. 
> I used the free website: xml-sitemaps.com.
> 
> I have configured:
> 
> $SearchPatterns['default'][] = '!\\.(All)?Recent(Changes|Uploads|Comments)$!';
> $SearchPatterns['default'][] = '!\\.Group(Print)?(Header|Footer|Attributes)$!';
> $SearchPatterns['default'][] = '!\\.(Left|Right|Side)(Bar|Menu|Note)$!';
> $SearchPatterns['default'][] = '!^Site\\.!';
> $SearchPatterns['default'][] = '!^PmWiki\\.!';

Changing the value of $SearchPatterns won't affect crawling at
all -- it only affects the results of the (:searchresults:) and
(:pagelist:) directives within PmWiki.

> # disable the page history [globally]
> if ($action == 'diff') $action='browse';
> # so that only registered users can see the source
> $HandleAuth['diff'] = $HandleAuth['source'] = 'edit';
> 
> upload is also disabled in local/config.php
> 
> The problem is that xml-sitemaps.com included all the Site files, 
> the PmWiki files, the footers, the headers as separate files. 
> It also indexed the same pages as: edit, diff, source, upload

xml-sitemaps.com will follow any links it happens to find on a
site.  What's worse, xml-sitemaps.com doesn't seem to honor
any of the robot meta tags -- it includes pages (such as ?action=edit)
that clearly indicate that they're not to be indexed by robots
and that the links on those pages shouldn't be followed.  To me 
that's a fairly serious bug in xml-sitemaps.com's spider.

(I note that the 2.2 version of the XML Sitemaps Generator
software apparently honors the robots meta tags, but the 1.0 
version used by the online interface doesn't seem to do that.)
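To make the expected behavior concrete, here is a minimal Python sketch
of how a well-behaved spider might read the robots meta tag before
indexing a page or following its links.  The names RobotsMetaParser and
robots_policy are hypothetical, and the sample markup is an assumed
example of a "noindex,nofollow" tag, not a copy of what any particular
page emits.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags (<meta ... />) are routed here as well.
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() == "robots":
            for item in attr.get("content", "").split(","):
                item = item.strip().lower()
                if item:
                    self.directives.add(item)


def robots_policy(html):
    """Return (may_index, may_follow) for one HTML page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    # "none" is shorthand for "noindex,nofollow".
    may_index = not ({"noindex", "none"} & found)
    may_follow = not ({"nofollow", "none"} & found)
    return may_index, may_follow


# Assumed sample markup for a page that asks robots to stay away.
sample = "<html><head><meta name='robots' content='noindex,nofollow' /></head></html>"
may_index, may_follow = robots_policy(sample)
```

A spider would leave the page out of the sitemap when may_index is
false, and would not queue the page's links when may_follow is false.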

I'm not quite sure what to offer, other than to suggest
downloading a later version of the XML Sitemaps Generator
and trying that.  

I may be able to come up with a recipe to generate XML
sitemaps based on PmWiki pagelists, but I don't know how long
that would take or how useful it'd be.  I do know that
search engines, especially Googlebot, are completely 
hammering my site with tons of unnecessary hits, so
something like this is growing in importance.
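As a rough illustration of the pagelist idea (not an actual PmWiki
recipe), the Python sketch below builds a sitemap from a plain list of
page names, skipping the same kinds of pages as the exclusion patterns
quoted above.  The function name build_sitemap, the base URL, and the
"?n=Group.Name" URL form are assumptions made for the example; a real
site may use different URLs.

```python
import re
from xml.sax.saxutils import escape

# Same exclusions as the quoted $SearchPatterns, written as Python regexes.
EXCLUDE_PATTERNS = [
    r"\.(All)?Recent(Changes|Uploads|Comments)$",
    r"\.Group(Print)?(Header|Footer|Attributes)$",
    r"\.(Left|Right|Side)(Bar|Menu|Note)$",
    r"^Site\.",
    r"^PmWiki\.",
]


def build_sitemap(page_names, base_url):
    """Return sitemap XML for the page names that pass the exclusions."""
    keep = [
        name for name in page_names
        if not any(re.search(pat, name) for pat in EXCLUDE_PATTERNS)
    ]
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for name in sorted(keep):
        # Assumed URL form; adjust to match how the site links its pages.
        url = base_url + "?n=" + name
        lines.append("  <url><loc>%s</loc></url>" % escape(url))
    lines.append("</urlset>")
    return "\n".join(lines)


# Hypothetical page names and base URL, for illustration only.
pages = ["Main.HomePage", "Site.SideBar", "PmWiki.Links",
         "Main.RecentChanges", "Main.GroupHeader", "Docs.Intro"]
sitemap_xml = build_sitemap(pages, "http://www.example.com/pmwiki.php")
```

Because the list contains only page names, action URLs such as
?action=edit or ?action=diff never enter the sitemap in the first place.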

Pm



