[pmwiki-users] Being pounded

Tue Dec 19 16:28:47 CST 2006

On Tue, Dec 19, 2006 at 01:48:56PM -0800, pmwiki at 911networks.com wrote:
> On Tue, 19 Dec 2006 13:47:51 -0600
> "Patrick R. Michaud" <pmichaud at pobox.com> wrote:
> > On Tue, Dec 19, 2006 at 11:32:37AM -0800, pmwiki at 911networks.com
> > > I use the fixflow skin [thanks Hans, you are listed in the
> > > credits] with the (:noaction:). So you can't see the Edit, Print,
> > > Recent Changes, Source...
> > > 
> > > When I create the sitemap through http://www.xml-sitemaps.com it
> > > properly finds 30 pages, but when I check my logs I get:
> > > 
> > > sqlhacks.com 72.52.140.189 - - [15/Dec/2006:09:36:27 -0800] "GET /index.php/Retrieve/Retrieve HTTP/1.1" 404 9187 "-" "Mozilla/6.0 (MSIE 6.0; Windows NT 5.1;)"
> > > sqlhacks.com 72.52.140.189 - - [15/Dec/2006:09:36:28 -0800] "GET /index.php/Retrieve/Retrieve?action=source HTTP/1.1" 200 9538 "-" "Mozilla/6.0 (MSIE 6.0; Windows NT 5.1;)"
> > > sqlhacks.com ...
> > 
> > Somewhere on the site there is probably a (broken) link to a page 
> > called Retrieve.Retrieve.  Since that page doesn't have
> > (:noaction:) in it, the links appear and the xml-sitemaps.com
> > spider follows them.
> 
> I do NOT have a page named Retrieve.Retrieve. 

Right -- that's what I meant by "broken link" -- it's a link
to a non-existent page.  Sorry for not being precise -- I'll
rephrase:

Somewhere on the site is a link to a *non-existent* page 
called Retrieve.Retrieve.  Since that non-existent page 
doesn't have (:noaction:) in it, the links appear and the 
xml-sitemaps.com spider follows them.  See 
http://www.sqlhacks.org/index.php/Retrieve/Retrieve to
see what the robot is seeing.

So, because there's a broken link somewhere to a 
non-existent Retrieve.Retrieve page, the spider is picking
up extra links to ?action=source, ?action=edit, ?action=search,
etc. on that "non-existent" page.  

The spider apparently isn't smart enough to realize that a 
404 error code means "no such page" and that it shouldn't
index that page or follow any of the links on the page.
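
As a rough illustration of the check the spider seems to be skipping,
here is a minimal Python sketch. It assumes the crawler already has the
HTTP status code and the list of links extracted from the response; the
function name links_to_follow is hypothetical and not taken from any
real crawler.

```python
def links_to_follow(status_code, links):
    """Return the links a crawler should queue from one response.

    Hypothetical helper: only successful (2xx) responses are treated
    as real pages. Anything else, including 404, contributes no links.
    """
    if 200 <= status_code < 300:
        return list(links)
    return []
```

With a check like this, the 404 response for Retrieve/Retrieve would
contribute nothing to the crawl queue, so the ?action= links on it
would never be requested.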

(On the other hand, PmWiki appears to be giving robots
an 'index,follow' indication on non-existent pages.  We
should probably fix that to default to "noindex,nofollow".
But even this won't help with xml-sitemaps.com, as their
spider appears to ignore such directives anyway.)
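
For reference, honoring such a directive on the robot's side is not
much work. The sketch below uses Python's standard html.parser module
to read a robots meta tag and decide whether links may be followed. The
class and function names are illustrative assumptions, and this is not
how xml-sitemaps.com or PmWiki actually implement anything.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None when written without a value.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for item in attr.get("content", "").split(","):
                self.directives.add(item.strip().lower())


def may_follow(html):
    """Return False if the page asks robots not to follow its links."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return not ({"nofollow", "none"} & parser.directives)
```

A page with no robots meta tag is treated as followable here, which
matches the usual default.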

> Where did the Retrieve.Retrieve come from?

I don't know -- but all it takes is one incorrect link to
get the spider to go to that page.  It could be coming from
an errant sidebar link or from the skin somewhere.

> I have in the Site.SideBar:
> 
> ----
> %sidehead% [[Retrieve|[+Retrieve Data+]]]
> ----
> 
> Could it be from the Site.SideBar?

Yes, it could be.  It's not a good idea to have unqualified links
in Site.SideBar.  I would qualify the link with its group name:

    %sidehead% [[Site.Retrieve|[+Retrieve Data+]]]

> Do I need a (:noaction:) in the SideBar, any side effect?

I don't think that (:noaction:) in the SideBar will help--
by the time the SideBar is being displayed it's usually too
late to affect any earlier parts of the layout or the
main text area.

Pm