[pmwiki-users] pageindex (was: Category - Links not to include)

Tue May 8 08:57:41 CDT 2007

On Tue, May 08, 2007 at 06:04:41AM +0200, Petko Yotov wrote:
> On Tuesday 08 May 2007 04:59, Patrick R. Michaud wrote:
> > > Second, this will cause pmwiki to open for reading every single page
> > > (file) in the pagelist to make sure there is such string, and if there
> > > are many pages, it may be quite slow (running at 100% CPU). ...
> >
> > Actually, this won't speed up or slow things down that much.  Both with
> > and without the "[[!" search term, PmWiki will still read all of the
> > pages in the resulting pagelist.  
> 
> Wow, this is not great :-((. It shouldn't read all files if all the data the 
> request needs can be found by scanning only the .pageindex file! Hmm... I was 
> sure it was smarter. :-(

Well, "read all files" is different from "read all files in the resulting
pagelist".  :-)

When PmWiki is searching pages using pagelist, it uses the .pageindex
file to determine pages that may be safely *excluded* from the results.
It then scans any remaining pages to verify that they match (and updates
the .pageindex for any that don't).

While the use of .pageindex to determine "negative matches" might
seem backwards from what most people expect, it's more effective
this way.  At any point in time the .pageindex file may be incomplete
or missing, so we can't rely on it to give us a complete list of all 
possible matching pages.  However, we can use it to exclude pages
that there's no point in processing any further.  (This is also why
PmWiki can still produce accurate results when .pageindex is missing:
it uses .pageindex only as an optimization, not as an authoritative
source.)
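
To make the "negative match" idea concrete, here is a rough sketch in
Python (not PmWiki's actual PHP code).  The function name, the
read_page callback, and the word-set index are all made up for
illustration; the real .pageindex format and search logic differ.
The point is only that a page missing from the index is never
excluded, so an incomplete or empty index costs extra reads but never
changes the results.

```python
def pagelist(all_pages, index, term, read_page):
    """Return the names of pages containing `term` as a word.

    all_pages -- iterable of page names (hypothetical)
    index     -- dict: page name -> set of words; may be incomplete or empty
    read_page -- callback that returns a page's text (stands in for file I/O)
    """
    candidates = []
    for name in all_pages:
        entry = index.get(name)
        # Only an *indexed* page whose entry lacks the term can be
        # safely excluded.  An unindexed page stays a candidate.
        if entry is not None and term not in entry:
            continue
        candidates.append(name)

    matches = []
    for name in candidates:
        words = set(read_page(name).lower().split())
        index[name] = words  # refresh the index entry while we have the text
        if term in words:
            matches.append(name)
    return matches


if __name__ == "__main__":
    pages = {"A": "apple banana", "B": "banana cherry", "C": "apple cherry"}
    partial_index = {"B": {"banana", "cherry"}}  # A and C are not indexed yet
    print(pagelist(pages, partial_index, "apple", pages.__getitem__))
```

With the partial index above, only A and C are read; with an empty
index all three pages are read, and the result is the same either way.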

Even if .pageindex could reliably tell us all of the pages, it
doesn't always provide all of the data the request needs.  If 
$EnablePageListProtect is set (the default), then we still have to
read the resulting pages in order to check read permissions and
possibly perform other sorts of operations on them.

So, let's go back to the original case of a Category.* page containing:

    (:pagelist link={*$FullName} list=normal "[[!" :)

If there are five pages in the Category, and .pageindex is complete,
then the link= parameter will end up excluding all of the pages
that don't link to the Category page.  The list= parameter will
strip out any pages that don't match the 'normal' patterns.

For the (five) pages that remain, we then read the pages to
check permissions, verify that {*$FullName} is really a link 
target (as opposed to simply having the name in the page text
somewhere), and search for "[[!" in the text.  

Both with and without the "[[!" term in the pagelist command 
we end up reading a total of five pages, which we would have 
to read anyway in order to verify read permissions on each.
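
The same count can be shown with a toy model in Python.  This is only
an illustration under simplifying assumptions: the page dicts, the
link_index structure, and run_pagelist are hypothetical, the index is
assumed to be complete, and the list=normal filtering is left out.

```python
def run_pagelist(pages, link_index, target, text_term=None):
    """Return (matching page names, number of pages read).

    pages      -- dict: name -> {"readable": bool, "links": set, "text": str}
    link_index -- dict: name -> set of link targets (assumed complete here)
    """
    # Step 1: use the index alone to exclude pages that don't link to target.
    candidates = [name for name in pages if target in link_index[name]]

    reads = 0
    results = []
    for name in candidates:
        page = pages[name]  # stands in for opening the page file
        reads += 1
        if not page["readable"]:          # permission check needs the page
            continue
        if target not in page["links"]:   # verify it really is a link target
            continue
        # The text search reuses the page we already had to read.
        if text_term is not None and text_term not in page["text"]:
            continue
        results.append(name)
    return results, reads


if __name__ == "__main__":
    pages = {}
    for i in range(5):
        pages[f"Main.Member{i}"] = {
            "readable": True, "links": {"Category.Foo"}, "text": "[[!Foo]]"}
    for i in range(20):
        pages[f"Main.Other{i}"] = {
            "readable": True, "links": set(), "text": "unrelated"}
    link_index = {name: page["links"] for name, page in pages.items()}

    print(run_pagelist(pages, link_index, "Category.Foo")[1])
    print(run_pagelist(pages, link_index, "Category.Foo", "[[!")[1])
```

Both calls report the same number of reads, because the text term is
checked against pages that were going to be opened for the permission
check anyway.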

Hope this helps,

Pm