[pmwiki-users] pageindex (was: Category - Links not to include)

Petko Yotov 5ko at free.fr
Tue May 8 11:00:20 CDT 2007


On Tuesday 08 May 2007 15:57, Patrick R. Michaud wrote:
> On Tue, May 08, 2007 at 06:04:41AM +0200, Petko Yotov wrote:
> > On Tuesday 08 May 2007 04:59, Patrick R. Michaud wrote:
> > > > Second, this will cause pmwiki to open for reading every single page
> > > > (file) in the pagelist to make sure there is such string, and if
> > > > there are many pages, it may be quite slow (running at 100% CPU). ...
> > >
> > > Actually, this won't speed up or slow things down that much.  Both with
> > > and without the "[[!" search term, PmWiki will still read all of the
> > > pages in the resulting pagelist.  
> >
> > Wow, this is not great :-((. It shouldn't read all files if all the data
> > the request needs can be found by scanning only the .pageindex file!
> > Hmm... I was sure it was smarter. :-(
>
> Well, "read all files" is different from "read all files in the resulting
> pagelist".  :-)
>
> When PmWiki is searching pages using pagelist, it uses the .pageindex
> file to determine pages that may be safely *excluded* from the results.
> It then scans any remaining pages to verify that they match (and updates
> the .pageindex for any that don't).
>
> While the use of .pageindex to determine "negative matches" might
> seem backwards from what most people expect, it's more effective

No disagreement here: I also think it is mainly used to exclude files.

> this way.  At any point in time the .pageindex file may be incomplete
> or missing, so we can't rely on it to give us a complete list of all
> possible matching pages.  However, we can use it to exclude pages
> that there's no point in processing any further.  (This is also why
> PmWiki can still produce accurate results when .pageindex is missing--
> it only uses .pageindex as optimization, not as an authoritative
> source.)
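The "negative match" idea above can be sketched in a few lines of Python. This is only an illustration of the principle (PmWiki itself is written in PHP and its index format is richer); the function and data names here are hypothetical. The key point is that the index is trusted only to *exclude* pages, never to confirm a match, so stale or missing index entries degrade performance but never correctness.

```python
# Illustrative sketch of exclusion-only index use (not PmWiki's code).
# A page is dropped only when the index positively proves it cannot
# match; unindexed pages always survive and must be read and verified.

def candidate_pages(all_pages, index, term):
    """Keep a page unless the index proves it cannot match `term`."""
    survivors = []
    for page in all_pages:
        words = index.get(page)          # None if the page isn't indexed yet
        if words is not None and term not in words:
            continue                     # index says: safe to exclude
        survivors.append(page)           # unknown or possible match
    return survivors

index = {"Main.HomePage": {"welcome", "pmwiki"},
         "Main.Sandbox": {"test"}}
pages = ["Main.HomePage", "Main.Sandbox", "Main.NewPage"]

# Main.Sandbox is excluded by the index; Main.NewPage is not indexed,
# so it survives and would still have to be read and verified.
print(candidate_pages(pages, index, "pmwiki"))
```

This is why a missing .pageindex only costs extra page reads: the fallback is simply "read everything and verify".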
>
>
> Even if .pageindex could reliably tell us all of the pages, it
> doesn't always provide all of the data the request needs.  If
> $EnablePageListProtect is set (the default), then we still have to
> read the resulting pages in order to check read permissions and
> possibly other sorts of operations on them.

My bad, I forgot about this one: the reason is that I always disable it as the 
first thing when installing a wiki. Disabling it was (also) an often-cited way 
to optimize and speed up pagelists, IIRC both here on the list and in the 
manual.

>
> So, let's go back to the original case of a Category.* page containing:
>
>     (:pagelist link={*$FullName} list=normal "[[!" :)
>
> If there are five pages in the Category, and .pageindex is complete,
> then the link= parameter will end up excluding all of the pages
> that don't link to the Category page.  The list= parameter will
> strip out any pages that don't match the 'normal' patterns.
>
> For the (five) pages that remain, we then read the pages to
> check permissions, verify that {*$FullName} is really a link
> target (as opposed to simply having the name in the page text
> somewhere),

Because the links are separated from the rest of the .pageindex entry by ":", 
I thought that was enough to determine the linked pages (as opposed to plain 
text matches). Also, the text strings in .pageindex are split into separate 
words and would not match "Group.Page" with a dot, so, again, the index alone 
should be enough.
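To make the distinction concrete, here is a small Python sketch of parsing an index line of this general shape. The exact .pageindex format is assumed here for illustration only (name, mtime, space-separated words, then comma-separated link targets, with ":" as the field separator); the point is that a link target and a plain text word live in different fields, so they can be told apart without opening the page file.

```python
# Hypothetical .pageindex-style line parser (format assumed for
# illustration, not PmWiki's exact on-disk format):
#   "Group.Page:mtime: word word word :Link.One,Link.Two,"

def parse_index_line(line):
    name, mtime, words, links = line.split(":", 3)
    return {
        "name": name,
        "mtime": int(mtime),
        "words": set(words.split()),
        "links": set(l for l in links.split(",") if l),
    }

entry = parse_index_line(
    "Main.Recipes:1178623200: cooking pasta category :Category.Food,")

# "Category.Food" is recognizable as a link target, and cannot be
# confused with an indexed text word (words never contain a dot+group).
print("Category.Food" in entry["links"])   # link target
print("Category.Food" in entry["words"])   # not a plain text word
```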

> and search for "[[!" in the text.
>
> Both with and without the "[[!" term in the pagelist command
> we end up reading a total of five pages, which we would have
> to read anyway in order to verify read permissions on each.


But again, 
* when $EnablePageListProtect is disabled, and
* when the pageindex entry is up-to-date, and also,
* the search is just on name, group, link and/or single words, and
* there is no search on PageTextVars, nor on quoted expressions, and
* the selected order is on name/group/mtime,

(admittedly, many restrictions, but I bet most searches rarely need more),

then it would have all the data without opening every file in the list. If the 
match contains hundreds of pages, and the pagelist is limited by count=10..20, 
then pmwiki seems to open/read all those hundreds of pages, and not only 
pages 10 to 20.

When searching only by name, however, luckily, not all files are opened.
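A plausible reason for the behavior described above (sketched here in Python with hypothetical names; the actual PmWiki flow may differ) is that permission and match verification must run over the full matched list *before* the count window is applied, because any page excluded along the way shifts the numbering of the pages that follow it.

```python
# Sketch of why count=10..20 can still read every matched page:
# verification (which reads each page) happens before slicing, since
# an excluded page changes which pages fall into positions 10..20.

def pagelist(matches, readable, start, end):
    verified = [p for p in matches if readable(p)]  # reads each page
    return verified[start - 1:end]                  # 1-based, inclusive

matches = [f"Page{i}" for i in range(1, 101)]
readable = lambda p: p != "Page5"                   # one protected page

# Because Page5 drops out, the window starts at Page11, not Page10:
print(pagelist(matches, readable, 10, 20))
```

Slicing first and verifying only the window would be faster, but could return fewer than the requested number of pages, or the wrong ones.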

>
> Hope this helps,

Thank you, this message was very helpful for better understanding the indexing 
and searching. It is clearer now and will help with better strategic decisions 
when designing new sites.

May I ask about the PageListCache mechanism: does it also need to read all 
pages once the list is cached? And what happens if one user has read 
permissions on some pages, and another user does not, for the same cached 
pagelist?

Thanks,
Petko




More information about the pmwiki-users mailing list