[pmwiki-users] Trouble with .pageindex when too much _new_ data to index (+ sqlite)

Petko Yotov 5ko at 5ko.fr
Wed Jan 28 16:16:55 CST 2015


On 2015-01-28 22:10, ABClf wrote:
> Main issue encountered is how .pageindex is handling its indexation
> task. It sounds like it definitely stops working when the amount of
> _new_ data is too big.
> I mean, the process looks like it evaluates first the amount of new
> data, rather than starting to index, thus, in case there is too much
> new data, you get a memory error message, and the game is over. I wish
> the pageindexing would work, and work, and work, no matter how much
> new data there is to index, until it's done.
> 
> If the amount of new data is acceptable, then he will start making the
> index. Not in one time : you will have to ask him several times, but
> at the end (search 10 times, more or less), you know it's done, and you
> have not encountered memory issue.

This is done by the function PageIndexUpdate() in scripts/pagelist.php.

There is a default limit of 10 seconds for the indexing work; that is, 
any pages that haven't been indexed when the limit is reached are 
skipped and will be indexed on the next search.

While pages are being indexed, there shouldn't be a huge need for 
memory. After the terms of a page are compacted, they are written into 
the ".pageindex,new" file and dropped from memory (actually the values 
are replaced). The same happens for the following pages, for up to 10 
seconds. After that, the contents of the old ".pageindex" file are 
copied to ".pageindex,new", and then ".pageindex,new" is renamed to 
".pageindex", replacing the old file. None of these operations should 
require a lot of memory.
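
To make the sequence above concrete, here is a rough sketch of a 
time-limited index update. It is not the actual PageIndexUpdate() code: 
the function name sketch_index_update(), the $getTerms callback and the 
one-line-per-page "Name:terms" format are assumptions made up for the 
illustration.

```php
<?php
// Hypothetical sketch, NOT PmWiki's real PageIndexUpdate().
// Assumed index format: one line per page, "PageName:term term ...".
function sketch_index_update(array $pages, string $indexFile,
                             float $maxSeconds, callable $getTerms): array {
    $start   = microtime(true);
    $newFile = "$indexFile,new";
    $out     = fopen($newFile, 'w');
    $done    = [];   // names of pages processed in this run

    foreach ($pages as $name) {
        // Time budget spent: leave the remaining pages for the next search.
        if (microtime(true) - $start > $maxSeconds) break;
        $terms = $getTerms($name);   // only one page's terms are held at a time
        fwrite($out, "$name:" . implode(' ', $terms) . "\n");
        $done[$name] = true;         // this array grows with the number of pages
    }

    // Copy the old entries line by line, skipping pages just re-indexed.
    if (is_file($indexFile)) {
        $in = fopen($indexFile, 'r');
        while (($line = fgets($in)) !== false) {
            $pos = strpos($line, ':');
            if ($pos === false) continue;
            if (!isset($done[substr($line, 0, $pos)])) fwrite($out, $line);
        }
        fclose($in);
    }
    fclose($out);
    rename($newFile, $indexFile);    // replace the old index file
    return array_keys($done);
}
```

In this sketch the only structure that grows during the run is the 
$done array of page names; everything else is streamed to disk.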

The only place I see where the memory usage can grow is line 773 of 
pagelist.php. This line adds the processed page name to an array, so 
that PmWiki knows that the page was already processed. If you have a 
huge number of pages, the characters composing the page names alone may 
exceed the memory limit. If your error messages mention this line 773, 
the problem is there.

You can reduce the number of pages indexed (actually the number of 
seconds of continued indexing) by adding this in config.php:

   $PageIndexTime = 5; # 5 seconds instead of 10

I'll review the functions next weekend in case we are missing something.

See also the SystemLimits recipe; you may be able to increase the memory 
limits.
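
As a rough example, and assuming your hosting allows the setting to be 
changed at run time, the PHP memory limit can be raised from config.php 
with the standard ini_set() function. The value '256M' is only an 
illustration, not a recommendation from the recipe:

```php
# in local/config.php; '256M' is an arbitrary example value
ini_set('memory_limit', '256M');
```

If the host forbids this, ini_set() returns false and the limit stays 
unchanged.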

> Related question is : as I'm using sqlite for storing a big amount of
> short and very short pages, why use the pmwiki .pageindex process
> rather than performing a fulltext search ?

The SQLite PageStore() class only handles the "storage" of the pages in 
a single SQLite database file. The reasons, and the pros and cons, are 
explained on the recipe page.

Apart from the fact that "a fulltext search from the SQLite database is 
not yet written", I think the built-in search using .pageindex will 
perform much faster than a fulltext database search.

Petko



