[pmwiki-users] Trouble with .pageindex when too much _new_ data to index (+ sqlite)

ABClf languefrancaise at gmail.com
Wed Jan 28 15:10:06 CST 2015


Hello,

I'm testing how PmWiki, using the sqlite recipe as page store, handles one
hundred thousand (short) pages (and maybe more later, if I can make it work
the way I need).

In my case, for a dictionary simulation, I define one group for Word and
another group for Quote. A pagelist on each Word page prints out the linked
quotes.
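
Roughly like this, on each Word page (the format template name here is just
a placeholder for a local one):

  (:pagelist group=Quote link={*$FullName} fmt=#quotelist:)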

I imported 70k words (word, sense, etymology, synonyms, etc., all as PTVs,
to be templated later) and 90k quotes (the quote, its source, the word).
The sqlite database is now about 100 MB (for 100k "pages") and .pageindex
is about 40 MB.

The main issue I encountered is how .pageindex handles its indexing task.
It simply stops working when the amount of _new_ data is too big.
The process appears to evaluate the whole set of new data first, rather
than starting to index right away, so when there is too much new data you
get a memory error message and the game is over. I wish the page indexing
would keep working, no matter how much new data there is to index, until
it's done.

If the amount of new data is acceptable, it starts building the index. Not
in one pass: you have to trigger it several times, but in the end (ten
searches, more or less) you know it's done, without having run into any
memory issue.

To import all my data, I had no choice but to split my original big files
into 10 or 20 pieces.
After each partial import, I had to run a few searches to drive the
indexing, until I was sure it was done (watching the file in the explorer
until its size stopped growing).
In other words, there is no way to import 50 MB of new data and get it
indexed; you have to split it first.
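
For what it's worth, the splitting itself is nothing fancy; a small script
along these lines (file names and chunk size are only examples) cuts a big
import file into pieces the indexer can swallow:

  <?php
  // Split a big import file into chunks of $linesPerChunk lines.
  // File names and chunk size are only examples.
  $source        = 'quotes-full.txt';
  $linesPerChunk = 5000;              // adjust to what .pageindex can digest

  $in    = fopen($source, 'r');
  $part  = 0;
  $count = 0;
  $out   = null;

  while (($line = fgets($in)) !== false) {
      if ($count % $linesPerChunk == 0) {
          if ($out) fclose($out);
          $part++;
          $out = fopen(sprintf('quotes-part-%02d.txt', $part), 'w');
      }
      fwrite($out, $line);
      $count++;
  }
  if ($out) fclose($out);
  fclose($in);
  echo "wrote $part chunk(s) from $count lines\n";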

At the end of the story, it works, and it does not work badly at all.
Hans' TextExtract recipe (limited to the Word group) does a nice job as well:

133 results from 90 pages, 72236 pages searched in 1.95 seconds

(regex doesn't work, but you can target anchored text).

It works, yes, but I don't feel safe, mostly because of the trouble of
getting .pageindex built. The biggest problem is that I cannot afford to
delete the current .pageindex, which took me more than an hour to build.
If I deleted that file, the amount of _new_ data (all 100 MB of sqlite data
would look new again) would be far too vast: PmWiki would run out of memory
on every search and the indexing process would fail before it ever started.

Is there anything that can be done with the native search engine so that it
does not fail whenever the amount of new data is too big?
Or some way to make the .pageindex processing more robust?

ImportText has an internal mechanism to avoid hitting PHP's "maximum
execution time" limits: « this script will perform imports for up to 15
seconds (as set by the $ImportTime variable). If it's unable to process all
of the imported files within that period of time, it closes out its work
and queues the remaining files to be processed on subsequent requests. »

Would it be possible/easy/worthwhile to implement that kind of protection
in the native search engine?
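
What I have in mind is behaviour like this toy sketch -- standalone PHP,
nothing to do with PmWiki's internals, file names made up -- which indexes
as much as fits in a time budget and queues the rest for the next request:

  <?php
  // Toy sketch only: index the words of the .txt files of a directory
  // under a time budget, and resume where it left off on the next run.
  $dir        = 'import';            // made-up location of the text files
  $indexFile  = 'toy.pageindex';
  $queueFile  = 'toy.queue';
  $timeBudget = 15;                  // seconds per run, like $ImportTime
  $start      = time();

  // build the work queue on the first run, otherwise resume it
  $queue = file_exists($queueFile)
      ? file($queueFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
      : glob("$dir/*.txt");

  $index = fopen($indexFile, 'a');
  while ($queue && (time() - $start) < $timeBudget) {
      $file  = array_shift($queue);
      $words = array_unique(str_word_count(file_get_contents($file), 1));
      fwrite($index, basename($file) . ':' . implode(',', $words) . "\n");
  }
  fclose($index);

  // whatever is left waits for the next request
  file_put_contents($queueFile, implode("\n", $queue));
  echo count($queue) . " file(s) still to index\n";

If the native indexer worked like that -- a fixed budget per request, then
queue the rest -- it could never run out of memory, only take more requests
to finish.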

A related question: since I'm using sqlite to store a large number of short
and very short pages, why go through the PmWiki .pageindex process at all
rather than performing a fulltext search directly in the database?
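
For instance, something like this (assuming the SQLite build has FTS5
compiled in; table and column names are only an illustration, not the
sqlite recipe's actual schema):

  <?php
  // Illustration only: a fulltext table kept next to the page data,
  // queried directly instead of going through .pageindex.
  $db = new SQLite3('wiki.sqlite');

  // one-time setup: a fulltext table mirroring page name + text
  $db->exec('CREATE VIRTUAL TABLE IF NOT EXISTS page_fts USING fts5(name, content)');

  // index one page (page name and text are made up)
  $stmt = $db->prepare('INSERT INTO page_fts (name, content) VALUES (:name, :content)');
  $stmt->bindValue(':name', 'Word.Abricot', SQLITE3_TEXT);
  $stmt->bindValue(':content', 'sense, etymology, synonyms...', SQLITE3_TEXT);
  $stmt->execute();

  // search
  $stmt = $db->prepare('SELECT name FROM page_fts WHERE page_fts MATCH :q');
  $stmt->bindValue(':q', 'etymology', SQLITE3_TEXT);
  $res = $stmt->execute();
  while ($row = $res->fetchArray(SQLITE3_ASSOC)) {
      echo $row['name'] . "\n";
  }

That would search the whole database in one SQL query, with no separate
index file to rebuild.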


Thank you for your advice,
Gilles.







-- 

---------------------------------------
| A | de la langue française
| B | http://www.languefrancaise.net
| C | languefrancaise at gmail.com
---------------------------------------
       @bobmonamour
---------------------------------------