[pmwiki-devel] WordCount Recipe.

Tue Oct 31 13:35:27 CST 2006

Patrick R. Michaud wrote:
> On Tue, Oct 31, 2006 at 04:53:24AM -0500, Stirling Westrup wrote:
>> I've just finished a word count recipe for NaNoWriMo. 
> 
> Having looked at it only briefly, I'd say it's done "right",
> at least as far as 2.1 is concerned.

Good! I was unsure of a number of things, including naming all my new
page attributes 'foocount' and using =pagecounts as my flag to tell me
that I've calculated them. I'm worried about core namespace issues. We
may want some Recipe guidelines for naming things like this in the future.

> However, I'm thinking that PmWiki's internal handling of page 
> properties may change soon in 2.2 (to make things such as this
> much simpler), so I wouldn't completely rely on the current 
> implementation.

That's fine. This only took a day to write, and now that I understand its
basic mechanism, it shouldn't be hard to rewrite to a new method.

> Also, the recipe seems to count words in the markup, whereas
> the original CountGlyphs recipe counted only the words in
> the output.  That might be a somewhat significant difference.

Yes, it might. I decided not to worry about it until I got my beta
version working. Now that it is working, I suppose it's the right time to
worry. I can think of several solutions, but I don't know which is better:

  1) Still count words in the markup text but convert all
non-alphanumerics to spaces first, and then collapse spaces. This would
make things like the string '%red%warning!%%' count as two words rather
than the 1 it seems to at the moment, while '!! title' would now count
as 1 word and not 2. It's arguable whether this is better or not, as we'd
still be counting things like (:linebreaks:) as a word. I suppose I
could add a list of preg_replace's similar to $SaveAttrPatterns to
remove things we don't want to count and to handle stuff like links, but
it's now getting more complicated and the processing time is going up.
I'm also not sure how this would play with UTF-8 pages or locales.

  2) Try to more closely duplicate what CountGlyphs did. Run the markup
text through MarkupToHTML, remove anything enclosed in <...> tags and
count that. This would mean that something like '%red%warning%%' would
only count as one word, but that text generated via (:include ...:) and
(:pagelist ...:) directives would get counted as well. It might also end
up counting text from groupheader and groupfooter pages. Finally,
because it would be counting the words in generated text which can
change without the current page being edited, saving the values in page
attributes would be counter-productive as they wouldn't be stable.

  3) Some better scheme that I haven't thought of.
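
To make the difference between options 1 and 2 concrete, here is a rough
sketch in Python (not PmWiki code). The function names are made up for
illustration, and the HTML string in the last line is only my assumption
about what '%red%warning!%%' might render to, not actual MarkupToHTML
output.

```python
import re

def count_whitespace_words(markup):
    """Roughly the current behaviour: split on whitespace only."""
    return len(markup.split())

def count_normalized_words(markup):
    """Option 1: turn every run of non-alphanumerics into one space,
    then split on whitespace."""
    # ASCII-only character class: an assumption made for this sketch.
    cleaned = re.sub(r"[^0-9A-Za-z]+", " ", markup)
    return len(cleaned.split())

def count_rendered_words(html):
    """Option 2: drop anything enclosed in <...> and count what is left."""
    text = re.sub(r"<[^>]*>", " ", html)
    return len(text.split())

print(count_whitespace_words("%red%warning!%%"))   # 1
print(count_normalized_words("%red%warning!%%"))   # 2
print(count_whitespace_words("!! title"))          # 2
print(count_normalized_words("!! title"))          # 1
print(count_normalized_words("(:linebreaks:)"))    # 1
print(count_rendered_words("<span style='color:red;'>warning!</span>"))  # 1
```

The ASCII-only character class in the option 1 function treats accented
letters as separators, so it would split a word like 'naïve' in two, which
is exactly the UTF-8 worry mentioned above.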

Any suggestions?

>> I don't /think/ it would be too much
>> of a burden to calculate a word and/or glyph count on every save, but
>> all my sites are tiny so I've no experience on what might be burdensome
>> on the larger sites.
> 
> You're correct, it wouldn't be too much of a burden to count words
> on every save.  It's certainly less of a burden than doing it on
> every read.  

Good. I'll make saving to page attributes the default then.

I also keep thinking that there should be some way to reduce processing
times in the case where this recipe is used to count words in static
pages. I'm wondering if it would be a good idea to (somehow) record the
identities of pages for which these numbers had to be generated on the
fly. Then whenever anyone performs a save, we pull a random page off of
the list and call UpdatePage on it. Eventually all static pages would be
updated and processing times would drop. I'm not really sure how one
would go about doing this though, or even if it's a good idea.
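
For what it's worth, the bookkeeping for that idea might look something
like the following, sketched in Python rather than PmWiki's PHP.
Everything here is hypothetical: stale_pages, read_count, refresh and
on_save are invented names, and the in-memory dicts stand in for the page
store, the page attributes, and the real UpdatePage call.

```python
import random

page_text = {}       # page name -> markup (stand-in for the page store)
saved_counts = {}    # page name -> cached count (stand-in for page attributes)
stale_pages = set()  # pages whose count had to be generated on the fly

def count_words(markup):
    return len(markup.split())

def read_count(name):
    """Return the word count, remembering pages that have no cached value."""
    if name in saved_counts:
        return saved_counts[name]
    stale_pages.add(name)
    return count_words(page_text[name])

def refresh(name):
    """Stand-in for UpdatePage: store the count and clear the stale flag."""
    saved_counts[name] = count_words(page_text[name])
    stale_pages.discard(name)

def on_save(name, markup, rng=random):
    """Save a page, then refresh one randomly chosen stale page, if any."""
    page_text[name] = markup
    refresh(name)
    if stale_pages:
        refresh(rng.choice(sorted(stale_pages)))

# Two static pages get read before they have cached counts...
page_text.update({"Main.Old1": "one two three", "Main.Old2": "four five"})
read_count("Main.Old1")
read_count("Main.Old2")
# ...and each later save quietly updates one of them.
on_save("Main.New", "hello world")
print(len(stale_pages))  # 1
```

This only covers the bookkeeping. Where the list would be stored between
requests, and what happens when two saves run at the same moment, are
still open questions.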