[Pmwiki-users] Re: Spamming technique

Sun May 23 15:19:18 CDT 2004

On Sun, 23 May 2004, Crisses wrote:

> Hey, has anyone tried to run a grep on their site in the wiki.d folder 
> to see all http:// requests in their pages?  maybe I'll do that (just 
> to eyeball what comes up).  an initial "approved" file would be pretty 
> easy to make from there.

I just ran the following command (in bash):

    grep '^text=' wiki.d/* | tr '?' '\n' | tr ' ' '\n' | tr ']' '\n' | \
      grep -i http: | sed -e 's/.*\(http.*\)/\1/' | sort | \
      uniq > URIs.lst

and it extracts a unique list of URIs starting with 'http:'. The result
was a rather long list (more than 400 URIs). Realizing that I will use 
this command again, I ended up putting it in a script 'find-URIs.sh' that 
you can find here:

	http://wiki.lyx.org/pmwiki.php/SiteTest/ConfigFiles
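
For anyone who would rather not fetch the script, here is a minimal sketch
of a roughly equivalent extraction. It is not the contents of
'find-URIs.sh'. It assumes GNU grep (for -o and -h) and, like the command
above, that page text sits on lines starting with 'text='. The function
name 'extract_uris' is made up for illustration. A URI is taken to end at
the first space, '?' or ']', mirroring the three characters the command
above splits on.

```shell
#!/bin/sh
# Sketch only: list the unique 'http:' URIs found in a directory of page files.
# extract_uris is a hypothetical helper, not part of PmWiki or find-URIs.sh.

extract_uris() {
    # $1: directory holding the page files (e.g. wiki.d)
    # -h suppresses file names so only the matching text lines are piped on.
    grep -h '^text=' "$1"/* |
        # -o prints each match on its own line; the bracket expression
        # stops the match at ']', a space, or '?'.
        grep -o 'http:[^] ?]*' |
        # sort -u sorts and removes duplicates in one step.
        sort -u
}
```

It could then be called as 'extract_uris wiki.d > URIs.lst'.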

In order for you to use this script, you have to modify the line

	dir0=~lyx/www/pmwiki		# pmwiki/-directory

so that $dir0 points to your wiki directory. Then you can execute the 
script through:

	./find-URIs.sh +n > URIs.lst

which puts the result in a file called 'URIs.lst'.

Since I get so many URIs, I've put some valid URIs in a file
'valid-URIs.lst', which I use to filter the result as follows:

	cat URIs.lst | grep -v -f valid-URIs.lst

I still end up with about 200 links that I manually check (basically I 
just have to glance at them to see that they look reasonable).
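
One thing to keep in mind with the filter above: grep treats every line
of the file given to -f as a regular expression, so a '.' in an approved
entry matches any character. A minimal sketch of a stricter variant,
assuming the approved file holds literal strings (one per line), adds -F
so that entries are matched as fixed strings. The function name
'filter_unapproved' is made up for illustration.

```shell
#!/bin/sh
# Sketch only: print the URIs that contain none of the approved strings.
# filter_unapproved is a hypothetical helper name.

filter_unapproved() {
    # $1: file with one URI per line (e.g. URIs.lst)
    # $2: file with approved literal strings, one per line (e.g. valid-URIs.lst)
    # -F: treat each approved entry as a fixed string, not a regex
    # -f: read the entries from the file
    # -v: keep only the lines that match none of them
    grep -v -F -f "$2" "$1"
}
```

Usage would be 'filter_unapproved URIs.lst valid-URIs.lst'.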

What I've done now is to check 'URIs.lst' into my version control 
system, so that the next time I run the command to check for URIs, I can 
simply see which URIs are new.
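
Without a version control system, a similar "what is new" check can be
sketched with comm instead of a diff against the checked-in file. This
assumes a copy of the previous list has been kept; the function name
'new_uris' and the file name 'URIs.old' are made up for illustration.
Both lists are re-sorted in the C locale first, since comm needs its
inputs sorted in the same order it compares in.

```shell
#!/bin/sh
# Sketch only: print the URIs that appear in the new list but not the old one.
# new_uris is a hypothetical helper name.

new_uris() {
    # $1: previous list (e.g. URIs.old), $2: fresh list (e.g. URIs.lst)
    old_sorted=$(mktemp)
    new_sorted=$(mktemp)
    LC_ALL=C sort -u "$1" > "$old_sorted"
    LC_ALL=C sort -u "$2" > "$new_sorted"
    # comm -13 suppresses lines only in the first file (column 1) and
    # lines common to both (column 3), leaving lines only in the second.
    LC_ALL=C comm -13 "$old_sorted" "$new_sorted"
    rm -f "$old_sorted" "$new_sorted"
}
```

Usage would be 'new_uris URIs.old URIs.lst'.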

Oh, and I did find that another WikiSandbox-page had some bad links in 
it.

/Christian

-- 
Christian Ridderström                           http://www.md.kth.se/~chr