[Pmwiki-users] Re: Spamming technique
Sun May 23 15:19:18 CDT 2004
On Sun, 23 May 2004, Crisses wrote:
> Hey, has anyone tried to run a grep on their site in the wiki.d folder
> to see all http:// requests in their pages? maybe I'll do that (just
> to eyeball what comes up). an initial "approved" file would be pretty
> easy to make from there.
I just ran the following command (in bash):
grep ^text= wiki.d/* | tr '?' '\n' | tr ' ' '\n' | tr ']' '\n' | \
  grep -i http: | sed -e 's/.*\(http.*\)/\1/' | sort | \
  uniq > URIs.lst
and it extracts a unique list of URIs starting with 'http:'. The result
was a rather long list (more than 400 URIs). Realizing that I will use
this command again, I ended up putting it in a script 'find-URIs.sh' that
you can find here:
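To see what the pipeline does, here is a self-contained demonstration on a throwaway page file (the directory, page name, and URLs below are made up for illustration):

```shell
# Create a throwaway PmWiki-style page file (sample data, not real wiki content).
mkdir -p /tmp/uri-demo/wiki.d
cat > /tmp/uri-demo/wiki.d/Main.HomePage <<'EOF'
version=pmwiki-1.0
text=Welcome! See [http://www.pmwiki.org] and http://example.com/page?id=1 too.
EOF

cd /tmp/uri-demo
# Same pipeline as above: split on '?', spaces and ']', keep http: tokens,
# strip any leading junk before 'http', then sort and deduplicate.
grep ^text= wiki.d/* | tr '?' '\n' | tr ' ' '\n' | tr ']' '\n' | \
  grep -i http: | sed -e 's/.*\(http.*\)/\1/' | sort | \
  uniq > URIs.lst
cat URIs.lst
```

Note that splitting on '?' also truncates query strings, so http://example.com/page?id=1 is reported as just http://example.com/page.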
In order for you to use this script, you have to modify the line
dir0=~lyx/www/pmwiki # pmwiki/-directory
so that $dir0 points to your wiki directory. Then you can execute
./find-URIs.sh +n > URIs.lst
which puts the result in a file called 'URIs.lst'.
Since I get so many URIs, I've put the known-good ones in a file
'valid-URIs.lst' which I use to filter the result as follows:
cat URIs.lst | grep -v -f valid-URIs.lst
I still end up with about 200 links that I manually check (basically I
just have to glance at them to see that they look reasonable).
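The grep -v -f filter can be tried on toy data; the hosts and pattern below are invented for the example (note that -f takes one regular expression per line, so dots should be escaped):

```shell
# Toy URI list and a pattern file of known-good hosts (invented examples).
cat > /tmp/URIs.lst <<'EOF'
http://evil.example.net/casino
http://www.pmwiki.org/wiki
EOF
cat > /tmp/valid-URIs.lst <<'EOF'
pmwiki\.org
EOF

# -f reads patterns from the file; -v keeps lines matching none of them,
# so only the unrecognized URIs remain for manual inspection.
grep -v -f /tmp/valid-URIs.lst /tmp/URIs.lst
```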
What I've done now is to check 'URIs.lst' into my version control
system, so that the next time I run the command to check for URIs, I can
simply see which URIs are new.
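Besides a version-control diff, comm(1) on the old and new (sorted) lists shows just the URIs that have appeared since the last run; the file names and contents here are illustrative:

```shell
# Previously checked-in list vs. freshly generated list (sample data).
cat > /tmp/URIs.old <<'EOF'
http://www.pmwiki.org/wiki
EOF
cat > /tmp/URIs.new <<'EOF'
http://spam.example.org/pills
http://www.pmwiki.org/wiki
EOF

# comm needs sorted input; -1 suppresses lines only in the old list,
# -3 suppresses lines common to both, leaving only the new URIs.
comm -13 /tmp/URIs.old /tmp/URIs.new
```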
Oh, and I did find that another WikiSandbox page had some bad links in it.
Christian Ridderström http://www.md.kth.se/~chr