slef-reflections on Blog Tools

Comment Tools

Dear Blogger - the truth about your CAPTCHA

Posted by mjr 2007-11-04

Dear Blogger,

Your CAPTCHA (word verification or whatever) doesn't test whether someone is spamming. It tests whether someone has good eyesight, hearing, literacy, numeracy or browser configuration. That is correlated with spamming in one way (few spambots include captcha-crackers yet, but there are some and they already have an 80% success rate IIRC), but it is not strongly correlated for humans (are blind people more likely to spam? I doubt it). Now some spammers are using porno-trojans to get humans to crack the eye tests for them - see

I block spam by premoderating all comments (they're well over 95% spam), but that does mean you need to check your site dashboard a lot to avoid cooling discussions too much.

Other anti-spam tactics include OpenID, URL blacklists and so on - I don't use them because I've not added support for them to my site yet. I suspect Blogger users can't use them because Google hasn't added support for them. They've bought into captcha's false sense of security.

As for encouraging non-spam comments, I'll generally defer to this top 10 from Darren Rowse. On my sites, I find that inviting questions doesn't make much difference, and asking questions has limited use because readers are suffering from "question-fatigue" these days. I also have difficulty appearing humble or gracious (people often read it as sarcasm! Hopefully that's just my audience). The rest of the techniques seem to work for me and I use them when I can (not all of the sites I write for are under my full control).

Please let me know if you see other successful ways to reduce spam or encourage good comments.

Hope that helps, MJR/slef

Jordi commented:

"I'm using Mako's akismet plugin for pyBlosxom, and it's made a great difference. I do get some spams through, but I hope that it'll get better as I train further Akismet.

I think last time I looked at the stats, I had got like 2 spams out of 1000 tries or so."

Yes, Akismet seems to be one of the better tools, as long as it's installed properly.

Simon commented:

"OpenID - antispam?

Maybe I missed something, I thought OpenID provided authentication. If I have the same OpenID I can be assumed to be the same person (or a person with permission from the creator of the ID).

As far as I am aware OpenID doesn't provide any "trust", so any spammer can have as many OpenIDs as he wants.

This I thought was the main flaw with OpenID: distributed authentication needs a distributed trust scheme. At which point you might as well use GnuPG, as it is administering trust that is difficult; authentication, in contrast, is easy."

Good point. These days, I sometimes advocate moderating new posters to mailing lists as an anti-spam tactic. To use a similar tactic on web comments, you need to be able to decide whether the commenter is the same one that you trust, which is where OpenID authentication can help.
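A minimal sketch of that tactic, in Python rather than the Scheme used elsewhere on this page. It assumes the identity URL has already been verified (the OpenID exchange itself is not shown), and the names `route_comment`, `approve`, `trusted` and `denied` are hypothetical, not part of any blog software mentioned here.

```python
# Hypothetical sketch: unknown commenters are held for moderation,
# identities a moderator approved earlier are published directly.


def route_comment(identity, trusted, denied):
    """Return 'publish', 'hold' or 'reject' for a verified identity URL."""
    if identity in denied:
        return "reject"
    if identity in trusted:
        return "publish"
    return "hold"  # first-time identity: premoderate


def approve(identity, trusted):
    """Record a moderator's approval so later comments skip the queue."""
    trusted.add(identity)


trusted = {"https://alice.example/"}
denied = {"https://spam.example/"}
print(route_comment("https://bob.example/", trusted, denied))
```

The point of the sketch is only that authentication supplies a stable key for the permit and deny lists; deciding who goes on those lists is still a human job.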

I think WordPress offers a similar facility with its own registration and hidden-email boxes, but I prefer OpenID.

On 2008-06-30, Paul Russell commented:

"Blogger /does/ actually support OpenID authentication for comments, but as far as I know doesn't allow 'first post authentication', which given spammers could create a dummy OpenID provider which supports authenticating without human intervention doesn't really offer much additional security in my view."

I think that's only true if Blogger doesn't give you any way to assign long-lasting permit or deny values to OpenID providers and users. That would be a half-arsed use of OpenID and entirely disservice-as-usual for Blogger.

Oh, and Blogger added OpenID support since I originally wrote the above. I can't get it to work reliably for me. Javascript-reliant?

Stuttering through comment spam

2007-10-16 (Permalink): I get too much spam and some of it isn't obvious from the subject line, particularly if it's a blog comment, so I use the following script to read the suspicious emails and pause at the bottom of each mail.

It takes two optional parameters, explained in the comments below. Please comment on any obvious improvements to this quick hack.

#!/usr/bin/mzscheme -qr
(let ((c (or (and (> (vector-length argv) 0)
                  (string->number (vector-ref argv 0)))
             1)) ; first arg is how long to pause in seconds, default 1
      (re (or (and (> (vector-length argv) 1)
                   (regexp (vector-ref argv 1)))
              (regexp "^From [^ ]*@"))) ; second arg is regexp to pause at
      (i #f))
  (let loop ((l (read-line)))
    (if (regexp-match re l) (if i (sleep c) (set! i #t)))
    (write-string l) (newline)
    (if (not (eof-object? (peek-char))) (loop (read-line)))))

# An earlier version in Perl
$c = ($ARGV[0]||1); # How long to pause, in seconds
$re = ($ARGV[1]||'^From [^ ]*@'); # What to pause on
$i = 0;
while (<STDIN>) {
  if (/$re/) { if ($i) { sleep ($c) } $i=1; }
  print $_;
}

Barak A. Pearlmutter commented:

"One idiom, used twice, is:

(or (and X Y) Z)

where Y cannot be false. This is better expressed as:

(if X Y Z)

The (set! i #t) is a bit ugly. It is good to use when/unless instead of if when the conditional is guarding a side effect rather than a value. There are two calls to read-line which are conceptually the same, and should be combined. The choice of the identifier "i" is a bit mysterious, unless you store a match count in it instead of a boolean. What you're really storing is a match count, but you truncate the count to 0 or many. Since one might wish to sleep longer on successive matches, I'm removing the truncated-addition code. Last, the line is not an appropriate variable to maintain in the loop; but i is.

My rewrite:

#!/usr/bin/mzscheme -qr
(let ((c (if (> (vector-length argv) 0)
             (string->number (vector-ref argv 0))
             1)) ; first arg is how long to pause in seconds, default 1
      (re (regexp (if (> (vector-length argv) 1)
                      (vector-ref argv 1)
                      "^From [^ ]*@")))) ; second arg is regexp to pause at
  (let loop ((i 0))
    (let* ((l (read-line))
           (match (regexp-match re l)))
      (when (and match (not (zero? i)))
        (sleep c))
      (write-string l) (newline)
      (unless (eof-object? (peek-char))
        (loop (+ i (if match 1 0)))))))

I think Y can be false in my script: (string->number (vector-ref argv 0)) can be false if argv[0] isn't a number, and (regexp (vector-ref argv 1)) can be false if the regexp is invalid, so I'll leave those (or ...) idioms, but I welcome all the other changes. Thanks!
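The difference between the two idioms can be sketched in Python, where `X and Y or Z` plays the role of `(or (and X Y) Z)`. This is only an illustration: `to_number`, `pause_or_and` and `pause_if` are hypothetical names, and `to_number` merely imitates the way string->number returns false for unparsable input.

```python
def to_number(text):
    """Imitate string->number: return None instead of raising."""
    try:
        return float(text)
    except ValueError:
        return None


def pause_or_and(argv):
    # (or (and X Y) Z): falls back to 1 when the argument is
    # missing OR cannot be parsed.
    # Caveat: Python also treats 0 as false, which Scheme does not.
    return (len(argv) > 0 and to_number(argv[0])) or 1


def pause_if(argv):
    # (if X Y Z): falls back to 1 only when the argument is missing;
    # an unparsable argument leaks through as None.
    return to_number(argv[0]) if len(argv) > 0 else 1


print(pause_or_and(["abc"]), pause_if(["abc"]))
```

With a bad first argument the or/and form still yields a usable pause, while the if form hands the failure on to whatever calls sleep.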

Blogging on Other Sites

2007-09-02 (Permalink): Some of my blogging is going to take place on other sites from now on. Sometimes for money reasons, sometimes for audience reasons.

Eagle-eyed readers may have spotted that I've already switched on a feed of my WsM Forum writing and my new cycle-racing feed should go live in the middle of next week. I'll still be archiving those posts, just in case the other sites go off-line, and they still appear in my "all posts" feeds.

I may move my satellite TV stuff off-site soon too, which would leave programming, webmastering and my cooperative business posts on my personal site. This means a site which "fits together" more neatly than my current mix, I think. After that, I may edit the categories again, merging koha back into hacks.

Are there other changes which I should consider?


2007-07-11 (Permalink): Diary of a geek: The trials and tribulations of trying to be compliant mentions the question of embedding HTML in RSS feeds. My simple answer is to use RDF Site Summary (aka RSS-1) and the content module, then put the HTML version into a content:encoded CDATA section. RSS 0.91 doesn't really work for this and the whole Winer-inspired Really Simple Syndication is fatally flawed on XML namespaces.
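As a rough illustration of that answer, here is a Python sketch that writes an RSS 1.0 item with the HTML in a content:encoded CDATA section. The function names are hypothetical, the feed omits the channel element a real RSS 1.0 file needs, and this is not how my own feeds are generated.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"


def rss1_item(link, title, html):
    """Return one <item> with the HTML carried in content:encoded."""
    # A CDATA section cannot contain "]]>", so split any occurrence
    # across two adjacent sections.
    cdata = html.replace("]]>", "]]]]><![CDATA[>")
    return (
        "<item rdf:about=%s>\n" % quoteattr(link)
        + "  <title>%s</title>\n" % escape(title)
        + "  <link>%s</link>\n" % escape(link)
        + "  <content:encoded><![CDATA[%s]]></content:encoded>\n" % cdata
        + "</item>"
    )


def rss1_feed(items):
    """Wrap items in an rdf:RDF root declaring the namespaces used."""
    return (
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n'
        '         xmlns="http://purl.org/rss/1.0/"\n'
        '         xmlns:content="%s">\n' % CONTENT_NS
        + "\n".join(items)
        + "\n</rdf:RDF>"
    )


def encoded_html(feed_xml):
    """Read back the first content:encoded payload, for checking."""
    root = ET.fromstring(feed_xml)
    return root.find(".//{%s}encoded" % CONTENT_NS).text
```

Because the content module lives in its own namespace, a reader that does not understand it can skip the element and still use the title and link.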

Anyway, if you're really a geek, please add a feedback route (pingback is my current favourite) to your blog.

The Aggregator

I just rewrote my aggregator (schcyrssmerge2). It still works roughly on this algebra from 2004 - still the basic set theory rule of blog aggregate generation: aggregate_(i+1) = (blogs \ aggregate_i) ∪ (aggregate_i ∩ blogs) - but now calculates agg1 a different way.
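Read as plain set operations on post ids, the rule might be sketched like this in Python (the aggregator itself is Scheme; `next_aggregate` is a hypothetical name used only for this illustration):

```python
def next_aggregate(aggregate, blogs):
    """Apply the rule to two sets of post ids and return the next aggregate."""
    new_items = blogs - aggregate    # in a blog, not yet aggregated
    kept_items = aggregate & blogs   # aggregated and still in a blog
    return new_items | kept_items


print(sorted(next_aggregate({"a", "b", "c"}, {"b", "c", "d"})))
```

Anything that has dropped out of every source blog falls out of the aggregate on the next run.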

First, there's agg0 - the items from the input RSS file - and I store only the post ids of agg0 as agg0i for speed.

Previously, I was reading in all the blogs and doing the above set theory calculation in one go. With a couple of ill-behaved blogs (full-content feeds of a few hundred kB), mzscheme's memory usage was just getting silly, topping out at hundreds of megabytes to merge 4MB of feeds into a 3MB aggregate. I think some of the problems are PLT's - reading a 3MB RSS file in with

(define a (call-with-input-file "friends.rss" read-xml))

seems to result in mzscheme (v352) holding 100MB. What's with that?

Anyway, now I've written a newitems function which extracts new items not listed in agg0i and old items which still appear in a blog and adds them to lists.

Then, there's a function which tail-recurses over all the blog filenames, parsing the RSS, calling newitems on each one and building the lists of old and new items.

Finally, agg1 is the new items appended to the old items. That is only calculated as it gets written out as an RSS file.
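The shape of that one-feed-at-a-time approach can be sketched in Python. This is an assumption-laden outline, not the schcyrssmerge2 code: `parse` is a hypothetical callable that turns a filename into a list of items, each item is assumed to be a dict with an "id" key, and the old-then-new ordering is my literal reading of "appended".

```python
from itertools import chain


def new_items(items, agg0_ids, new, old):
    """Sort one feed's items into the running new and old lists."""
    for item in items:
        if item["id"] in agg0_ids:
            old.append(item)  # already aggregated and still in a blog
        else:
            new.append(item)  # not listed in agg0i


def merge_feeds(paths, parse, agg0_ids):
    """Visit feeds one by one; return agg1 as a lazy sequence."""
    new, old = [], []
    for path in paths:
        # Only one parsed feed needs to be held at a time.
        new_items(parse(path), agg0_ids, new, old)
    # Nothing is concatenated until the caller iterates, for
    # example while writing the output file.
    return chain(old, new)
```

Items in agg0i that no feed mentions any more are simply never added to either list, which matches the set rule.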

I'll give it a few days of testing, then I'll publish a tarball. I use this to build my all.rss feed from my various source feeds, in case you didn't realise, as well as to run some planets for public and private use.

Anyone got a suggestion how best to reduce the read-xml memory consumption? Upgrade? Switch dialect? I'd like some XML read/write functions and the SRFI list functions.

Comments are moderated (damn spammers) but almost anything sensible gets approved (albeit eventually). If you give a web address, I'll link it. I won't publish your email address unless you ask me to, but I'll email you a link when the comment is posted, or the reason why it's not posted.

This is copyright 2007 MJ Ray. See fuller notice on front page.