
# slef-reflections on Blog Tools

  * Comment Tools
  * Other Sites
  * HTML in RSS
  * The Aggregator

* * *

## Comment Tools

### Dear Blogger - the truth about your CAPTCHA

##### Posted by mjr 2007-11-04

Dear Blogger,

Your CAPTCHA (word verification or whatever) doesn't test whether someone is
spamming. It tests whether someone has good eyesight, hearing, literacy,
numeracy or browser configuration. That is correlated with not spamming in one
way: few spambots include captcha-crackers yet (though there are some, and
they already have an 80% success rate IIRC). It is not strongly correlated for
humans (are blind people more likely to spam? I doubt it), and now some
spammers are using porno-trojans to get humans to crack the eyetests for them.
See <http://www.schneier.com/blog/archives/2007/11/spammers_using.html> and
<http://news.bbc.co.uk/1/hi/technology/7067962.stm>

I block spam by premoderating all comments (they're well over 95% spam), but
that does mean you need to check your site dashboard a lot to avoid cooling
discussions too much.

Other anti-spam tactics include OpenID, URL blacklists and so on - I don't use
them because I've not added support for them to my site yet. I suspect
Blogger.com users can't use them because Google hasn't added support for them.
They've bought captcha's false sense of security.
<http://www.w3.org/TR/turingtest#security>

As for encouraging non-spam comments, I'll generally defer to this top 10 from
Darren Rowse:
<http://www.problogger.net/archives/2006/10/12/10-techniques-to-get-more-comments-on-your-blog/>
On my sites, I find that inviting questions doesn't make much
difference, asking questions has limited use because readers are suffering
from "question-fatigue" these days and I have difficulty appearing to be
humble or gracious (people often read it as sarcasm! Hopefully that's just my
audience), but the rest of the techniques seem to work for me and I use them
when I can (not all of the sites I write are under my full control).

Please let me know if you see other successful ways to reduce spam or
encourage good comments.

Hope that helps, MJR/slef

[Jordi](http://oskuro.net/) commented:

> "I'm using Mako's akismet plugin for pyBlosxom, and it's made a great
> difference. I do get some spams through, but I hope that it'll get better as I
> train further Akismet.
>
> I think last time I looked at the stats, I had got like 2 spams out of 1000
> tries or so."

Yes, Akismet seems to be one of the better tools, as long as it's installed
properly.

Simon commented:

> "OpenID - antispam?
>
> Maybe I missed something, I thought OpenID provided authentication. If I
> have the same OpenID I can be assumed to be the same person (or person with
> person from the creator of the ID).
>
> As far as I am aware OpenID doesn't provide any "trust", so any spammer can
> have as many OpenIDs as he wants.
>
> This I thought was the main flaw with OpenID, distributed authentication
> needs a distributed trust scheme. At which point you might as well have use
> GNUPG, as it is the administering trust that is difficult, authentication in
> contrast is easy."

Good point. These days, I sometimes [advocate moderating new posters to
mailing
lists](http://permalink.gmane.org/gmane.org.fsf.europe.discussion/1991) as an
anti-spam tactic. To use a similar tactic on web comments, you need to be able
to decide whether the commenter is the same one that you trust, which is where
OpenID authentication can help.

I think WordPress offers a similar facility with its own registration and
hidden-email boxes, but I prefer OpenID.

On 2008-06-30, [Paul Russell](http://likesomuch.blogspot.com) commented:

> "Blogger /does/ actually support OpenID authentication for comments, but as
far as I know doesn't allow 'first post authentication', which given spammers
could create a dummy OpenID provider which supports authenticating
myspamdomain.com/anyoldusername without human intervention doesn't really
offer much additional security in my view."

I think that's only true if Blogger doesn't give you any way to assign long-
lasting permit or deny values to OpenID providers and users. That would be a
half-arsed use of OpenID and entirely disservice-as-usual for Blogger.
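
To make "long-lasting permit or deny values" concrete, here is a minimal
sketch in Python. It assumes the OpenID has already been verified by some
other code. The `Decision` names, the `decide` function, the rule tables and
the `alice.example.org` and `newcomer.example.net` identifiers are all
hypothetical; this is not Blogger's interface or any OpenID library's API,
only the moderation logic.

```python
from enum import Enum
from urllib.parse import urlparse


class Decision(Enum):
    PUBLISH = "publish"    # known, trusted commenter or provider
    MODERATE = "moderate"  # unknown: hold for a human to review
    REJECT = "reject"      # explicitly denied commenter or provider


def provider_of(openid_url):
    """Treat the host part of the OpenID URL as its provider (an assumption)."""
    return urlparse(openid_url).hostname or ""


def decide(openid_url, user_rules, provider_rules):
    """Look up stored rules: per-user first, then per-provider.

    Anything without a rule falls back to premoderation, so a throwaway
    identity from a brand-new provider gains nothing by authenticating.
    """
    if openid_url in user_rules:
        return user_rules[openid_url]
    return provider_rules.get(provider_of(openid_url), Decision.MODERATE)


# Hypothetical stored rules, built up as the site owner approves or denies.
user_rules = {"https://alice.example.org/": Decision.PUBLISH}
provider_rules = {"myspamdomain.com": Decision.REJECT}

for who in ("https://alice.example.org/",
            "http://myspamdomain.com/anyoldusername",
            "https://newcomer.example.net/"):
    print(who, decide(who, user_rules, provider_rules).value)
```

The point of the fallback is that authentication alone never publishes
anything: an identity only skips the queue after a human has permitted it.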

Oh, and Blogger has added OpenID support since I originally wrote the above. I
can't get it to work reliably for me. Is it JavaScript-reliant?

  * See also [Webmastering: Feedback and Comments](webcss#feedback)

### Stuttering through comment spam

2007-10-16 (Permalink): I get too much
[spam](http://mjr.towers.org.uk/blog/2007/spam) and some of it isn't obvious
from the subject line, particularly if it's a blog comment, so I use the
following script to read the suspicious emails and pause at the bottom of each
mail.

It takes two optional parameters, explained in the comments below. Please
comment on any obvious improvements to this quick hack.

    #!/usr/bin/mzscheme -qr
    ; Copies standard input to standard output, pausing at each matching line.
    (let ((c (or (and (> (vector-length argv) 0)
                      (string->number (vector-ref argv 0)))
                 1)) ; first arg is how long to pause in seconds, default 1
          (re (or (and (> (vector-length argv) 1)
                       (regexp (vector-ref argv 1)))
                  (regexp "^From [^ ]*@"))) ; second arg is regexp to pause at
          (i #f)) ; becomes #t once the first match has been seen
      (let loop ((l (read-line)))
        ; pause before every matching line except the first one
        (if (regexp-match re l) (if i (sleep c) (set! i #t)))
        (write-string l) (newline)
        (if (not (eof-object? (peek-char))) (loop (read-line)))))
    (exit)

An earlier version in Perl:

    #!/usr/bin/perl
    $c = ($ARGV[0]||1); # How long to pause, in seconds
    $re = ($ARGV[1]||'^From [^ ]*@'); # What to pause on
    $i = 0; # set to 1 once the first match has been seen
    while (<STDIN>) {
      if (/$re/) { if ($i) { sleep ($c) } $i=1; }
      print $_;
    }


[Barak A. Pearlmutter](http://www.bcl.hamilton.ie/~barak/) commented:

> "One idiom, used twice, is:
>
>     (or (and X Y) Z)
>
> where Y cannot be false. This is better expressed as:
>
>     (if X Y Z)
>
> The (set! i #t) is a bit ugly. It is good to use when/unless instead of if
> when the conditional is guarding a side effect rather than a value. There are
> two calls to read-line which are conceptually the same, and should be
> combined. The choice of the identifier "i" is a bit mysterious, unless you
> store a match count in it instead of a boolean. What you're really storing is
> a match count, but you truncate the count to 0 or many. Since one might wish
> to sleep longer on successive matches, I'm removing the truncated-addition
> code. Last, the line is not an appropriate variable to maintain in the loop;
> but i is.
>
> My rewrite:
>
>     #!/usr/bin/mzscheme -qr
>     (let ((c (if (> (vector-length argv) 0)
>                  (string->number (vector-ref argv 0))
>                  1)) ; first arg is how long to pause in seconds, default 1
>           (re (regexp (if (> (vector-length argv) 1)
>                           (vector-ref argv 1)
>                           "^From [^ ]*@")))) ; second arg is regexp to pause at
>       (let loop ((i 0))
>         (let* ((l (read-line))
>                (match (regexp-match re l)))
>           (when (and match (not (zero? i)))
>             (sleep c))
>           (write-string l) (newline)
>           (unless (eof-object? (peek-char))
>             (loop (+ i (if match 1 0)))))))
>     (exit)
>
> "

I think Y can be false in my script: (string->number (vector-ref argv 0)) can
be false if argv[0] isn't a number, and (regexp (vector-ref argv 1)) can be
false if the regexp is invalid, so I'll leave those (or ...) idioms, but I
welcome all the other changes. Thanks!
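
The difference between the two idioms can be sketched in Python, which has the
same `X and Y or Z` versus `Y if X else Z` distinction. This is only an
illustration of the first (number) argument: `parse_number`, `pause_or_and`
and `pause_if` are hypothetical names, and `parse_number` stands in for
string->number by returning `None` where Scheme would return #f.

```python
def parse_number(text):
    """Stand-in for string->number: a float, or None if text isn't numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def pause_or_and(argv, default=1.0):
    """Mirrors (or (and X Y) Z): default if the arg is missing OR unparsable."""
    return (len(argv) > 0 and parse_number(argv[0])) or default


def pause_if(argv, default=1.0):
    """Mirrors (if X Y Z): default only if the arg is missing."""
    return parse_number(argv[0]) if len(argv) > 0 else default


# Compare the two forms with no argument, a number, and a non-number.
for args in ([], ["2.5"], ["soon"]):
    print(args, pause_or_and(args), pause_if(args))
```

With a non-numeric argument the `if` form passes the failed parse straight
through, while the `or`/`and` form still falls back to the default. One
caveat of the translation: Python treats `0.0` as false, so an argument of
`0` also falls back here, whereas in Scheme only #f is false.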


* * *

## Blogging on Other Sites

2007-09-02 (Permalink): Some of my blogging is going to take place on other
sites from now on. Sometimes for money reasons, sometimes for audience
reasons.

Eagle-eyed readers may have spotted that I've already switched on [a
feed](http://mjr.towers.org.uk/blog/wsmforum.html) of my [WsM
Forum](http://www.wsmforum.co.uk/) writing and my new cycle-racing feed should
go live in the middle of next week. I'll still be archiving those posts, just
in case the other sites go off-line, and they still appear in my "all posts"
feeds.

I may move my satellite TV stuff off-site soon too, which would leave
programming, webmastering and my cooperative business posts on my personal
site. This means a site which "fits together" more neatly than my current mix,
I think. After that, I may edit the categories again, merging koha back into
hacks.

Are there other changes which I should consider?


* * *

## HTML in RSS

2007-07-11 (Permalink): [Diary of a geek: The trials and tribulations of
trying to be compliant](http://blog.andrew.net.au/2007/07/10#being_compliant)
mentions the question of embedding HTML in RSS feeds. My simple answer is to
use [RDF Site Summary (aka RSS-1)](http://purl.org/rss/1.0/) and [the content
module,](http://purl.org/rss/1.0/modules/content/) then put the HTML version
into a content:encoded CDATA section. RSS 0.91 doesn't really work for this
and the whole Winer-inspired Really Simple Syndication is fatally flawed on
XML namespaces.
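
As a rough sketch of that approach, the Python below builds an RSS 1.0 item
whose HTML body sits in a content:encoded CDATA section, then parses it back
to check that the HTML survives. The namespace URIs are the ones linked above
plus the standard RDF namespace; the function names and the example post are
hypothetical. A complete RSS 1.0 feed also needs a channel element listing the
items, which is omitted here to keep the example short.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

RSS1_NAMESPACES = (
    'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns="http://purl.org/rss/1.0/" '
    'xmlns:content="http://purl.org/rss/1.0/modules/content/"'
)


def cdata(html):
    """Wrap HTML in a CDATA section; split any ']]>' so it can't end it early."""
    return "<![CDATA[" + html.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def rss1_item(link, title, html):
    """Build one RSS 1.0 item carrying its HTML body in content:encoded."""
    return (
        f"<item rdf:about={quoteattr(link)}>"
        f"<title>{escape(title)}</title>"
        f"<link>{escape(link)}</link>"
        f"<content:encoded>{cdata(html)}</content:encoded>"
        "</item>"
    )


def rss1_fragment(items):
    """Wrap items in rdf:RDF with the namespace declarations they rely on."""
    return f"<rdf:RDF {RSS1_NAMESPACES}>{''.join(items)}</rdf:RDF>"


# Hypothetical post, for illustration only.
html = "<p>Hello, <em>world</em> &amp; friends</p>"
feed = rss1_fragment([rss1_item("http://example.org/post/1", "A post", html)])

# Round trip: a namespace-aware parser should hand the HTML back unchanged.
encoded = ET.fromstring(feed).find(
    "{http://purl.org/rss/1.0/}item/"
    "{http://purl.org/rss/1.0/modules/content/}encoded"
)
print(encoded.text == html)
```

The HTML needs no entity escaping inside CDATA; the only sequence that has to
be handled is `]]>`, which would otherwise close the section.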

Anyway, if you're really a geek, please add [a feedback
route](http://mjr.towers.org.uk/blog/2007/webcss#feedback) (pingback is my
current favourite) to your blog.


* * *

## The Aggregator

I just rewrote my aggregator (schcyrssmerge2) which still works roughly on
[this algebra from
2004](http://mjr.towers.org.uk/blog/2004-4.html#1081685979%40blogger.dsl.pipex.com)
(still the basic set theory rule of blog aggregate generation:
`aggregate_{i+1} = (blogs \ aggregate_i) ∪ (aggregate_i ∩ blogs)`) but now
calculates **agg1** a different way.

First, there's **agg0** - the items from the input RSS file - and I store
only the post ids of **agg0** as **agg0i** for speed.

Previously, I was reading in all the blogs and doing the above set theory
calculation in one go. With a couple of ill-behaved blogs (full-content feeds
of a few hundred kB), mzscheme's memory usage was just getting silly, topping
out at hundreds of megabytes to merge 4MB of feeds into a 3MB aggregate. I
think some of the problems are PLT's - reading a 3MB RSS file in with

> "(define a (call-with-input-file "friends.rss" read-xml))"

seems to result in mzscheme (v352) holding 100MB. What's with that?

Anyway, now I've written a **newitems** function which extracts new items not
listed in **agg0i** and old items which still appear in a blog and adds them
to lists.

Then, there's a function which tail-recurses over all the blog filenames,
parsing the RSS, calling **newitems** on each one and building the lists of
old and new items.

Finally, **agg1** is the new items appended to the old items. That is only
calculated as it gets written out as an RSS file.
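
The steps above can be sketched roughly as follows. This is a Python
illustration under stated assumptions, not the schcyrssmerge2 code: items are
modelled as dicts with a hypothetical `"id"` key, feed parsing is left out,
and a plain loop replaces the tail recursion over blog filenames.

```python
def newitems(blog_items, agg0_ids, new, old):
    """Sort one blog's items into new (unseen) and old (already aggregated)."""
    for item in blog_items:
        (old if item["id"] in agg0_ids else new).append(item)


def merge(agg0, blogs):
    """Return agg1: new items followed by old items still present in a blog.

    `blogs` may be any iterable of item lists (a generator, say), so each
    feed can be parsed, sorted and dropped in turn rather than held at once.
    """
    agg0_ids = {item["id"] for item in agg0}  # agg0i: post ids only
    new, old = [], []
    for blog_items in blogs:
        newitems(blog_items, agg0_ids, new, old)
    return new + old


# Hypothetical data: "c" has vanished from every blog, "d" and "e" are new.
agg0 = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
blogs = [[{"id": "b"}, {"id": "d"}], [{"id": "e"}, {"id": "a"}]]
agg1 = merge(agg0, blogs)
print([item["id"] for item in agg1])
```

Items that no longer appear in any blog drop out of the result, which matches
the set rule: only `blogs \ aggregate` and `aggregate ∩ blogs` are kept.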

I'll give it a few days of testing, then I'll publish a tarball. I use this to
build my all.rss feed from my various source feeds, in case you didn't
realise, as well as to run some planets for public and private use.

Anyone got a suggestion how best to reduce the read-xml memory consumption?
Upgrade? Switch dialect? I'd like some XML read/write functions and the SRFI
list functions.


* * *


Comments are moderated (damn spammers) but almost anything sensible gets
approved (albeit eventually). If you give a web address, I'll link it. I won't
publish your email address unless you ask me to, but I'll email you a link
when the comment is posted, or the reason why it's not posted.

This is copyright 2007 MJ Ray. See fuller notice on [front page](/).