June 2008

There’s a particularly insidious kind of comment spam nowadays, one that cannot be defeated by automated measures such as captcha. I’ve been noticing more and more of it on QuestionCopyright.org, and I’m sure other web site admins are experiencing the same wave.

I don’t know if this phenomenon has a name yet; perhaps you can tell me.

Basically, it’s comment spam written by people — real human beings, not robots — who are paid to surf the web. They scan each article as quickly as they can and then leave a “drive-by” comment. The comment is usually on-topic, more or less, but of extremely low quality, and contains a commercial link back to whoever’s paying that person to surf. This is now a business model: there are intermediaries who hook up people willing to leave links for money with companies looking to boost their search engine rankings. The intermediary charges a flat rate per comment (US $0.20 seems to be the going rate), keeping a percentage and paying the rest to the surfer. The customers buy in bulk, of course, and the surfers are paid in bulk; the intermediary’s business model is based on economies of scale and smoothing things out.

Such comments now comprise the vast majority of new comments on QuestionCopyright.org. That is, we still get genuine comments at the same rate we did before — in fact, that rate may even be slowly increasing — but it’s dwarfed by the number of paid-link spams we’re getting now. It’s a total deluge. To a first approximation, all our new comments are spams, and then if you look closely there are a few hams (good comments) scattered randomly in the flood.

No Turing Test can possibly solve this, because actual humans are involved. Making your captcha puzzles harder won’t help: all it does is drive up the price-per-comment a bit for the buyers, while also making it harder for legitimate commenters to leave their remarks. The only way to detect such comments is to have human editors reading and making judgements.

This has profound implications for the user interfaces by which editors filter out spam. But before we get to that, let me show you how insidious the problem is. Here are some examples, all taken recently from the same article on QuestionCopyright.org. (Note that this is just the tip of the iceberg — the same thing is happening on all the articles on the site.)

This article is very
Submitted by Anonymous on Thu, 2008-06-19 12:53.

This article is very useful.I read it carefully and I agree with the main idea of the author.The opportunities that the internet offers to make our life and work simpler should be taken advantage of. It is absolutely necessary in the field of copyright.

Internet marketing

That one was a pretty easy call. Even the link text (“Internet marketing”) practically screams spam.

But how about this next one?

Extremely Informative
Submitted by Sedona on Mon, 2008-06-09 11:32

Thank you for this extremely informative article. I agree I don’t feel its about creativity but the publishing entities sustaining a mood of “go cautiously” and keep a big legal war chest.

Thank you,
Sedona

A little harder to tell, that time. The comment doesn’t exactly say anything, but it’s not immediately clear that it’s nonsense — you have to read it somewhat carefully to figure that out. Sedona is actually a registered user of the site (note that her name is highlighted in the header line), and the link text at the bottom is just the name “Sedona” too. But the link points to www.sedona-spiritual-vacations.com.

It turned out that this comment was also pretty clearly spam: “Sedona” left similar comments elsewhere on the site, always with the same commercial link at the bottom. None of her comments said anything much, let alone responded in a meaningful way to the content of the article.

But it gets worse. Some link spam comments actually say something. The paid surfer reads the article, apparently enjoys it and has some kind of non-trivial thought about it, and leaves a halfway decent (or sometimes even better than that) comment — but still with a paid link. Like this:

Patent and CopyRight
Submitted by Anonymous on Thu, 2008-05-22 15:41

Apart from the middle man and distributors its probably Lawyers who benefit the most from these **laws**.

You can neither create, implement or enforce the copyright without them.

One only has to look at the case of RIM ( Blackberry ) in Canada who was forced to pay 600 Million to what was in essence a group of 30+ lawyers who pro bono backed a patent that was actually overthrown in court ( but not before RIM was told to pay ).

This is not an isolated example where a claim jumper has been given a ridiculous patent by the patent office ( who frequently revokes them after they are challenged )

My uncle, who is somewhat of an economist, likes to say that lawyers are one of the very few professional groups who do not contribute to the gross national product of a country.

I am not against lawyers, they are a useful bunch. But like many government employees ( which they are not ) when allowed, too many of them actively attempt to overvalue their services within the scope or measurement of a country’s forward economic progress.

Signed, A Poker Lover

Not a great comment, I admit, but not completely pointless either. If there had been no commercial link, nothing else about it would have raised my suspicions. Some days later, it was followed by this one:

Authors & Artists
Submitted by Anonymous on Sun, 2008-06-08 14:14

I agree very much with the previous poster. How many lawsuits are brought up about copyright a year. The millions of dollars which are thrown at law companies around the world to up hold a ‘companies’ intellectual rights is pathetic to say the least.

The only person who truely has a right to claim stack is the writer, producer artist. I qould pay my way to anyone who does work for me. If they provide a service like my electric or water company I pay them. But what do these middle men companies do? They look out for themselve and only themselve. It is time the power was taken away from the big corperations and given back to the people who really deserve to be paid. Those who created it in the first place.


David of PC Sport Live

Wow, the link spammers are following up to each other’s posts! Actually, it’s possible that the “David” of the second post is the same person who wrote the first post, even though he (or she?) portrays himself as being a different person. I’ve noticed they do that a lot. You can often tell, from a combination of the writing style and the link destination, that supposedly distinct commenters are really the same person.

The next day, someone followed up to “David”’s comment:

Copyright laws
Submitted by Anonymous on Mon, 2008-06-09 16:34

Hello Everyone,

Today, I note that RedHat Founder Bob Young also weighed in on the copyright issue :

A new open source software group has added its voice to the opposition against the Conservative government’s ( Canada )impending copyright reform bill. Lulu CEO Bob Young likens the legislation to banning screwdrivers because they could be used by burglars.


Young said the proposed bill will cater too heavily to the content industry and not to the engineers and software developers that are going to be most severely impacted by the new laws. The proposed anti-circumvention legislation, he said, is similar to making the use and ownership of screw-drivers and pliers illegal because they can be used to commit crimes such as burglary.

Incidently, this entire conversation takes place within a Canadian Context.

Young further says,

“The copyright philosophy behind the U.S. DMCA is that it’s illegal to do what software engineers do every day of the week and what they’ll have to continue to do in order to build better technology for all companies,” Bob Young, spokesperson for the Canadian Software Innovation Alliance (CSIA) and a former founder and CEO at Red Hat Inc., said. “The biggest concern is we’re going to have law substitute for good technology. We’re crafting these laws without having anyone from the technology industry engaged in the process.”

The complete article is here itworldCanada.com

An interested internet marketing guy

Hmmm. It’s clumsily written, and consists mostly of quotes from someone else, but there’s real content there: that quote about the screwdriver is terrific. I have to admit that the comment actually contributes something to the site. I think it probably lies somewhere between typical paid link spam and a real comment: it might be from a person who is actually associated in some permanent way with the business being linked to, and who just makes a habit of always signing his posts with a link back to his business. Or it might be the usual kind of paid link spam. I frankly can’t tell.

I could go on and on; the above is a tiny fraction of what we’ve been getting on the site. There are obvious spams, semi-spams, maybe-not-spams, clearly-not-spams, and every gradation in between. I sometimes have to exercise real judgement when doing comment moderation; it’s not always clear what’s spam and what’s not.

In fact, it is no longer possible to divide comments into “spam” and “not spam” in an unambiguous, binary way. A given comment can now fall into both categories. Paid-link spammers are humans, and may have genuine reactions to the articles they read, even though most of the time they’re reading primarily to get just enough of a sense of the topic to be able to write a drive-by comment. Editors will just have to deal with comments on a case-by-case basis. It may be possible to apply some automatable heuristics, but they will always be imperfect, because the problem of categorization has become arbitrarily complex.
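One such imperfect heuristic, suggested by the “Sedona” example above, is to notice when several comments link to the same destination. Here is a minimal Python sketch of that idea. It assumes comments are available as (id, body) pairs; the function names and the threshold are hypothetical, and the output is only a hint for a human editor, not a verdict.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Rough pattern for absolute URLs inside a comment body (plain text or HTML).
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def link_hosts(body):
    """Return the set of host names linked from one comment body."""
    hosts = set()
    for url in URL_RE.findall(body):
        host = urlparse(url).hostname
        if host:
            # Treat www.example.com and example.com as the same destination.
            hosts.add(host[4:] if host.startswith("www.") else host)
    return hosts


def repeated_link_hosts(comments, min_count=2):
    """Map each host to the ids of comments linking to it, if it appears
    in at least min_count comments. `comments` is a list of (id, body)."""
    by_host = defaultdict(list)
    for comment_id, body in comments:
        for host in sorted(link_hosts(body)):
            by_host[host].append(comment_id)
    return {h: ids for h, ids in by_host.items() if len(ids) >= min_count}
```

A host that shows up under many supposedly distinct commenters is worth a closer look, but a regular who always signs with a link to his own business would trip the same wire, so the editor still has to make the call.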

This phenomenon has implications for both site editors and software designers. For the former:

  • Site editors need to get it through their heads that they’re editors. That is, they’re responsible for the quality of the site, and that includes the comments. Whether a comment has spam-like commercial links in it or not is not the question. The real question is, Does that comment contribute anything useful to the site? It’s true that there is a strong correlation between commercial links and poor quality, but it’s the poor quality that’s the problem. If a comment is good but has commercial links, you don’t have to throw out the baby with the bathwater: just replace the links with some text like “[commercial link deleted]”. (It’s important to leave some visible sign that the comment has been edited, otherwise the commenter is effectively being misquoted by your site.)

  • Site editors need to educate their readers about the situation. Most readers don’t run web sites with open comment forums, and most also have no idea that this whole paid-link comment spam problem even exists (partly because the editors have been protecting them from it). When you accidentally delete one of their comments, or they see comments disappearing, they’ll wonder why; you should have an explanation at the ready, and should refer to it often. (I’m about to update QuestionCopyright.org’s editorial policy to reflect this, and then will link more prominently to that policy from various places on the site.)
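The link-replacement step mentioned in the first point is easy to automate once an editor has decided a comment is worth keeping. The following is a rough Python sketch, assuming comment bodies are stored as HTML or plain text; the function name and the regular expressions are illustrative assumptions, not part of Drupal or any other CMS.

```python
import re

# Placeholder wording taken from the text above; the rest is illustrative.
PLACEHOLDER = "[commercial link deleted]"

# An HTML anchor element, including its link text.
ANCHOR_RE = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)
# A bare URL such as http://example.com/page or www.example.com.
BARE_URL_RE = re.compile(r"\b(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE)


def strip_links(body: str) -> str:
    """Replace every link in a comment body with a visible placeholder,
    so readers can see that the comment was edited."""
    body = ANCHOR_RE.sub(PLACEHOLDER, body)
    return BARE_URL_RE.sub(PLACEHOLDER, body)
```

Because the placeholder stays visible in the text, the commenter is not silently misquoted. A real implementation would probably want a proper HTML parser rather than regular expressions.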

The implications for software designers (particularly of content-management systems such as Drupal, which is what we’re running on QuestionCopyright.org) are equally important:

  • Stop thinking about spam-filtering as the problem of filtering a few spams out of the stream of hams. It’s the other way around: the spams are the stream, and the problem is to pick out the rare hams. Please design interfaces accordingly!

    If it takes two clicks plus a request/response loop with the server just to see the full body of a new comment, and then another click-plus-loop to mark it as spam or ham, and then another click to confirm, then site editors will waste the majority of their time clicking and waiting. If I have to visit comments by visiting the articles to which they are attached, then the interface is mis-tuned: the operative unit should be the comment, not the article. Since most new comments are obviously spam, I don’t need to see the original article to mark them as such. They are the common case, and the interface should be optimized toward them.

    The ideal interface would present the editor with a single page showing all as-yet-unmoderated comments, with their full texts (or arrange them in groups of 20, or whatever, if that would make the page too long). They would all be presumed spam, and the editor’s job would simply be to glide down them marking the hams. Each comment would have a link allowing the editor to see the original article, in case that context is necessary (though it rarely is). Each comment should have a flag next to it indicating whether or not there are any links in the comment at all — if a comment has no links, it is much less likely to be paid spam, and therefore much more likely to be high quality. This flag would enable the editor to set her expectations, which is a great help when faced with hundreds of comments.

  • Don’t make the parent->child threading relationship between comments fatal. That is, if comment B is a reply to comment A, but A is later classified as spam and deleted, B should not be deleted along with it. Users often click “reply” just as a way of making a new comment; the fact that the comment they’re replying to is spam has no bearing on the quality of the new comment. B may not be spam, even if A is. (The version of Drupal we’re currently running at QuestionCopyright.org gets this wrong, unfortunately, but newer versions may have fixed it.)

  • Have a setting that allows editors to simply prohibit links. It’s a policy decision that each site must make on its own (some links are useful, as we saw in one of the examples above), but the option should be there.
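To make the first two suggestions concrete, here is a small Python sketch of the data model such an interface might sit on: every comment is presumed spam until marked as ham, each carries a has-links flag, the queue is paged in groups of 20, and removing a spam parent does not take its replies down with it. All names here (`Comment`, `pages`, `finish_moderation`) are hypothetical and the in-memory dictionary stands in for whatever storage a real CMS uses; this is not how Drupal implements comments.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

# Rough test for "contains a link": an HTML anchor or a bare URL.
LINK_RE = re.compile(r"<a\s|https?://|www\.", re.IGNORECASE)


@dataclass
class Comment:
    id: int
    body: str
    parent_id: Optional[int] = None  # id of the comment this one replies to
    is_ham: bool = False             # presumed spam until an editor marks it

    @property
    def has_links(self) -> bool:
        """Flag shown next to the comment to set the editor's expectations."""
        return LINK_RE.search(self.body) is not None


def pages(comments: List[Comment], page_size: int = 20) -> List[List[Comment]]:
    """Split the unmoderated queue into pages of full-text comments."""
    return [comments[i:i + page_size] for i in range(0, len(comments), page_size)]


def finish_moderation(comments: Dict[int, Comment]) -> Dict[int, Comment]:
    """Keep only the comments marked as ham. A kept reply whose parent was
    removed is re-attached to its nearest surviving ancestor, or becomes a
    top-level comment if there is none."""
    kept = {cid: c for cid, c in comments.items() if c.is_ham}
    for c in kept.values():
        pid = c.parent_id
        while pid is not None and pid not in kept:
            parent = comments.get(pid)
            pid = parent.parent_id if parent else None
        c.parent_id = pid
    return kept
```

The editor's only action per comment is flipping `is_ham`; everything left untouched is discarded in one pass, which matches the common case where nearly the whole queue is spam.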

Any other ideas, folks? It’s a whole new world out there…

It’s over.

I don’t mean the Democratic Primary, I mean the general election. The whole thing. Barack Obama is going to walk over John McCain like a piece of gum on the sidewalk.

Watching John McCain speak tonight, I was reminded of a principle my Go teachers often mentioned: don’t make a move to which your opponent’s best response would be a move he wanted to make anyway. I often did that: I’d put a stone down on the board, and my opponent’s response would simultaneously counter what I had done and serve some other purpose useful to my opponent. Those opponents who were trying to teach me would ask “Why did you make me stronger? I wanted to go there anyway.”

John McCain is making this mistake with Barack Obama. He’s handing Obama exactly the debating points Obama wants. He accuses Obama of thinking that government can provide solutions, but people remember Katrina, and Obama wants a chance to say specifically what he thinks government can do. He accuses Obama of being willing to engage in diplomacy with rogue regimes, but people remember Iraq, and Obama has a whole foreign policy debate he’d just love to get into with John McCain. He accuses Obama of turning to the past for answers; but people look back fondly on the past, because the present is so tarnished. Obama is only too happy to remind people how much better run this country used to be.

When you get right down to it, Obama just understands how this game is played, and McCain apparently does not. Yes, it helps that Obama is smarter, more charismatic, and genuinely has better policies. (That golden baritone voice and his so-ready-for-this wife Michelle don’t hurt either.) But it also helps that John McCain doesn’t seem to realize how easy it is going to be for Obama to turn McCain’s talking points into a real debate, and win it.

Of course, due to the electoral college mess, it’s hard to predict how close things will be in the fall. But something would have to go seriously wrong for Obama not to pull this one off by a wide margin in the popular vote. John McCain is going to give Obama opportunity after opportunity to show how different he really is, and it’s only going to persuade more people to vote for Obama.

If I were running John McCain’s campaign, I’d start reading some Go books.