Thread Theory

Now this is a post I’ve been wanting to write for a while…

Conversational threads on the Internet should be independent of the medium they’re displayed in. That is, mailing lists, newsgroups, and web forums are all the same thing. They’re just different interfaces to the same data structure, and that data structure is the thread: a series of messages with directional reply relationships to one another.

The thread is the fundamental unit of conversation on the Internet, and the message relationships in a thread are independent of the interface used to access it.

One reason I’m ranting about this is that people sometimes set up “web forums” that behave like mailing lists except that you can’t read them with your regular email client. This is a tragedy, because for some of us our relationship with our email client is like that of a chef with his knives. You know how chefs usually travel with their own knives, rather than use someone else’s? I like to travel the Internet with my own knives: when I’m reading threaded information, I want to do it with my threadreader, that is to say, my mailreader. I’ve spent a lot of effort making it behave in precisely the way that works best for me, and there’s no reason those behaviors shouldn’t apply to all the threaded information I encounter on the Internet. I think I’m not alone in this.

Some might object that it’s equally tragic that people set up mailing lists instead of web forums. But it’s not, really. Mailing lists at least have a mature and well-understood programmatic interface — email — that allows sites like Gmane to provide the data via different interfaces. Gmane, by the way, seems to have independently arrived at the same conclusion about threads being independent from the interfaces used to access them.

The web forums example is just the tip of the iceberg. If the lesson that threads are the fundamental units had really embedded itself deeply into the programming world, we’d have mailing list software that would allow you to subscribe to just one thread at a time (or, if the list server software wouldn’t do it, most client software would do it by filtering). Blog comment threads would be accessible via a standard thread interface that would allow them to be presented as mail, news, web forums, blog comments, or chopped lettuce. It’s all the same thing: a message and a bunch of followups, some of which have sub-followups, and so on (see Directed Acyclic Graph).

I’m not asking everyone to use my mailreader, by the way. Indeed, I think it would probably drive most people insane. I’m merely saying that it would be good for us to have an explicit understanding of threads as the unit to focus on. What is a mailing list? It’s a big thread with a lot of subthreads. The charter (the declared purpose or topic) of the mailing list is an implicit initial message, and every message posted to the list is a “reply” to it; many of those messages are also replies to each other, of course.

What’s a blog post? It’s an initial message, and the comments are replies (and some comments are replies to replies, etc).

What’s a web forum post? It’s… Well, by now you can tell where this is going. I’ll stop here.

The world needs a unified conception of The Thread, something we can write programming interfaces for that then any client software can implement. This is not completely trivial, because in most messaging systems, the threads are not explicitly recorded. Instead, metadata attached to each message just says what earlier messages it is a followup to, and the threads are reconstructed from this. There is a certain richness, a possibility of variety, inherent in doing things that way, and we wouldn’t want to clobber that by making too naive a threading standard.
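
As a rough illustration of that reconstruction step, here is a minimal Python sketch. It assumes each message is reduced to a dict with an `id` and an optional `in_reply_to` field; those field names and the `build_threads` and `render` helpers are made up for this example and do not come from any real mail library. A message whose parent is not in the collection is treated as the start of its own thread, which is one simple policy among several possible ones.

```python
from collections import defaultdict


def build_threads(messages):
    """Group messages into threads using only per-message reply metadata.

    `messages` is an iterable of dicts with an "id" key and an optional
    "in_reply_to" key (hypothetical field names). Returns (roots, children),
    where `children` maps a message id to the ids of its direct replies.
    """
    messages = list(messages)
    known = {m["id"] for m in messages}
    children = defaultdict(list)
    roots = []
    for m in messages:
        parent = m.get("in_reply_to")
        if parent in known:
            children[parent].append(m["id"])
        else:
            # No parent, or a parent we never received: start a new thread.
            roots.append(m["id"])
    return roots, dict(children)


def render(roots, children, depth=0):
    """Return the thread outline as a list of indented lines."""
    lines = []
    for msg_id in roots:
        lines.append("  " * depth + msg_id)
        lines.extend(render(children.get(msg_id, []), children, depth + 1))
    return lines


sample = [
    {"id": "a"},
    {"id": "b", "in_reply_to": "a"},
    {"id": "c", "in_reply_to": "a"},
    {"id": "d", "in_reply_to": "b"},
    {"id": "e", "in_reply_to": "never-seen"},
]
roots, children = build_threads(sample)
outline = render(roots, children)
```

Nothing in the input says "this is thread number 7"; the threads fall out of the reply pointers, which is the property a threading standard would need to preserve.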

I wish I had time to work on it, but I don’t. I only have time to rant. So I’m ranting, in the hope that other programmers will agree that viewing all messaging systems as manifestations of The Mighty Thread is a good way to look at things. Maybe someone can turn this idea into something useful.

16 Responses to “Thread Theory”

  1. Chuck Burgess Says:

    That is a really interesting observation. Looking at those different system setups from that perspective, it looks like a simple datastore of threads could be wrapped by a single interface spec that could then be implemented across all those UI types. It sounds too simple _not_ to do it…

    /me wonders what he can talk his PEAR buddies into trying in 2008…

  2. David Glasser Says:

    Does Atom (the syndication format, as well as the publishing protocol) fit the model you are looking for?

  3. Karl Fogel Says:

    Chuck,

    I think there is a natural, normalized database representation for threads, yes (it’s probably been implemented many times). The more complex question is how to represent the messages, since we don’t want to lose the property of the threads being an emergent property of the relationships between messages. Not that it’s rocket science: I think you’re probably right, it’s too simple not to do it :-).

    David,

    Well, from reading the Wikipedia page on the Atom standard, I think it’s trying to solve somewhat different problems. In particular, there doesn’t seem to be a concept like “Message-ID” there, or a standard for referring to other messages to indicate a reply relationship. (I haven’t read the standard carefully, though; please correct me if I’m wrong.)

  4. Chuck Burgess Says:

    For storing the threads, seems like it’s only a matter of keeping a “parent message” pointer with each thread record, and pulling the records back together to show their interrelationships should be just rebuilding the tree. Once you wrap one interface around reading/writing/restoring those threads, seems like building an individual adapter/facade for each of email client, news client, forum client, etc atop that interface would be the fun but still-fairly-straightforward part.
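
    A minimal sketch of that "parent message" pointer idea, using Python's built-in sqlite3 module. The table layout and the helper names (open_store, add_message, thread_of) are assumptions made up for illustration, not an existing spec. The recursive query assumes the stored pointers contain no cycles.

```python
import sqlite3


def open_store():
    """Create an in-memory store with one row per message (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE message (id TEXT PRIMARY KEY, parent_id TEXT, body TEXT)"
    )
    return db


def add_message(db, msg_id, parent_id=None, body=""):
    """Record a message together with its parent pointer (None for a thread root)."""
    db.execute("INSERT INTO message VALUES (?, ?, ?)", (msg_id, parent_id, body))


def thread_of(db, root_id):
    """Return (id, depth) rows for root_id and everything below it."""
    return db.execute(
        """
        WITH RECURSIVE thread(id, depth) AS (
            SELECT id, 0 FROM message WHERE id = ?
            UNION ALL
            SELECT m.id, t.depth + 1
            FROM message m JOIN thread t ON m.parent_id = t.id
        )
        SELECT id, depth FROM thread ORDER BY depth, id
        """,
        (root_id,),
    ).fetchall()


db = open_store()
add_message(db, "a", body="first post")
add_message(db, "b", "a")
add_message(db, "c", "a")
add_message(db, "d", "b")
add_message(db, "x", body="unrelated thread")
rows = thread_of(db, "a")
```

    An email, news, or forum adapter would then sit on top of functions like these rather than talking to the table directly.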

    As I continued thinking about all this yesterday, I got to thinking about where the actual storage should be. For email, you typically have local and/or remote… for news, the true storage is remote though your client can keep local copies… for forum, it’s all remote and you reach it via webpage. Typically I’d expect you to keep your own “thread archive”, implying all threads that you actively pull in will remain available to you in your local archive, but I could also see a use case for an ultra-light client program that holds no local storage whatsoever… at best, its local archive would only contain some kind of URL pointer that can get you to where the “one true” remote thread is actually located. This use case makes perfect sense when considering forums and news, but would obviously mean for email you actually have an email account that allows server-side storage (e.g. imap).

    I’m still enjoying thinking this out :)

  5. Karl Fogel Says:

    One thing to keep in mind is potential multi-parent situations. Email headers got this all worked out years ago: they’ve got the “Message-ID” header to give every message object in the universe a unique identifier (but see below about “Supersedes”); the “References” header to indicate all the messages (or at least the first and last-so-far) in the thread preceding this one; the “In-reply-to” header to indicate which message this one is replying to; and the (I think rarely used) “Supersedes” header (sometimes apparently called “Replaces”) for when you need to have a new version of an already-posted message.

    I don’t think “In-reply-to” is supposed to ever list multiple IDs, but you could have a situation in which there is only a “References” header and no “In-reply-to”. Also, nothing prohibits cycles, as far as I know: message A can be in-reply-to message B which is in-reply-to message C which is in-reply-to A.
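
    Since a cycle like that would send a naive tree-builder into an endless loop, any thread code needs some guard against it. Here is a small Python sketch, assuming the reply relationships have already been extracted into a plain dict mapping each message ID to the ID it replies to; the find_cycle name and the dict layout are invented for this example.

```python
def find_cycle(in_reply_to, start):
    """Follow reply pointers from `start`.

    Returns the list of message ids that form a cycle, or None if the chain
    ends at a message with no recorded parent.
    """
    position = {}  # message id -> index in `path` where it was first seen
    path = []
    current = start
    while current is not None:
        if current in position:
            # We have come back to an earlier message: everything from
            # there onward is the cycle.
            return path[position[current]:]
        position[current] = len(path)
        path.append(current)
        current = in_reply_to.get(current)
    return None


cyclic = {"A": "B", "B": "C", "C": "A", "D": "A"}
acyclic = {"X": "Y", "Y": "Z"}
```

    What to do once a cycle is found (drop one link, pick the earliest-dated message as the root, and so on) is a policy choice this sketch leaves open.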

    None of these things are showstoppers, though. I do think it’s important to separate message body storage from the thread structure. All the relationships here are represented in the metadata: the message bodies could be absent entirely, yet the relationships would still be the same.

    The idea of a “one true thread” may be dangerous. Email and news headers are arranged so that newly-encountered messages can be incorporated into a given thread store. It may be that no single archive has all the messages in a particular thread — in fact, incompleteness could be the common case. For example, if I reply privately to a public mailing list post by you, only you and I have the reply, but it’s still part of that thread.
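
    To make the incompleteness concrete, here is a Python sketch of a thread store that holds metadata only and accepts messages in any order. The ThreadStore class and its method names are hypothetical. A message that has been referred to but never received gets a placeholder entry, so the structure can be filled in later if the message ever shows up.

```python
class ThreadStore:
    """Metadata-only thread store that tolerates missing messages."""

    def __init__(self):
        self.parent = {}   # message id -> id it replies to (None if unknown/none)
        self.have = set()  # ids of messages this archive has actually received

    def add(self, msg_id, in_reply_to=None):
        """Incorporate a newly encountered message into the store."""
        self.have.add(msg_id)
        self.parent[msg_id] = in_reply_to
        if in_reply_to is not None:
            # Create a placeholder for the parent without overwriting
            # anything we already know about it.
            self.parent.setdefault(in_reply_to, None)

    def missing(self):
        """Ids that are referred to by some message but were never received."""
        return set(self.parent) - self.have

    def root_of(self, msg_id):
        """Walk up the reply chain as far as this archive's knowledge goes."""
        seen = set()
        while self.parent.get(msg_id) is not None and msg_id not in seen:
            seen.add(msg_id)
            msg_id = self.parent[msg_id]
        return msg_id


store = ThreadStore()
store.add("reply2", in_reply_to="reply1")   # arrives before its parent
missing_first = store.missing()
root_first = store.root_of("reply2")
store.add("reply1", in_reply_to="post")     # the parent turns up later
missing_second = store.missing()
root_second = store.root_of("reply2")
```

    Two archives holding different subsets of the same conversation would report different roots and different missing sets, and neither one would be wrong.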

    So although the thread is the fundamental unit, it’s also a very amorphous one.

    I found these pages useful:

    Message Threading in E-mail Software

    List of mail & news headers

    Wikipedia section on email headers

  6. Denny Schlesinger Says:

    Karl:

    CompuServe forums had beautiful threading and you could keep the data on your own machine, a necessity back then because connections were slow and expensive. When I first tried a web forum it was ten steps back from CompuServe and my first web project was to reproduce the CompuServe threading on the web. Unfortunately, my web programming skills back then were not up to the task and I had to abandon it.

    A thread is really just a tree with links back and forth between parent and child. Siblings are ordered by age (ascending Id will do).

    I have yet to see a web forum implement proper threading. The closest I have seen is The Motley Fool discussion boards, but that is proprietary code and the messages are practically unsearchable.

  7. Karl Fogel Says:

    How do you feel about Gmane forums?
    All the code there is free, I’m pretty sure.

  8. Denny Schlesinger Says:

    Karl:

    I couldn’t find a forum at Gmane but I found some very interesting things. First, they have a threading diagram just like the one CompuServe used to use. I’m very comfortable with that and it effectively turns the mailing list into a forum format confirming what you said:

    Conversational threads on the Internet should be independent of the medium they’re displayed in. That is, mailing lists, newsgroups, and web forums are all the same thing. They’re just different interfaces to the same data structure, and that data structure is the thread: a series of messages with directional reply relationships to one another.

    The one difference that stood out is that in a forum you don’t need to quote as much as when replying to email since you are already following the thread and, when in doubt, you can always bring up the original instantly.

    Second, I chanced on a discussion about mailing lists vs. forums :)

    http://thread.gmane.org/gmane.comp.emulators.wine.devel/57000/focus=57072

  9. Karl Fogel Says:

    I think what you encountered at Gmane *are* forums, actually. They’re also mailing lists :-). You can reply to them from Gmane, without leaving your web browser.

    I usually use quoting to indicate precisely what part of a message I’m responding to. Actually, one of my gripes about a lot of web-based forum software is that it usually doesn’t paste in the original message (with “>” or other reply markers) automatically, thus making quoting more work than it needs to be. (For example, my blog’s comment software doesn’t give me quote marks for the comment I’m following up to; if it had, I would have done so here.)

    Gmane’s forums get it right, though: if you choose the ‘followup’ action, it’ll provide you with the quoted message, email-style.

  10. Barry Warsaw Says:

    Karl, I couldn’t agree with you more. I’ve been thinking about many of the same issues over the years and plan to do something about it in Mailman 3. I’ve written about some of my plans in the mailman-developers mailing list, and that would be a good place for people to visit if they want to participate in making this real.

    GNU Mailman: http://www.list.org
    Mailman wiki: http://wiki.list.org

  11. Karl Fogel Says:

    That is great to hear (speaking as a heavy Mailman user), thanks!

  12. Silona Says:

    I completely agree they really are similar data.

    Juliette Melton and I were also talking about this at the tools of change conference in NYC.

    We also wanted to take it up a notch and allow people to vote on and tag threads as well so that really good ones could migrate easily into FAQs and knowledge bases.

  13. Silona Says:

    also … you know what I want to do with connect the dots…

    all pieces are stored in a db and the user can change the connection btn the datapoints between different metrics such as date or popularity.

  14. Karl Fogel Says:

    (Hi, Silona!) Yeah, partial automation of FAQ maintenance would be a great thing, and is long overdue. Although what would work best there might be not the thread (which is long and discursive and in some ways ill-defined, since it can include different messages depending on which database you draw it from), but rather a summary of the thread, with references to particular messages. Something like what SummaryDesk was started to do, but unfortunately that project is stalled due to limited time. :-)

  15. Eitan Greenberg Says:

    I wholeheartedly agree. I live inside my mail client and prefer mailgroups to forums because I can view the data in my customized view – and all in one place.

  16. micah Says:

    Hmmm, so everything is a thread? Or perhaps….. a wave?! Gasp!

    Señor Fogel, I advise you to initiate patent-infringement proceedings immediately.
