2010

National Inventory of Legal Materials

A project I’ve been working on for a few months (with support from O’Reilly Media) is now public, and I’m really excited about it — it’s a good cause, and also has the potential to become a very interesting piece of software.

First, the good cause part. From Carl Malamud’s post on O’Reilly Radar:

The Law.Gov movement identified 10 core principles for the dissemination of primary legal materials in the United States. If you find a jurisdiction that violates one of those principles, you can enter in a bug report. The code for the bug tracker is open source and we have a bug tracker for the bug tracker so you can help us get this ready for production.

The legal bug tracker is a classic open source story. We started with the Media Bugs code base developed by Scott Rosenberg and his team… the basic Media Bugs code base was extensively reworked and repurposed to be adapted for the National Inventory of Legal Materials. …

In other words, we turned this:

Screenshot of mediabugs.org

Into this:

Screenshot of bugs.resource.org

The transformation goes deeper than just appearance. The bug report fields themselves are different. For example, MediaBugs.org uses bug types like “misquotation”, “faulty statistics”, “error of omission”, etc.:

Screenshot of mediabugs.org, reporting a bug with selection menu opened

Whereas the National Inventory of Legal Materials needed a checklist of potential violations: Does the jurisdiction charge for access to its laws? Is access vendor-neutral? and so forth:

Screenshot of bugs.resource.org, reporting a bug

You might think we had to change the database schema. But we didn’t!

Ben Brown’s PeoplePods codebase, on which both the MediaBugs and NILM Bug Tracker code are based, uses a clever trick: an indirection table (the ‘meta’ table) that allows PHP code to associate arbitrary fields to the central bug object, on the fly. In the code, you just start using the field, prefixed with “meta_”:

  <input type="text" class="text"
         name="meta_jurisdiction_contact_city"
         id="jurisdiction_contact_city"
         .../>

…and everything Just Works. The field springs into existence automagically — you can start using it everywhere, run SQL queries against it, etc.
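To make the trick concrete, here is a minimal sketch of that kind of indirection table, written in Python with SQLite rather than PHP. It is not PeoplePods' actual schema or code: the table names (`content`, `meta`), the column names, the helper functions, and the sample data are all assumptions made up for illustration.

```python
import sqlite3

# Hypothetical schema: one fixed table for bugs, plus a key/value table
# that can hold any extra field without a schema change.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE content (id INTEGER PRIMARY KEY, headline TEXT);
    CREATE TABLE meta (
        item_id INTEGER NOT NULL REFERENCES content(id),
        name    TEXT NOT NULL,
        value   TEXT,
        PRIMARY KEY (item_id, name)
    );
""")

def save_bug(headline, form):
    """Store a bug; any form field named meta_* becomes a row in `meta`."""
    cur = conn.execute("INSERT INTO content (headline) VALUES (?)", (headline,))
    bug_id = cur.lastrowid
    for key, value in form.items():
        if key.startswith("meta_"):
            conn.execute(
                "INSERT OR REPLACE INTO meta (item_id, name, value) VALUES (?, ?, ?)",
                (bug_id, key[len("meta_"):], value),
            )
    return bug_id

def get_meta(bug_id, name):
    """Return the value of an ad hoc field, or None if it was never set."""
    row = conn.execute(
        "SELECT value FROM meta WHERE item_id = ? AND name = ?", (bug_id, name)
    ).fetchone()
    return row[0] if row else None

# Made-up example data: the field exists as soon as something uses it.
bug_id = save_bug(
    "Statutes behind a paywall",
    {"meta_jurisdiction_contact_city": "Sacramento"},
)
```

In this sketch, `get_meta(bug_id, "jurisdiction_contact_city")` returns the stored city, and a field nobody has used yet simply comes back as `None`.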

Which brings me to the second part of this post.

If you’re a PHP programmer looking to help out on a good project, come talk to us. We can use a few more hands on deck, and you’d have to look far to find a better cause than helping to make primary legal materials freely accessible to the public. While the project’s first priority is to meet the needs of the National Inventory of Legal Materials (see the current bug and feature request list here), in the longer term, I think it might be possible to turn this codebase into a bug tracker generator: a system that allows people to quickly create and deploy a customized bug tracker.

Usually people address customized bug tracking needs in one of two ways: they make do with some existing tracker (e.g., filing non-software bugs in one of the many free software bug trackers), or they spend a lot of time and effort building a bug tracker from scratch or near-scratch. Neither solution is entirely satisfactory. In the first, users — often non-technical users — have to endure an interface that is clearly not attuned to the actual nature of the bug reports being filed. In the second, the burden of writing a tracker from scratch raises the organizational risk, so you end up spending a lot of time making sure the requirements have been spec’d out accurately, which of course is impossible.

This code base has the potential to drastically lower the cost of making a customized tracker. So far, I’ve simply forked the MediaBugs theme and made the customizations manually; the differences are not huge, and that was the fastest route to shipping. Eventually, I’d like to add an abstraction layer that encapsulates the customizations in a description file, which could either be generated via some kind of graphical administrative interface or just written out by hand:

{
  "bug": {
    "fields" :
      [
        { /* Stuff that goes on screen 1 of the submission process. */
          "summary": {
            "description" : "A summary of the bug.",
            "type" : "text"
          },
          "body": {
            "description" : "The main description body of the bug.",
            "type" : "text"
          },
          "contacted": {
            "description" : "Whether or not the bug source was contacted.",
            "type" : "boolean"
          },
          "response": {
            "description" : "If the source responded, what did they say?",
            "type" : "text",
            "display_if" : "contacted"
          },
          ...
        },
        { /* Stuff that goes on screen 2 of the submission process. */
          ...
        }
        ...
      ]
  }
  ...
}
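As a rough illustration of how a generator might consume such a description, here is a small Python sketch that decides which fields to show on a given screen, honoring "display_if". Nothing like this exists in the code base yet; the `DESCRIPTION` dict is a hand-written stand-in for screen 1 of the sketch above, and the function name `visible_fields` is hypothetical.

```python
# Hypothetical description, mirroring screen 1 of the sketch above.
DESCRIPTION = {
    "bug": {
        "fields": [
            {
                "summary": {"description": "A summary of the bug.", "type": "text"},
                "body": {
                    "description": "The main description body of the bug.",
                    "type": "text",
                },
                "contacted": {
                    "description": "Whether or not the bug source was contacted.",
                    "type": "boolean",
                },
                "response": {
                    "description": "If the source responded, what did they say?",
                    "type": "text",
                    "display_if": "contacted",
                },
            },
        ]
    }
}

def visible_fields(description, screen, answers):
    """List the field names to show on a screen, honoring "display_if"."""
    fields = description["bug"]["fields"][screen]
    shown = []
    for name, spec in fields.items():
        condition = spec.get("display_if")
        # A field with no condition is always shown; otherwise the field
        # it names must already have a truthy answer.
        if condition is None or answers.get(condition):
            shown.append(name)
    return shown
```

With no answers yet, the "response" field stays hidden; once the reporter has ticked "contacted", it appears after the other three.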

Again, that vision is for the long term. Right now, what we need are some PHP programmers who want to help make the current code base better serve its purpose of tracking legal access bugs. Once we’ve done that, there will be two trackers (the original MediaBugs, and the new NILM Bugs) sitting on top of the same foundational code, and we can use the difference between them to triangulate on the right set of abstractions to make a true tracker generator.

Take a look at the project, especially the README, and then ask questions if you think you might be interested in helping. It’s all open source, and your commitment can be as large or as small as you want it to be.

bugs.resource.org

There’s about to be an outcry over the possibility that U.S. Internet service providers might start charging by the byte — so-called “pay as you go” Internet service. Before the hard-headed economic realists duke it out with the participatory democracy free-speech propeller heads (I’m in both camps, so I say all that with love), here’s a modest proposal:

Charge based on the square root of the number of bytes.

Sometimes we think it’s natural that people should pay based on how much of something they use. The exceptions to that are interesting: while we’re okay with the principle when it comes to food (perhaps because people all generally use the same amount, within a narrow range), many are not okay with it when it comes to medical care. We’re mostly okay with it for electricity and water, even though consumption can vary widely and both goods are really required infrastructure for life in a modern society.

We haven’t really decided how we feel about it for Internet usage. It has a certain appeal: why should your neighbor stream online videos all day, slowing down everyone and transferring orders of magnitude more data over shared pipes than you do, yet pay the same amount per month?

One counterargument is that you and your neighbor are rate-limited and can only transfer a certain number of bytes per second at a maximum anyway, so if you consider the monthly charge to be paying for that maximum, it’s up to each user whether they want to use the full capacity each month or not.

That argument’s not very convincing, though, because the system isn’t physically capable of supporting everyone at a maximum simultaneously anyway. It’s a theoretical capacity only; in practice, the pipes are a shared resource, and the ISPs deal with that reality every day. People who consistently use more than their fair share of that resource should face a disincentive.

A better argument might be: bidirectional Internet access is so important to participation in society that we should find ways to subsidize it. A world where the rich have access to all the online video they want while the poor have to make do with ASCII art is a losing proposition for everyone.

But then how to make sure there’s disincentive to over-consume?

Charging by the square root of the number of bytes transferred per unit of time means that each user’s costs rise with usage, but with a much flatter curve than simply charging a straight rate per byte. You pay more, and if you use orders of magnitude more you pay noticeably more — but you don’t pay orders of magnitude more. The teenager who wants to upload her own movie will be able to do so, but doing things like that often enough will merit some consideration from the user, which is what we want.

(Obviously it doesn’t have to be exactly the square root function; I just mean some well-defined function that flattens the curve and can be intuitively “felt” by users given enough experience. Square root’s probably a good one, I’d guess, but I haven’t actually worked through the data. What do I look like, some kind of hard-headed economic realist?)
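For a feel of how much flatter the curve is, here is a tiny Python comparison of the two pricing rules. The rates and usage figures are made-up assumptions, not anything an ISP actually charges.

```python
import math

def linear_charge(gigabytes, rate_per_gb=1.0):
    """Straight per-byte pricing: cost grows in proportion to usage."""
    return rate_per_gb * gigabytes

def sqrt_charge(gigabytes, rate=1.0):
    """Square-root pricing: cost still rises with usage, but far more slowly."""
    return rate * math.sqrt(gigabytes)

# Compare a light user, a heavy user, and a very heavy user.
for usage in (1, 100, 10_000):
    print(usage, linear_charge(usage), sqrt_charge(usage))
```

Under these assumptions, someone who transfers 100 times as much data pays 100 times as much with the linear rule but only 10 times as much with the square-root rule.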

Dining Philosophers

I’m on a lot of conference calls lately. Sometimes the participants don’t all know each other, so we’d like to start by going around the table introducing ourselves.

The only problem is, there is no table.

Furthermore, there’s often no person who has a list of all the current participants, and even if someone did, people might dial in and drop off at random times.

I couldn’t think of any algorithm to solve this, at least not one that humans might be able to reliably run. But then I thought, why should the humans solve it? The conferencing system itself knows who’s dialed in! What if one could issue a command that would cause the central server to go around all its connections, asking each person to introduce themselves? Each participant would hear the system’s voice say “Please introduce yourself, then press # when done.” in their ear (and their ear only) when it was their turn. The central server could automatically adjust for newcomers and dropoffs while the process is under way, and it would know when everyone available has introduced themselves. While it’s telling a given person that it’s their turn, it could even tell all the other people their position in line, so they’d be prepared when it comes around to them.
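Here is one way the server-side bookkeeping might look, as a Python sketch. It assumes the conferencing system can call `join` and `drop` when connections change and `next_turn` when the round starts, when the current speaker presses #, or when the current speaker hangs up; the class and method names are hypothetical, and the voice prompts themselves are left out.

```python
from collections import deque

class IntroductionRound:
    """Tracks whose turn it is to introduce themselves on a call."""

    def __init__(self, participants):
        self.waiting = deque(participants)  # still to be prompted, in order
        self.introduced = set()             # already finished
        self.current = None                 # person being prompted right now

    def join(self, person):
        """A newcomer dials in mid-round and goes to the back of the line."""
        already_known = (
            person in self.introduced
            or person in self.waiting
            or person == self.current
        )
        if not already_known:
            self.waiting.append(person)

    def drop(self, person):
        """Someone hangs up; skip them if they had not finished yet."""
        if person in self.waiting:
            self.waiting.remove(person)
        if person == self.current:
            self.current = None

    def next_turn(self):
        """Advance to the next speaker.

        Returns (speaker, positions), where positions maps each waiting
        person to their place in line, or (None, {}) when everyone
        available has introduced themselves.
        """
        if self.current is not None:
            self.introduced.add(self.current)
        self.current = self.waiting.popleft() if self.waiting else None
        positions = {p: i + 1 for i, p in enumerate(self.waiting)}
        return self.current, positions
```

The `positions` mapping is what would let the system tell everyone else how far they are from their turn while the current speaker is being prompted.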

Does anyone know a conferencing service that does this?

(I’m posting this here partly to create prior art. Whoever implements this first will have a temporary advantage, but we don’t want them turning it into a permanent advantage by patenting the technique. Ideas are cheap — see, I’m giving this one away for free.)

Winston Hussein Frankenbama

I’ve mentioned before that you’re basically not a complete human being if you haven’t seen Will Franken perform (what I didn’t mention is that you’re also not a complete human being if you have seen him perform, but this time the incompleteness is a result of seeing him, because he’s simply that great — so let your incompleteness be the fulfilling, joyful incompleteness of knowledge, not the dim, pathetic incompleteness of ignorance).

But I digress.

What I’m trying to say is, Will Franken is performing a new show, Will Franken Rises From The Ashes at The Purple Onion (140 Columbus Avenue in SF) next weekend: Friday, October 1st and Saturday, October 2nd, at 8pm both nights.

Tickets are $20 at the door, but if you’re making this decision based on price, you’re insane. You should make this decision based on one factor and one factor only: “Am I within 100 miles of San Francisco on either of those two nights?” Oh, what, it’s your wedding night? Cancel. You can get married any time; your guests would rather see Will Franken anyway, trust me.

Srsly. I wouldn’t blow my 100th blog post on just anything. This man is an amazing performer. Go to the show, and bring all your friends. You will rock back and forth with disturbed laughter until your socks explode.

Go see Jessica Ferris’s “Missing” in San Francisco on July 10th. Stop thinking — there is no decision to make. Just do it. Buy tickets. Tell your friends. Then show up. It’s that easy.

I saw Jessica perform in San Francisco a few years ago and it was mesmerizing. And that was for a show that she didn’t even particularly try to push. Now for Missing, she’s pulling out all the stops: in her 10 reasons to see “Missing” on July 10th, she outright commands her fans — of whom I am one — to rally, saying “this show is worth it”.

Jessica Ferris in performance

This smartly-constructed dark comedy is a mix of autobiography, physical theater, and social commentary. Says one audience reviewer: “I felt stimulated and energized by such a smart, quick, complex piece.” Says another, “It’s hilarious to the point of tears, touching, and close to the bone. How can she make me laugh till I cry telling such a dark story?”

The heart of the show is Ferris’s search for the truth about her father, who disappeared under mysterious circumstances when she was two years old. She portrays the many members of her family who tell their versions of the story of his disappearance. …

Here’s that tickets link again.

These two quotes from the June 10th New York Times beg for juxtaposition:

The first is from the article about the oldest leather shoe ever discovered:

…an Armenian doctoral student, Diana Zardaryan, noticed a small pit of weeds. Reaching down, she touched two sheep horns, then an upside-down broken bowl. Under that was what felt like “an ear of a cow,” she said. “But when I took it out, I thought, ‘Oh my God, it’s a shoe. To find a shoe has always been my dream.’”

The second is from Hidden Misery: A Glimpse Into North Korea, which draws on interviews with North Koreans who have escaped what may be the most isolated, oppressive country on the planet. One of the article’s sources is the wife of an official in the ruling Workers’ Party (for obvious reasons, no other identifying information is given):

Those North Koreans who have never crossed the border have no way to make sense of their tribulations. There is no Internet. Television and radio receivers are soldered to government channels. Even the party official’s wife lacks a telephone and mourns her lack of contact with the outside world. Her first question to a foreigner was “Am I pretty?”

Saw another legitimate email bounced as spam today:

This message was created automatically by mail delivery software.

A message that you sent could not be delivered to one or more of its
recipients. This is a permanent error. The following address(es) failed:

  myfriend@myfriendsdomain.com
    (generated from myfriend@domain-on-shared-server.org)
    SMTP error from remote mail server after
    RCPT TO:<myfriend@myfriendsdomain.com>:
    host mx.service-myfriendsdomain-uses.com [216.122.171.54]:
    554 5.7.1 Service unavailable;
    Client host [67.152.129.89] blocked using
    hostkarma.junkemailfilter.com=127.0.0.2;
    Black listed at hostkarma
    http://ipadmin.junkemailfilter.com/remove.php?ip=67.152.129.89

In other words, a completely legitimate mail was bounced because people who use the same mail server as the recipient (or for that matter, the sender) receive too much spam.

Sound surprising? Here’s the scenario:

  1. Sender spammer@spammyspamspam.com sends bad (even virus-laden) email to innocentvictim@domain-on-shared-server.com.
  2. The innocentvictim@ account is configured to forward automatically to innocentvictim’s real email address, like ivictim@gmail.com or innovic@somepersonaldomain.com or whatever.
  3. The recipient domain (gmail or somepersonaldomain) is protected by a spam-filter (in gmail’s case, their own custom filter, in the latter case, a filter like junkemailfilter.com’s service).
  4. The spam filter simply sees spammy mail coming from the shared server.
  5. The shared server gets docked points for sending spam!
  6. Lather. Rinse. Repeat.
  7. After a while, legitimate people get bounced for sending legitimate mail to innocentvictim@domain-on-shared-server.com, because the filtering service that protects the recipient’s final account treats all the forwards as spam, without unpacking them.
  8. Furthermore, mail from innocentvictim@domain-on-shared-server.com starts getting auto-rejected by some recipients, because those recipients use the same filtering services as innocentvictim and, as we already know, innocentvictim’s mail server is being docked points because of all the spammy mail innocentvictim receives and auto-forwards.

In other words, a server from which many people forward mail tends to get blacklisted not because that server originates any spam, but because addresses there receive spam. And who doesn’t receive spam? Right. You begin to see the problem :-). Furthermore, it’s very hard for the filtering service to do better: if the spam-filtering service were not to dock points in that scenario, then the spammers would get clever and structure their original mails to just look like forwarded mails. They don’t care. In fact, they already do that sometimes.

So as far as I can tell, blacklists are kind of inherently broken. I’ve personally had to deal with this problem many times. What I did in this case was go to the URL mentioned in the bounce message and remove our shared server’s IP from the blacklist, using the procedure offered by junkemailfilter.com. But they’ll just re-add us soon, because the source of the problem isn’t going away.

One solution would be for the forwarding server to insert a special header (containing a unique code) into the mail before it passes the mail along to the final destination. Then, on the junkemailfilter.com side, the recipient would configure their filtering to let mails with that code through and never treat them as spam. However, that would be a lot of work for most email users, due to the heterogeneity of mail delivery software; I don’t see it as a generally applicable solution.
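As a sketch of what that header check might look like, here is a small Python example using the standard library's `email.message.EmailMessage`. The header name, the code, and both function names are hypothetical; this is not how junkemailfilter.com or any real filter is configured, and the code would have to stay secret, since a spammer who learned it could simply add the header too.

```python
from email.message import EmailMessage

# Hypothetical header name and code, agreed between the forwarding
# account and the recipient's filter configuration.
TOKEN_HEADER = "X-Forwarded-By-Token"
SHARED_TOKEN = "example-secret-code"

def tag_forwarded(msg):
    """Forwarding side: stamp the message before passing it along."""
    msg[TOKEN_HEADER] = SHARED_TOKEN
    return msg

def counts_against_sender(msg, looks_spammy):
    """Filter side: should this message hurt the sending server's reputation?"""
    is_known_forward = msg.get(TOKEN_HEADER, "") == SHARED_TOKEN
    # Spam that arrives via a known forward is still spam for the recipient,
    # but it is not evidence that the forwarding server originates spam.
    return looks_spammy and not is_known_forward

# Example: a spammy message relayed through the shared server.
relayed = EmailMessage()
relayed["From"] = "spammer@spammyspamspam.com"
relayed["To"] = "innocentvictim@domain-on-shared-server.com"
tag_forwarded(relayed)
```

The point of the design is to separate two questions the filter currently conflates: whether a message is spam, and whether the server that handed it over deserves the blame.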

Another solution would be an interface at junkemailfilter.com whereby users could tell it “I’m auto-forwarding mail to you from domain-on-shared-server.com. Please keep that in mind when deciding whether domain-on-shared-server.com is an originating source for spam.”

Any other ideas?

[Reblogged from my post at Talking Points Memo.]

Bob Ostertag has a short but lancing piece in the Huff Post today about how the New York Times got astroturfed by an organization calling itself the “Gulf of Mexico Foundation”. The NYT describes them as a “conservation group” when the evidence is that they are, essentially, an oil industry front.

I understand that the pressures of reporting a story as it happens are real and sometimes require cutting corners, but if you don’t have time to do fact checking, then why not simply avoid making any factual assertions you don’t have to make? Nothing about the story requires labeling them a “conservation group”. Just, you know, leave off the word “conservation” — that’s all it takes.

(Of course, doing some elementary digging into the group’s governance would have been even better, but failing that, the NYT could at least avoid doing their PR work for them.)

speaking

For folks in the SF Bay Area: I’m speaking at WordCamp San Francisco this Saturday at 10:30am: Bodysurfing the Blogosphere: How an Audience-Distributed Film Won Big.

It’s an in-depth look at how audience distribution worked for Nina Paley’s freely-licensed film Sita Sings the Blues. The rest of the WordCamp speaker schedule looks great too: Richard Stallman, Scott Berkun, Matt Mullenweg, Scott Rosenberg (I’m just bummed my talk is at the same time as his), and more.

Come on by if you’re in the area!

This is going to be a short one, but I can’t go to bed without sharing the news:

PUBPAT (the Public Patent Foundation) and the ACLU have just won a major victory for scientific freedom: the US District Court for the Southern District of New York has ruled in favor of their argument that patents on genes that cause hereditary breast cancer and ovarian cancer are invalid. And the court made the ruling for the best of reasons: that genes are a product of nature and not patentable in themselves.

BRCA1 gene molecule

I’m not a lawyer, but it looks to me like Judge Robert Sweet examined the question about as thoroughly as one possibly could — start around page 101 of the judgment (marked as page 98 in the text) for the details. While the judge was unwilling to go so far as to rule the patents invalid on constitutional grounds, as the plaintiffs had asked, he made it clear that he was avoiding it only because there was no need to reach that conclusion to decide the case (constitutional questions are traditionally avoided if there is another way to reach a judgment).

Congratulations to PUBPAT and the ACLU! Both of them are non-profits; neither is swimming in money, the former probably even less than the latter. You know what to do (I just did).