Algebraic data types: without them, we all suffer

Perhaps you’ve heard of “stringly typed” code. The use of strings instead of actual, you know, proper data structures, in places where it seems completely inappropriate.

Why do programmers write such things? I’m sure some of it is simple bad programming. But there are some legitimate advantages to stringly typed code, if you can manage to hold your nose long enough to give it a fair evaluation.

  • Strings behave like actual values. You can pass them around and copy them quite easily in pretty much all languages.
  • You don’t have to write any serialization/deserialization code. Woohoo!
  • Write-fast programming! Not having to design schemas and write data structures with all the associated “boilerplate” code is the same reason many give for using dynamic languages.

I can’t in good conscience write a list of advantages without also mentioning the disadvantages:

  • Difficult to use and understand. No clues given for what values should be passed in, or might be expected to come out. Documentation becomes harder and more important, and documentation is already rarely good enough.
  • Difficult to refactor. Since way too many things are strings, it’s very hard to find all the implications of a code change. Testing also becomes harder.
  • Biggest source of security vulnerabilities since buffer overflows. Not just SQL-injection, it’s very easy for validation code to disagree with all the possibilities for how the code consuming the string might behave. To see both of these sorts of bugs in action at the same time, consider how mysql_escape_string was buggy and mysql_real_escape_string became necessary.

Okay. So, the question is: can we get rid of these advantages? Well, let’s take one small step in the right direction to: JSONly typed code. It’s all the rage these days. You tend to see it less “in-the-small” compared to stringly typed, so you’re less likely to find simple functions expecting a JSON true/false object the way you can find methods expecting one of “true” or “false” as strings. But it shows up everywhere, regardless of appropriateness these days. Configuration files, same-machine IPC mechanisms, databases…

Let’s compare it to the advantages of stringly typed code:

  • Stringly typed values are already difficult to mutate, so we can get value-like semantics from JSON by simply treating it as immutable without any real loss.
  • Serialization/deserialization code is already written for us in a library. Score!
  • All the “boilerplate” data structure construction / visiting code: also in the library.
  • Still no schemas to think about writing, so code fast! Zoom zoom!

So, a slight cost in using a library, but with mostly the same advantages and with one extra really huge one:

We can actually represent tree-like data.

Think about your standard, popular, typed languages: C, C++, Java, C#. Actually representing tree-like data is a horrible mess of writing a ton of classes along with visitor interfaces and so on. It’s a pain.

Think about your standard, popular, relational databases: MySQL, Postgres, SQL Server. How do you represent tree-like data? I’m not even sure this admits a one-sentence description, so let’s go with: it’s a pain.

Dynamic languages (equivalently, NoSQL databases) manage a little better, simply because they don’t have to bother with writing down overly-verbose boiler-plate code. JSON as a datatype brings these advantages even to typed languages.

But it comes at the cost (of course) of actually having meaningful types. Consider still the same disadvantages listed for stringly typed code:

  • It’s still completely up to good documentation to know what to pass in, and what to expect out. I think every web developer has encountered an infuriating API that gives undocumented JSON error responses sometimes.
  • Refactoring is still a concern, but the use of JSON is more often at interface boundaries where API stability is expected. But this can still result in seemingly innocent changes causing problems with clients, simply because the interface is under-specified.
  • Security vulnerabilities are apparently massively reduced, but I suspect this is partly because this is an area where security researchers haven’t yet had a play day. JSON does seem to significantly reduce the possibility of radically reinterpreting the meaning of a value, but “validation vs interpretation” mismatches are still possible, and I’m sure they’re out there.

So, how do we completely fix all these disadvantages, while retaining all the advantages? Well, the answer is: algebraic data types.

JSON, after all, is just one simple algebraic datatype:

data JSONValue
 = JSONNull
 | JSONBool   Bool
 | JSONFloat  Double
 | JSONString String
 | JSONArray  [JSONValue]
 | JSONObject (Map String JSONValue)

That’s some simplified Haskell, but the important thing is: we’re almost done. All we need is some code to serialize/deserialize. We could even have Haskell make up its own code to do this for us, but it wouldn’t actually be the JSON syntax.

Compare that to the equivalent code we’d have to write in, say, Java. 7 short lines versus, what? A few hundred? And the code we write to consume/build these data structures is also going to be quite a bit simpler and shorter.

Let me pick another example. How about a (simplified) representation of C types?

data CType
 = CTypeBuiltin   [Qualifier] BuiltinCType
 | CTypePointer   [Qualifier] CType
 | CTypeTag       [Qualifier] CName
 | CTypeArray     CType (Maybe Int)
 | CTypeFunction  CType [CType]

Not too shabby for 6 lines. But consider what it would take to do this with classes.

Then consider what choice you’d make, if you wanted to store a table mapping names to types in SQLite.

As for the relative advantages and disadvantages:

  • We get values the same way JSON does: immutability.
  • We can automatically generate serialization/deserialization code, as long as the particulars of the syntax don’t matter too much.
  • We don’t need to write any boilerplate.
  • The code precisely documents the range of valid values.
  • The code is easily refactorable. I haven’t gone into pattern matching and exhaustiveness checking, but algebraic data types are an absolute dream for “change something, fix compiler errors, wow everything just works!”
  • The opportunities for validation errors are virtually eliminated. Our C types example requires only checking to make sure that a CName is actually a valid identifier (and thus not, say, a pile of arbitrary code). It’s possible to perfectly enforce this by making the constructor of a CName type always check this, and throw an exception otherwise. (Something that can’t be done for JSON.)

The lack of algebraic data types in modern languages and relational databases is, in my opinion, the biggest cause of frustration, error and bad code since the lack of generics. I hope one day even conservative languages (like C and Java) will incorporate them.

This entry was posted in Programming and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s