Generic Language

Algebraic data types: without them, we all suffer


Perhaps you’ve heard of “stringly typed” code. The use of strings instead of actual, you know, proper data structures, in places where it seems completely inappropriate.

Why do programmers write such things? I’m sure some of it is simple bad programming. But there are some legitimate advantages to stringly typed code, if you can manage to hold your nose long enough to give it a fair evaluation.

I can’t in good conscience write a list of advantages without also mentioning the disadvantages:

Okay. So, the question is: can we get rid of these advantages? Well, let’s take one small step in the right direction to: JSONly typed code. It’s all the rage these days. You tend to see it less “in-the-small” compared to stringly typed, so you’re less likely to find simple functions expecting a JSON true/false object the way you can find methods expecting one of “true” or “false” as strings. But it shows up everywhere, regardless of appropriateness these days. Configuration files, same-machine IPC mechanisms, databases…

Let’s compare it to the advantages of stringly typed code:

So, a slight cost in using a library, but with mostly the same advantages and with one extra really huge one:

We can actually represent tree-like data.

Think about your standard, popular, typed languages: C, C++, Java, C#. Actually representing tree-like data is a horrible mess of writing a ton of classes along with visitor interfaces and so on. It’s a pain.

Think about your standard, popular, relational databases: MySQL, Postgres, SQL Server. How do you represent tree-like data? I’m not even sure this admits a one-sentence description, so let’s go with: it’s a pain.

Dynamic languages (equivalently, NoSQL databases) manage a little better, simply because they don’t have to bother with writing down overly-verbose boiler-plate code. JSON as a datatype brings these advantages even to typed languages.

But it comes at the cost (of course) of actually having meaningful types. Consider still the same disadvantages listed for stringly typed code:

So, how do we completely fix all these disadvantages, while retaining all the advantages? Well, the answer is: algebraic data types.

JSON, after all, is just one simple algebraic datatype:

data JSONValue
 = JSONNull
 | JSONBool   Bool
 | JSONFloat  Double
 | JSONString String
 | JSONArray  [JSONValue]
 | JSONObject (Map String JSONValue)

That’s some simplified Haskell, but the important thing is: we’re almost done. All we need is some code to serialize/deserialize. We could even have Haskell make up its own code to do this for us, but it wouldn’t actually be the JSON syntax.

Compare that to the equivalent code we’d have to write in, say, Java. 7 short lines versus, what? A few hundred? And the code we write to consume/build these data structures is also going to be quite a bit simpler and shorter.

Let me pick another example. How about a (simplified) representation of C types?

data CType
 = CTypeBuiltin   [Qualifier] BuiltinCType
 | CTypePointer   [Qualifier] CType
 | CTypeTag       [Qualifier] CName
 | CTypeArray     CType (Maybe Int)
 | CTypeFunction  CType [CType]

Not too shabby for 6 lines. But consider what it would take to do this with classes.

Then consider what choice you’d make, if you wanted to store a table mapping names to types in SQLite.

As for the relative advantages and disadvantages:

The lack of algebraic data types in modern languages and relational databases is, in my opinion, the biggest cause of frustration, error and bad code since the lack of generics. I hope one day even conservative languages (like C and Java) will incorporate them.