## A newbie’s guide to understanding Clang AST construction

I’ve just jumped into reading Clang code a bit, and these are some notes on how it works. Accuracy completely disavowed, I’m obviously no expert on Clang internals! (But, you know, corrections welcome.) I’m mostly interested in C, so I’ll probably skip some stuff important for C++ (and ObjC) here.

## A high-level overview

The general structure for compiling a file looks standard, something like “Lex -> Parse -> Sema -> AST -> whatever” where “whatever” might be something like “CodeGen”. These are all names of modules you’ll find under `clang/lib/` and the corresponding `clang/include/clang/`.

Lex and Parse have mostly the obvious responsibilities, with the exception that Parse appears to handle management of scopes (for name lookup). Sema is a module that tries to separate out from Parse the semantic analysis and AST construction code. Sema is pretty much a bunch of functions, called by Parse, that do type checking and AST construction.

That’s right: type checking is done during parsing of the source file, and not performed on the AST. I’m assuming this is an awful requirement imposed by C++ syntax. (And Parse handling scopes is probably the result of C syntax and the typedef hack.)

On the bright side, it does mean that ASTs, once constructed, have all type information already filled out. (For those familiar with the “AST typing problem” typically discussed in functional languages, Clang essentially resolves it by never having a tree data structure that lacks full type information.)

## Scope overview

The scoping mechanism is somewhat nifty. To see an example, look for `ParseForStatement` in `Parse/ParseStmt.cpp`. To open a scope, it first decides what kind (just a continue/break scope, or one also including declarations) and then creates a `ParseScope` object, which does the following:

```
/// ParseScope - Introduces a new scope for parsing. The kind of
/// scope is determined by ScopeFlags. Objects of this type should
/// be created on the stack to coincide with the position where the
/// parser enters the new scope, and this object's constructor will
/// create that new scope. Similarly, once the object is destroyed
/// the parser will exit the scope.
```

It’s just a lifetime management object: the scope opens where the constructor runs, and closes either explicitly, or when the destructor is called.
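
To illustrate the pattern (a toy sketch of my own, not Clang’s actual implementation), the shape is roughly:

```cpp
#include <iostream>
#include <vector>

// Toy model of the ParseScope idiom: constructing the object opens a
// scope, destroying it closes the scope. Not Clang's real classes.
static std::vector<const char *> scopeStack;

class ToyParseScope {
public:
    explicit ToyParseScope(const char *kind) {
        scopeStack.push_back(kind);
        std::cout << "entered " << kind << " scope\n";
    }
    ~ToyParseScope() {
        std::cout << "exited " << scopeStack.back() << " scope\n";
        scopeStack.pop_back();
    }
};

void parseForStatement() {
    // Decide what kind of scope is needed, then open it for the
    // duration of parsing the statement.
    ToyParseScope forScope("continue/break + declaration");
    // ... parse init/condition/increment and the body here ...
}   // forScope's destructor runs: the scope closes automatically

int main() { parseForStatement(); }
```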

Related to scoping, of course, is looking up identifiers. I’ll avoid the full details; let’s just look at how `ActOnIdExpression` in `Sema/SemaExpr.cpp` works. First, lookup results are stored in a `LookupResult` (`Sema/Lookup.h` and `Sema/SemaLookup.cpp`), but actual lookups are done by function calls, e.g. `LookupParsedName` (`Sema/SemaLookup.cpp`).

That just calls `LookupName`, which gets an iterator from an `IdResolver` which, you know what? Let’s not go further on that. Anyway, yeah, lookups happen, and at least you can see the `Scope` being passed into `LookupParsedName`.
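
Still, the conceptual picture is simple enough to sketch: lookup walks a chain of scopes outward. A toy model (mine, not Clang’s `Scope`/`IdResolver`):

```cpp
#include <iostream>
#include <map>
#include <string>

// Toy model of scope-chain name lookup; not Clang's Scope/IdResolver.
struct ToyScope {
    std::map<std::string, std::string> decls;  // name -> kind of declaration
    const ToyScope *parent;                    // enclosing scope, or null
};

// Walk outward through enclosing scopes until the name is found.
const std::string *lookupName(const ToyScope *s, const std::string &name) {
    for (; s; s = s->parent) {
        auto it = s->decls.find(name);
        if (it != s->decls.end())
            return &it->second;
    }
    return nullptr;  // Clang would record failure in a LookupResult instead
}

int main() {
    ToyScope global{{{"printf", "function"}}, nullptr};
    ToyScope local{{{"x", "local variable"}}, &global};
    if (const std::string *kind = lookupName(&local, "printf"))
        std::cout << "printf is a " << *kind << "\n";  // found in outer scope
}
```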

## Typechecking overview

I won’t look too much at the diagnostic system. It’s neat, if you want to check it out yourself. It handles things like translations (I assume), and emitting a relevant snippet of the original code to give context to an error.

The first thing to do, when trying to show an example of type checking, is to think of something in C that isn’t type correct. This is tricky in the land of C, where `if(0["Hello, world!"]);` is perfectly okay.

So, let’s go with structs. `if(expr_of_type_struct_foo)` ought to be an error, so long as that’s not a pointer, anyway. Looking at `ActOnIfStmt` goes nowhere, unfortunately, so we back up to the control flow of `ParseIfStatement` and discover that it calls `ActOnBooleanCondition` on the expression first.

This leads to `CheckBooleanCondition` and we find that it:

1. transforms the expression with `DefaultFunctionArrayLvalueConversion`, and
2. checks `isScalarType()` on the type of the resulting expression.
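
In toy form (a sketch of my own, not Clang’s actual code), those two steps amount to:

```cpp
#include <iostream>
#include <string>

// Toy model of CheckBooleanCondition's two steps; not Clang's code.
struct ToyType {
    std::string name;
    bool isArray, isFunction, isScalar;
};

// Step 1: the default conversions decay arrays and functions to
// pointers, and pointers are scalar types.
ToyType defaultFunctionArrayLvalueConversion(ToyType t) {
    if (t.isArray || t.isFunction)
        return {"pointer to " + t.name, false, false, /*isScalar=*/true};
    return t;
}

// Step 2: after conversion, a condition must have scalar type.
bool checkBooleanCondition(ToyType t) {
    return defaultFunctionArrayLvalueConversion(t).isScalar;
}

int main() {
    ToyType structFoo{"struct foo", false, false, false};
    ToyType intArray{"int[4]", true, false, false};
    std::cout << checkBooleanCondition(structFoo) << "\n";  // 0: diagnose an error
    std::cout << checkBooleanCondition(intArray) << "\n";   // 1: decays to a pointer
}
```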

This leaves us with looking at the default conversions, but C’s rules are complex and, well:

```
ExprResult Sema::DefaultFunctionArrayConversion(Expr *E) {
ExprResult Sema::DefaultLvalueConversion(Expr *E) {
ExprResult Sema::DefaultFunctionArrayLvalueConversion(Expr *E) {
ExprResult Sema::UsualUnaryConversions(Expr *E) {
ExprResult Sema::DefaultArgumentPromotion(Expr *E) {
```

There they are in `Sema/SemaExpr.cpp`.

## AST overview

Let’s pick function declarations as our example. Here’s a Create function for a `FunctionDecl` (found in `AST/Decl.h`):

```
static FunctionDecl *Create(ASTContext &C, DeclContext *DC,
                            SourceLocation StartLoc,
                            const DeclarationNameInfo &NameInfo,
                            QualType T, TypeSourceInfo *TInfo,
                            StorageClass SC,
                            bool isInlineSpecified,
                            bool hasWrittenPrototype,
                            bool isConstexprSpecified = false);
```

Here are a few notes about it:

• `ASTContext` appears to be relevant to memory management only: it’s essentially a region that owns all AST node memory.

• `DeclContext` is how contextual information is handled in the AST. The children have a pointer to their parent. We can find out what’s relevant (edited slightly):

```
tedinski@smorgasbord:~/repos/llvm/tools/clang/include/clang$ egrep -R "class.*:.*public DeclContext" `find . -name "*.h"`
./AST/DeclCXX.h:class LinkageSpecDecl : public Decl, public DeclContext {
./AST/Decl.h:class TranslationUnitDecl : public Decl, public DeclContext {
./AST/Decl.h:class NamespaceDecl : public NamedDecl, public DeclContext,
./AST/Decl.h:class FunctionDecl : public DeclaratorDecl, public DeclContext,
./AST/Decl.h:class BlockDecl : public Decl, public DeclContext {
./AST/Decl.h:class CapturedDecl : public Decl, public DeclContext {
./AST/DeclObjC.h:class ObjCMethodDecl : public NamedDecl, public DeclContext {
./AST/DeclObjC.h:class ObjCContainerDecl : public NamedDecl, public DeclContext {
```

That gives us a good idea of what our DeclContext means.

• `SourceLocation`. The actual representation of this is annoyingly complicated, but it identifies a specific character in a specific file, plus extra information, like include call stack, etc. I’m going to ignore the details but infer from the name “`StartLoc`” that it’s the start of the declarator. I tried to verify this and fell into a rabbit hole.

• `DeclarationNameInfo`. Before we figure out what this is, let’s back up a second. What’s a `FunctionDecl`?

```
class FunctionDecl : public DeclaratorDecl, public DeclContext,
                     public Redeclarable
class DeclaratorDecl : public ValueDecl
class ValueDecl : public NamedDecl
class NamedDecl : public Decl
```

Okay. What does each of these progressively want?

```
Decl(Kind DK, DeclContext *DC, SourceLocation L)
NamedDecl(Kind DK, DeclContext *DC, SourceLocation L, DeclarationName N)
DeclaratorDecl(Kind DK, DeclContext *DC, SourceLocation L,
               DeclarationName N, QualType T, TypeSourceInfo *TInfo,
               SourceLocation StartL)
```

Okay, let’s start with `DeclarationName`:

`DeclarationName(const IdentifierInfo *II)`

`IdentifierInfo` is gross, and I don’t want to look at it. I hope it’s just an artifact of trying to be super-efficient. But it seems to just be a name, plus some other info of (to me) unknown importance that’s hopefully not relevant to C. On to `DeclarationNameInfo`:

```
struct DeclarationNameInfo {
private:
  /// Name - The declaration name, also encoding name kind.
  DeclarationName Name;
  /// Loc - The main source location for the declaration name.
  SourceLocation NameLoc;
  /// Info - Further source/type location info for special kinds of names.
  DeclarationNameLoc LocInfo;
```

Not bad. There are an awful lot of operations on this, all of which seem to use `LocInfo` and to be C++-only, so they’re probably not relevant. For C, it’s pretty much just a name and a location. Okay.

• Going back to the things that go into `FunctionDecl`, way up there, next up is `QualType`, but I’m going to save this for last.

• Then there’s `TypeSourceInfo`. I just don’t even know. It’s probably something to do with locating the type part of the declarator (as opposed to the identifier part) in the source files. Its implementation is an abomination of who knows what, though.

• Next is `StorageClass`. From `Basic/Specifiers.h`. These are none/extern/static/etc.

• The boolean parameters to a `FunctionDecl` look mostly syntactic, though I’d say `hasWrittenPrototype` is semantic.

Then, there are more pieces the parser fills in after construction:

`NewFD->setParams(Params);`

I’m going to stop there in looking at ASTs. There’s probably a lot of confusion in that little dump of information up there, but let’s return to `QualType` instead.

## Type representation overview

At a conceptual level, this works simply enough, but the focus on efficiency makes everything more difficult (as usual.) Types are rather simple: there is `QualType`, which is essentially a pair of `Qualifiers` and a `Type`. Qualifiers are things like const, restrict, and volatile. `Type` has a subclass for just about everything: `BuiltinType`, `ComplexType`, `TypedefType`, `PointerType`, etc. There’s even `ParenType`, apparently just to preserve all of the source syntax. Structs are apparently called `RecordType`.

To see a full list, you can consult `AST/TypeNodes.def`.

If a type wraps other types, like `PointerType` for example, then those wrapped types are `QualType`s. This is, of course, how things like “`const char * const t`” get represented.
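
As a toy model (my own simplification, not Clang’s real classes), here’s how that nesting plays out for `const char * const`:

```cpp
#include <iostream>
#include <memory>
#include <string>

// Toy model of the QualType = (Qualifiers, Type) pairing; not Clang's classes.
struct Type;
struct QualType {
    bool isConst;                 // stand-in for the full Qualifiers set
    std::shared_ptr<Type> type;
};
struct Type {
    std::string kind;             // "char", "Pointer", ...
    QualType pointee;             // only meaningful for the Pointer kind
};

// Render a QualType, recursing through the wrapped QualTypes.
std::string show(const QualType &qt) {
    std::string quals = qt.isConst ? "const " : "";
    if (qt.type->kind == "Pointer")
        return quals + "pointer to (" + show(qt.type->pointee) + ")";
    return quals + qt.type->kind;
}

int main() {
    // const char * const t
    QualType constChar{true, std::make_shared<Type>(Type{"char", {}})};
    QualType constPtr{true, std::make_shared<Type>(Type{"Pointer", constChar})};
    std::cout << show(constPtr) << "\n";  // const pointer to (const char)
}
```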

Type checking then proceeds via a truly shocking number of “isX” boolean functions you can find on `Type` in `AST/Type.h`. The general structure seems to be that if you need to examine the types in any depth, you will, for example, call `isBuiltinType()` and, if true, call `castAs<BuiltinType>()`. Or, I suppose, call `getAs<BuiltinType>()` and check for null.
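
Here’s a generic C++ illustration of those two idioms (toy classes of mine; Clang’s `castAs<>` asserts the cast is valid, while `getAs<>` returns null on failure):

```cpp
#include <iostream>
#include <memory>

// Generic illustration of the two idioms; toy classes, not Clang's.
struct Type {
    virtual ~Type() = default;
    virtual bool isBuiltinType() const { return false; }
};
struct BuiltinType : Type {
    bool isBuiltinType() const override { return true; }
};

int main() {
    std::unique_ptr<Type> t = std::make_unique<BuiltinType>();

    // Idiom 1: ask with an isX() predicate, then cast unconditionally
    // (in Clang, castAs<> asserts rather than blindly static_cast-ing).
    if (t->isBuiltinType()) {
        auto *bt = static_cast<BuiltinType *>(t.get());
        std::cout << "via isBuiltinType() + cast: " << bt->isBuiltinType() << "\n";
    }

    // Idiom 2: cast tentatively and test for null (the getAs<> style).
    if (auto *bt = dynamic_cast<BuiltinType *>(t.get()))
        std::cout << "via getAs-style null check: " << bt->isBuiltinType() << "\n";
}
```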

There’s quite a lot I haven’t explored, but I hope this exercise in tracking down things and grepping through the source has been interesting to someone.

## Lazy Webbing a Personal Cloud?

The shutdown of Reader left me… sad. Feedly has done okay as a replacement, though I’m not sure what their business plan is. I’d discovered Tiny Tiny RSS, which actually looks quite nice, and has the distinct advantage of staying up as long as you decide to keep it up. But… well, I’m not really a “host things myself” person at the moment. (Hello, from my wordpress.com blog!) And just an RSS reader isn’t enough to motivate me to start.

Drew Crawford has written a step by step guide to setting up a mail server, complete with IMAP, spam filtering, push, and encryption. It’s excellent. I’m posting pretty much to save that link for myself later. Because RSS and email is almost enough motivation.

So my friendly lazyweb, are there good open source replacements for Dropbox?

How about Google calendar or docs? (Or, since docs seems like a lot, maybe just the word processing part of it?) Contacts? (Actually wait, isn’t IMAP supposed to do that?)

Now that I think about it, I wonder how hard it is to set up a private VPN server.

And what host has the cheapest VM with decent disk space? (Since only you’d be accessing your own box, I feel like bandwidth, cpu, etc aren’t such a big deal.)

Also, what other neat stuff can you cram into your personal “cloud?” (After all, everything cool is a cloud now, right?)

Oh. And who’s working on the scripted/chef/puppet/flavor of the month version of Drew’s post? I know you exist.

Hey, the title of this post did say Lazy.

Update 2013-7-20: Saw this link on HN about replacing dropbox. Doesn’t quite cut it though. For one thing, it needs inotify, not “sleep 2”. For another, well, the android app isn’t open source. And, you know, other platforms exist, too.

Update 2013-8-20: Ansible scripts that set up a lot of this stuff. Nice!


## The use of closed nonterminals

Motivated by a modular well-definedness analysis for higher-order attribute grammars with forwarding that I developed (for which I just noticed I really need to get a pdf up), I added the concept of “closed nonterminals” to Silver this last year.

Nonterminals in Silver allow extensions to introduce new productions, but only so long as those productions forward back down to the host language.  Extensions are then free to introduce new synthesized attributes, confident that they can always be evaluated via forwarding.  It’s a nice way of solving a very, very strong form of the expression problem.

Closed nonterminals are introduced because we wanted to sometimes make the dual trade-off.  Instead, extensions are allowed to introduce new attributes, but only so long as those attributes have a default value.  Then, new (not necessarily forwarding) productions can be introduced by extensions, confident that if they lack an explicit equation, there will be a default value instead.

In some sense, this switches from a “datatype-like” form of extension to an “object-like” form of extension.  We were motivated to introduce this to handle transformations from concrete syntax tree to abstract syntax trees.  Without closed nonterminals, you’d have a really stupid time trying to write what your new concrete syntax productions have to forward to.  When they’re closed, you don’t have to bother with that, and instead you just can’t add any new truly meaningful attributes to the concrete syntax tree… which is fine because we don’t want to.
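
A loose analogy, in C++ rather than Silver (my own framing, so take it with salt): a closed nonterminal behaves like a base class whose virtual functions all have default bodies, so a new subclass (production) works without overriding anything, while adding a pure virtual function (an attribute with no default) would break extension subclasses:

```cpp
#include <iostream>
#include <string>

// Loose analogy: a closed nonterminal as a base class where every
// "attribute" is a virtual function with a default implementation.
struct Def {
    virtual ~Def() {}
    // Every attribute on a closed nonterminal must have a default value...
    virtual std::string prettyPrint() const { return "<some definition>"; }
};

// ...so an extension can add a new "production" without writing an
// equation for every attribute it inherits.
struct MyExtensionDef : Def {};

int main() {
    MyExtensionDef d;
    std::cout << d.prettyPrint() << "\n";  // falls back to the default
}
```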

At the time of introduction, I also noticed one other specific usage for closed nonterminals… as part of an environment pattern.  We’d write something like:

```
nonterminal Defs with ...
closed nonterminal Def with ...
```

to allow extensions to introduce entirely new kinds of definitions into the environment.

Since then, I’ve started to feel the urge to use them a little more often, nearly always in the same case of handling lists of things: Defs with Def, top-level Dcls with Dcl, etc.  However, I’ve resisted so far, because I wasn’t sure why this would be okay.  Perhaps down the road, we’d suddenly discover this was the wrong decision?  I’d want an explanation for why this is right, and not just convenient now, before I’d start to allow it to be used.  There was a good story for concrete syntax, and for the environment pattern, but not so far for much else.

Simply choosing to make all “list element” nonterminals closed all the time is definitely wrong: Stmts, for example, very very much does not mean Stmt can be a closed nonterminal.  It’d eliminate the ability of an extension to introduce any sort of new analysis.

But, I think I’ve come up with an explanation: any list that exists only to permit an unordered collection of syntax can have its “list element” be a closed nonterminal.  A good example is modifiers: “public static final,” “final public static,” who cares? The point is simply to recognize what’s there.

I’m still not completely convinced this is sufficient justification for using a closed nonterminal, but it appears to be a necessary condition for some particular class of valid uses.  Modifiers seems very safe, but the idea of making top-level declarations (which meet this criterion, at least for Silver itself) closed still makes me nervous…  The trouble is, we’re running into some extensions we want to write that demand it be closed, but I don’t yet know what sort of interesting extensions would be precluded by doing so.  Which, of course, isn’t evidence that there aren’t any. :)

## Extending C

I’m looking for a few good language extensions to C.

I want to demonstrate the practicality of composable language extensions, and to do that, I’m going to extend C with a bunch of neato stuff.  But I haven’t yet determined just what neato stuff that will be.

I have a number of good ideas motivated by C libraries.  For example:

• MPI. Parallelism is always good, and there should be quite a lot of lowish-hanging fruit for a language extension for MPI.
• Bison. Parsing is really well-addressed by DSLs, so this is an obvious choice.
• Lua. There’s a lot of regularity to the interface between these languages that could be very well addressed by a language extension.  This isn’t extending C with Lua, but rather an extension to C that makes using the Lua C API dramatically easier.
• OpenCL. I haven’t fully investigated OpenCL yet, but I very much imagine this will be as amenable to a language extension as MPI is. And GPU programming is another obvious hot topic.

However, all of these involve potentially complicated design work.  In the interest of ~~laziness~~ time-efficiency (and, as a bonus, to speak more directly to the research community), I’d instead like to implement language extensions others have already pioneered, but implement them as composable language extensions.  This way, the focus is on the advantages of this approach, and the particular extensions are already regarded as interesting.  Of course, this leaves me adrift: there are a lot of papers that extend C.  And I haven’t really paid attention to this area in the past, so I’m somewhat clueless as to which are actually regarded as good.

So, dear #lazyweb, what are your favorite papers that extend C?

## An algebra of language extension

I recently discovered (evidently I’m slow in keeping up with recent conferences) a nice paper from LDTA ’12 called “Language composition untangled [pdf].” I’m going to pull two concepts out of that paper, and talk about them a little bit more.

They (Erdweg, Giarrusso, Rendel) introduce two operators for talking about language composition.  (Well, more really, but I’m just interested in these two.)  The first they called “language extension” and they write $H \triangleleft E$ to represent the language formed by composing a host language H with an extension E that was written specifically for H.  There are two algebraic properties to note: it’s not commutative and it’s not associative.

The second operator they call “language unification” (I really don’t like this term, but oh well!) and they write $L_1 \uplus_g L_2$ to represent the language $L_1$ composed with a separate language $L_2$ using the “glue code” g.  There are two algebraic properties to note: it’s not associative, but it is commutative.

So far we have very little theory here.  We can at best say that $L_1 \uplus_g L_2 = L_2 \uplus_g L_1$. Not very exciting.  So let’s take a detour and ask ourselves “well, what do we want to say?”

Well, let’s start by relating the two operators.  Can we say they’re the same in some fashion? Well, that depends.  We can write something like $L_1 \uplus_g L_2 = L_1 \triangleleft (L_2 \cup g)$.  In essence this says we can turn a language unification into an extension by merging in the glue code with one or the other languages, which seems logical.  However, this introduces a new operator: that informal little $\cup$ that talks about merging in the glue code to a language.  So to admit this kind of equation, we need to force a property onto our language composition framework.  We need to be able to manipulate the specifications in this way, and consequently the specifications can only say things that CAN be manipulated in this way.  Okay, what else might we want?

Suppose we have two language extensions: $H \triangleleft E_1$ and $H \triangleleft E_2$.  We might want some ability to obtain a language with both extensions at the same time.  For example, $(H \triangleleft E_1) \uplus_g (H \triangleleft E_2)$.  That’s nice. We can write that.  But it does leave us with a little problem: we already had the two extended languages that worked just fine, but now all of a sudden to get them both to work together we need this pesky glue code $g$ to fix up the composition.  We can’t get a language with both extensions without it.  Well, let’s assert something about our language composition framework again:  if we have $H \triangleleft E_1$ and $H \triangleleft E_2$, let’s say we can obtain $H \triangleleft (E_1 \uplus_{\emptyset} E_2)$.  That is, for two extensions to a common host, we do “language unification” without any glue code ($g = \emptyset$).

There’s a nice property that $\uplus_{\emptyset}$ has, too: it’s associative.  In fact, it seems that it is also associative with any “language unification” operation: $(X \uplus_g Y) \uplus_{\emptyset} Z = X \uplus_g (Y \uplus_{\emptyset} Z)$ seems a little bit weird, but I think that’s because $\uplus_{\emptyset}$ is just plain weird in the first place.  It kinda only makes sense if you think about it in terms of its operands having some common host language, somewhere deep down.  Certainly, that’s the only way we’ve said you can obtain it.  Anyway, we also have a nice property that $L \uplus_{\emptyset} L = L$, which doesn’t necessarily hold if $g \neq \emptyset$.

Okay, what can we do with these properties?  Well, for one, we can compose any number of extensions to H: $H \triangleleft (E_1 \uplus_{\emptyset} E_2 \uplus_{\emptyset} E_3)$ is legit.  That’s neato.

Let’s look a little bit at $(H \triangleleft E_1) \uplus_{\emptyset} (H \triangleleft E_2)$.  We can rewrite this as $(H \uplus_g (E_1 \setminus g)) \uplus_{\emptyset} (H \uplus_h (E_2 \setminus h))$. Taking advantage of the associativity of $\uplus_{\emptyset}$ and commutativity of $\uplus_x$ in general, we can produce $((H \uplus_{\emptyset} H) \uplus_g (E_1 \setminus g)) \uplus_h (E_2 \setminus h)$, which (since $H \uplus_{\emptyset} H = H$) we can rewrite back to $(H \triangleleft E_1) \triangleleft E_2$.  And we can flip it the other way around, too.  Cool!
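
Spelled out as one chain (using $L \uplus_{\emptyset} L = L$ from above), the derivation is:

$$\begin{aligned}
(H \triangleleft E_1) \uplus_{\emptyset} (H \triangleleft E_2)
  &= (H \uplus_g (E_1 \setminus g)) \uplus_{\emptyset} (H \uplus_h (E_2 \setminus h)) \\
  &= ((H \uplus_{\emptyset} H) \uplus_g (E_1 \setminus g)) \uplus_h (E_2 \setminus h) \\
  &= (H \uplus_g (E_1 \setminus g)) \uplus_h (E_2 \setminus h) \\
  &= (H \triangleleft E_1) \triangleleft E_2
\end{aligned}$$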

Let’s try something different. Starting again with $(H \uplus_g (E_1 \setminus g)) \uplus_{\emptyset} (H \uplus_h (E_2 \setminus h))$, we can do the rewrites in a different direction and get $H \uplus_g (H \uplus_h ((E_1 \setminus g) \uplus_{\emptyset} (E_2 \setminus h)))$.  Let’s take some small liberties to write that as $H \uplus_{g \cup h} ((E_1 \uplus_{\emptyset} E_2) \setminus (g \cup h)) = H \triangleleft (E_1 \uplus_{\emptyset} E_2)$.  Awesome!  These things that seem to be the same, are!  Well… with a few liberties about how we treat that glue code, that is.

So it would seem that capability to obtain $H \triangleleft (E_1 \uplus_{\emptyset} E_2)$ lets us “lift” an extension to “depend” on another extension if we so choose.

Let’s get crazier.  Suppose we have $H \triangleleft E_1$, $H \triangleleft E_2$, $(H \triangleleft E_1) \triangleleft E_3$, and $(H \triangleleft E_2) \triangleleft E_4$.  Can we compose them all?  Well, we can certainly produce $H \triangleleft (E_1 \uplus_{\emptyset} E_2)$. And from there we can use the above to obtain $(H \triangleleft E_1) \triangleleft E_2$, and of course with that we can obtain $(H \triangleleft E_1) \triangleleft (E_2 \uplus_{\emptyset} E_3)$.  Reversing direction, we can get right back to $H \triangleleft (E_1 \uplus_{\emptyset} E_2 \uplus_{\emptyset} E_3)$, and from there it’s obvious we’ll be able to manipulate in $E_4$ as well.

Alright, so what’s the point of all this?  Well, I found it interesting to try to think about what sort of properties we need to be able to do the kind of complicated composition in the last paragraph.  I’m not sure I’ve formulated things well enough yet, but so far we needed:

• The ability to produce $H \triangleleft (E_1 \uplus_{\emptyset} E_2)$ from $H \triangleleft E_1$ and $H \triangleleft E_2$.
• The equalities $H \triangleleft (E_1 \uplus_{\emptyset} E_2) = (H \triangleleft E_1) \uplus_{\emptyset} (H \triangleleft E_2) = (H \triangleleft E_1) \triangleleft E_2 = (H \triangleleft E_2) \triangleleft E_1$, which, although glue code does not even appear in them, seem to boil down to some ability to manipulate glue code like a set.

Our little attribute grammar playground, Silver, gets the second property easily, since for attribute grammars you pretty much can define $X \triangleleft Y = X \cup Y$ and $X \uplus_g Y = X \cup Y \cup g$. Although this does mean extensions (and glue code) can’t go do anything like delete from a host language. Or overwrite equations from the host language. (Well, we never had that in the first place! Hooray!)

The tricky part is the first property. That composition has to, you know, work. And, like, not just be a broken compiler.  Getting that first property will be my thesis. Wheeee!

## Violating encapsulation — can it work out?

If you ask ten people to define object-oriented programming, you’ll probably get ten different answers.  While it’s probably not one of the defining features of OO (my answer: dynamic dispatch, object identity, and yes, mutable state), certainly one of the central concepts is encapsulation.

It’s pretty easy to see why encapsulation is useful generally (although it can become muddled when people start piling on getters and setters.) So one of the major puzzlers an OO programmer might experience when trying to actually use a functional language (like Haskell, OCaml, or F#) is that you don’t see encapsulation at first. In fact, algebraic data types seem to deliberately and completely violate encapsulation!

And… it all seems to work out? How should the OO programmer make sense of this, when it’s often deemed extremely important in OO land?

## Meaningful types

One possibility is simply that programmers are encouraged to give types stronger meanings.  Ask programmers how to represent money. A decent programmer is going to say “an integer.”  A poor programmer (for most domains) will say “a float” or “a string!”  A Haskell programmer might reply “Cents.”

`newtype Cents = Cents Int`

Since the immediately googlable explanations of newtype are terrible, I’ll give one of my own here.  Typically, any given type from a library has awful semantic content.  An ‘Int’? Of what? In what units? The type doesn’t say.  An ‘[Int]’? What does that mean, besides a list of integers?  The more you have to rely on context to get meaning, the more likely a programmer is to make mistakes from missed or forgotten contextual information.

Newtype lets you create a type that you can then decree to have some semantic content: ‘newtype Cents = Cents Int’, and then document the meaning of the type as “cents in US currency.”  This new type then carries its meaning explicitly, instead of implicitly as context the programmer must remember.

And it’s not just a type alias. You’re forced to do a little song and dance to make the type checker happy, reminding yourself of this meaning everywhere you use it by wrap/unwrapping (zero runtime cost, but you’re not supposed to care yet) what you’re working with.  For example, spot the errors in the following context-less piece of code:

```
addDollars :: Cents -> Int -> Cents
addDollars (Cents cents) dollars = Cents (cents + dollars)
```

But consider if it had been

```
addDollars :: Cents -> Int -> Cents
addDollars (Cents asdf) qwerty = Cents (asdf + qwerty)
```

Choosing descriptive variable names is wise. But choosing descriptive types may be more important. All we’re missing is a Dollars type, and the awful variable names here wouldn’t have even mattered.
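
For the OO programmers this post addresses, the closest C++ rendering of the trick is a “strong typedef.” A minimal sketch of my own, with the missing Dollars type and the unit conversion filled in:

```cpp
#include <iostream>

// Rough C++ rendering of the newtype trick: wrapping the int means
// Cents and Dollars can no longer be silently mixed up.
struct Cents   { int value; };
struct Dollars { int value; };

Cents addDollars(Cents c, Dollars d) {
    // The unit conversion is now hard to forget: the types nag you.
    return Cents{c.value + d.value * 100};
}

int main() {
    Cents total = addDollars(Cents{250}, Dollars{3});
    std::cout << total.value << " cents\n";   // 550 cents
    // addDollars(Dollars{3}, Cents{250});    // would not compile
}
```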

The benefits of this start to compound significantly once you’re working with data that’s aggregated together somehow.  Ponder the difference between these two functions:

```
newRect :: Int -> Int -> Int -> Int -> Rectangle
newRect :: Point2D -> Width -> Height -> Rectangle
```

It’s compiler-checked documentation. It immediately raises a possible code smell any time you see ‘newRect (Point2D x y) (Width h) (Height w)’ in your code. It’s damn invaluable as a refactoring tool. And so on.

## No null, yes garbage collection, no mutation

I think, in this century, most people can appreciate the advantages of garbage collection. Without it, ANY sort of interesting data structure must somehow suffer from managing state: whether any given pointer is currently valid, invalid, or null; “who” is responsible for freeing a pointer, once allocated (a question that implicitly propagates through the program); and so forth. Obviously, not having GC increases the frailty of the data representation, and it’s then extremely helpful to encapsulate the data to help manage these invariants. (Of course, then there are things like the various smart pointers in C++ that can help deal with these small concerns.)

But, even in a language like Java, any reference may still be null, and this still presents many of the same state-managing problems to a data structure that might not need it otherwise.  After all, if full implementation details are exposed, then the user is perfectly capable of constructing a data structure with nulls in inappropriate places. And frequently, there’s no indication of what’s an appropriate or inappropriate place, except that some will raise null pointer exceptions sometimes and others will be handled just fine.

Or worse, given that OO languages typically have unrestricted mutation, they might modify the content of a data structure in any way — and this part is important — invisibly.  Call a function, pass it a reference to a data structure. Will that data structure get modified?

In a functional setting, of course not!  Even in the ML family, which does not require stateful functions to involve monads or anything of the sort, you know that a value of a type without any ‘ref’s in it won’t change.  In Java it’s hopeless, but in C++ you might get away with ‘const.’  That is, if your users don’t hate you for using const, because there’s so much const-incorrect code out there they have to work with.  Oh well, they can just cast it away, hahahaha… sigh.

The end result of all this is that the values of a type in a functional language are actually just values, and not state representations.  Yes, it’s still quite possible to construct nonsense.  But, it’s a lot more obvious when and where values are getting constructed.  There are stronger invariants on what a value might look like (we only have to worry about the values that a value is composed of, not null references, not pointer ownership). And with newtype, we can be sure those values have an obvious meaning to the programmer who is using or constructing them.  To construct nonsense, you pretty much have to know you’re constructing nonsense.

## Representation

Another somewhat interesting aspect of why this works is that (duh?) the features to make it work are also present.  Any time you have a data structure where a value may be one of several different constructions, an OO language forces you to make use of subtyping (each construction is a child class.)  And your base type typically has few internal details, instead there’s simply a set of functions the subtypes should appropriately implement, and you don’t know which subtype it really is, so you don’t have access to those details.  You’re almost forced into encapsulation.

Algebraic data types, on the other hand, fix the list of constructors, so the details are right there. And pattern matching actually makes using those details quite pleasant.  I’d wager that nearly every situation where we see a class with a lot of getters/setters is one just screaming out for data types instead of objects.

## Finally, we do encapsulate!

Because it is actually a good idea!  The major difference seems to be that the OO style forces you to encapsulate immediately, which isn’t always the best thing.  All manner of crap has been invented in the OO world to violate encapsulation in limited ways.  Java’s “default” protection modifier, unfortunate things like ‘friend’ classes, sometimes reflection, and so forth.  The public method or class with documentation that just says “internal API – do not use” is hardly unheard of!

The unit of encapsulation isn’t always one type! So these things become necessary.

In a functional setting, you work with internal details exposed… until you hit a boundary (often a module) where you decide it’s time to hide it.  Then you have a type class, or a module signature (depending on whether you’re in a Haskell-like or ML-like language. Have you ever been in a language?) that defines an interface, much as you would see in an OO interface.

Then, your module hides the implementation details. Users of the module don’t see what they shouldn’t see. Tadaa!

## The bottom line

A common theme celebrated by functional programmers is that their programs “just work” after the type checker is satisfied.  I think the biggest part of this is controlled state, and I think many would agree with this.

But I think the second biggest part of it is this exposed representation — internally to a module, of course.  One of the problems with encapsulation and abstraction is that we nearly always end up making abstractions leaky.  Real programs are messy, there are unfortunate corner cases, programmers make mistakes, good design is hard, and so on and so forth.

When you’re working in a module with exposed representation, you’re basically programming at the “metal” of what you’re really doing.  (Pause a moment to let the C programmers in the room stop laughing at my use of that metaphor for writing Haskell, please.)  There aren’t any leaky abstractions, except perhaps those from other modules you’re using.  What it is, is what it is.  It’s not bytes and pointers, and you may not know exactly what the generated assembly is, but you can have an equally good (or better!) idea of what’s going to happen when you run it (if you’re willing to temporarily ignore a couple of important operational details, like how much memory it will consume.  Correctness first, optimization second!)

As a result, not only are you less likely to forget some detail and mistakenly trust a leaky abstraction, because what you’re actually doing is right there in front of you, but the language has been designed around helping you remind yourself of all those details.

So that’s 1400 words I hope is interesting to someone.

I’ve been thinking about all this recently because I’m working on extensible languages, specifically the meta-programming language Silver. (We’re also in the very early stages of building a google code project site for Silver. That link probably isn’t useful now, but probably by 2012…) One of the potential pitfalls is that, to really allow a compiler to be extensible, a lot of the implementation details need to be exposed so an extension can… well, do its thing with them.  So a major question we have to be able to answer is: why isn’t this just a completely horrible spaghetti code scenario?

My hope is that we can get enough compiler- and language-enforced properties that:

• We can completely and totally rule out MANY serious ways in which an extension can misbehave. Much like functional languages do with purity and data types, but even stronger properties are needed.
• We can be confident that, for the few ways that are impossible to rule out, we get the same kind of “experience” that functional programmers do living dangerously with internal implementation details.  That is, making sure those details are brought to the programmer’s attention when coming anywhere close to them.
• We can automatically gather ALL relevant information about a point of possible misbehavior, not missing any important details, so an IDE, for example, can display it for the programmer. Or just so the programmer has some straightforward idea of where to look for these things themselves. Or at the very least so we have some hope of categorizing them so they can be tackled by a few standard coding conventions.
• We can still do encapsulation when appropriate.

Of course, the real goal is to make language extension seamless and problem free for the user of the extended language. If the user can’t be sure that trying to use extensions X and Y together won’t make the compiler explode, it’s probably game over for trying to convince anyone to use extensions at all.

But much like laziness was a kind of “forcing function” for purity in Haskell, I suspect making sure things can’t devolve into spaghetti code will be a valuable “forcing function” for ensuring language extensions will “just work” for the end users.

Okay, now that’s 1900 words I hope is interesting to someone.

## Project hosting with test framework integration?

So I’ve been looking around at open source project hosting websites, and there’s one thing I think is really, desperately missing from nearly all of them (github, bitbucket, and google code; they all look equally good to me).

Test framework integration.

Now, I realize they can’t offer to run the tests for me (too limited/abusable), but why not just write down a format, and let me upload test results with an API? That’d still provide:

• A nice web interface to test result information. (“4 failures” *click* “oh, those.”)
• History of failures / fixes; differences between branches.
• One click link to the test code that failed.
• Maybe some integration with issue tracking (by noting in the test metadata what bugs they’re related to)
• Even better: provide not just for true/false success/failure, but also numerical data, like performance, so benchmarks can be tracked over time.
• All kinds of other wonderful things I’m too lame to think of.

Of course, there’s another problem: I can’t even find any stand-alone tools that let me do this.  There’s all kinds of tools out there for bug/task management; where are the test result database tools?

Isn’t this a dead-obvious thing every project should have these days, like version control and issue tracking? What am I missing?

(I know TAP exists, but it doesn’t really provide any of the necessary metadata in its protocol, and none of the TAP consumers seem to provide any of the above features, that I saw.)

Update: It appears the keywords I was missing were “continuous integration.” I’ve now found TeamCity, Hudson/Jenkins, etc. So let me shift the question: which of these do you prefer, and which one accepts some reasonable format of test output that I can give it?  My tests are in our own test framework by necessity, since they’re all in our own language…

Although, the above still stands: those project hosting websites should offer an API that a continuous integration server can talk to.