## An algebra of language extension

I recently discovered (evidently I’m slow in keeping up with recent conferences) a nice paper from LDTA ’12 called “Language composition untangled [pdf].” I’m going to pull two concepts out of that paper, and talk about them a little bit more.

They (Erdweg, Giarrusso, Rendel) introduce two operators for talking about language composition.  (Well, more really, but I’m just interested in these two.)  The first they called “language extension” and they write $H \triangleleft E$ to represent the language formed by composing a host language H with an extension E that was written specifically for H.  There are two algebraic properties to note: it’s not commutative and it’s not associative.

The second operator they call “language unification” (I really don’t like this term, but oh well!) and they write $L_1 \uplus_g L_2$ to represent the language $L_1$ composed with a separate language $L_2$ using the “glue code” g.  There are two algebraic properties to note: it’s not associative, but it is commutative.

So far we have very little theory here.  We can at best say that $L_1 \uplus_g L_2 = L_2 \uplus_g L_1$. Not very exciting.  So let’s take a detour and ask ourselves “well, what do we want to say?”

Well, let’s start by relating the two operators.  Can we say they’re the same in some fashion? Well, that depends.  We can write something like $L_1 \uplus_g L_2 = L_1 \triangleleft (L_2 \cup g)$.  In essence this says we can turn a language unification into an extension by merging in the glue code with one or the other languages, which seems logical.  However, this introduces a new operator: that informal little $\cup$ that talks about merging in the glue code to a language.  So to admit this kind of equation, we need to force a property onto our language composition framework.  We need to be able to manipulate the specifications in this way, and consequently the specifications can only say things that CAN be manipulated in this way.  Okay, what else might we want?

Suppose we have two language extensions: $H \triangleleft E_1$ and $H \triangleleft E_2$.  We might want some ability to obtain a language with both extensions at the same time.  For example, $(H \triangleleft E_1) \uplus_g (H \triangleleft E_2)$.  That’s nice. We can write that.  But it does leave us with a little problem: we already had the two extended languages that worked just fine, but now all of a sudden to get them both to work together we need this pesky g glue code to fix up the composition.  We can’t get a language with both extensions without it.  Well, let’s assert something about our language composition framework again:  if we have $H \triangleleft E_1$ and $H \triangleleft E_2$, let’s say we can obtain $H \triangleleft (E_1 \uplus_{\emptyset} E_2)$.  That is, for two extensions to a common host, we do “language unification” without any glue code ($g = \emptyset$)

There’s a nice property that $\uplus_{\emptyset}$ has, too: it’s associative.  In fact, it seems that it is also associative with any “language unification” operation: $(X \uplus_g Y) \uplus_{\emptyset} Z = X \uplus_g (Y \uplus_{\emptyset} Z)$ seems a little bit weird, but I think that’s because $\uplus_{\emptyset}$ is just plain weird in the first place.  It kinda only makes sense if you think about it in terms of its operands having some common host language, somewhere deep down.  Certainly, that’s the only way we’ve said you can obtain it.  Anyway, we also have a nice property that $L \uplus_{\emptyset} L = L$, which doesn’t necessarily hold if $g \neq \emptyset$.

Okay, what can we do with these properties?  Well, for one we can compose any number of extensions to H$H \triangleleft (E_1 \uplus_{\emptyset} E_2 \uplus_{\emptyset} E_3)$ is legit.  That’s neato.

Let’s look a little bit at $(H \triangleleft E_1) \uplus_{\emptyset} (H \triangleleft E_2)$.  We can rewrite this as $(H \uplus_g (E_1 \setminus g)) \uplus_{\emptyset} (H \uplus_h (E_2 \setminus h))$. Taking advantage of the associativity of $\uplus_{\emptyset}$ and commutativity of $\uplus_x$ in general, we can produce $((H \uplus_{\emptyset} H) \uplus_g (E_1 \setminus g)) \uplus_H (E_2 \setminus h)$, which we can rewrite back to $(H \triangleleft E_1) \triangleleft E_2$.  And we can flip it the other way around, too.  Cool!

Let’s try something different, starting again with $(H \uplus_g (E_1 \setminus g)) \uplus_{\emptyset} (H \uplus_g (E_2 \setminus h))$, we can do the rewrites in a different direction and get $H \uplus_g (H \uplus_h ((E_1 \setminus g) \uplus_{\emptyset} (E_2 \setminus h)))$.  Let’s take some small liberties to write that as $H \uplus_{g \cup h} (E_1 \uplus_{\emptyset} E_2 \setminus (g \cup h)) = H \triangleleft (E_1 \uplus_{\emptyset} E_2)$.  Awesome!  These things that seem to be the same, are!  Well… with a few liberties about how we treat that glue code, that is.

So it would seem that capability to obtain $H \triangleleft (E_1 \uplus_{\emptyset} E_2)$ lets us “lift” an extension to “depend” on another extension if we so choose.

Let’s get crazier.  Suppose we have $H \triangleleft E_1$, $H \triangleleft E_2$, $(H \triangleleft E_1) \triangleleft E_3$, and $(H \triangleleft E_2) \triangleleft E_4$.  Can we compose them all?  Well, we can certainly produce $H \triangleleft (E_1 \uplus_{\emptyset} E_2)$. And from there we can use to above to obtain $(H \triangleleft E_1) \triangleleft E_2$, and of course with that we can obtain $(H \triangleleft E_1) \triangleleft (E_2 \uplus_{\emptyset} E_3)$.  Reversing direction, we can get right back to $H \triangleleft (E_1 \uplus_{\emptyset} E_2 \uplus_{\emptyset} E_3)$, and from there it’s obvious we’ll be able to manipulate in $E_4$ as well.

Alright, so what’s the point of all this?  Well, I found it interesting to try to think about what sort of properties we need to be able to do the kind of complicated composition in the last paragraph.  I’m not sure I’ve formulated things well enough yet, but so far we needed:

• The ability to produce $H \triangleleft (E_1 \uplus_{\emptyset} E_2)$ from $H \triangleleft E_1$ and $H \triangleleft E_2$.
• The equalities $H \triangleleft (E_1 \uplus_{\emptyset} E_2) = (H \triangleleft E_1) \uplus_{\emptyset} (H \triangleleft E_2) = (H \triangleleft E_1) \triangleleft E_2 = (H \triangleleft E_2) \triangleleft E_1$, which although they do not even mention it, seem to boil down to some ability to manipulate glue code like a set.

Our little attribute grammar playground, Silver, gets the second property easily, since for attribute grammars you pretty much can define $X \triangleleft Y = X \cup Y$ and $X \uplus_g Y = X \cup Y \cup g$. Although this does mean extensions (and glue code) can’t go do anything like delete from a host language. Or overwrite equations from the host language. (Well, we never had that in the first place! Hooray!)

The tricky part is the first property. That composition has to you know, work. And like, not just be a broken compiler.  Getting that first property will be my thesis. Whheeee!