I occasionally get ideas that I don’t have the time or opportunity to pursue, so I thought I’d organize and write about one of them. Synthetic biology is an extremely interesting area, which might have caused me to take a different career path if I’d been born a little later, or had known more about it when I was younger. A few years back, some clever people managed to create the first 100% synthetic genome (as in “we turned bits and nucleotides into DNA”) and insert it into a cell, where it thrived.
It’s interesting to think about what the end-game of this technology is. There are multitudes of problems solvable by designing novel organisms in a computer and synthesizing them, even with a relatively crude understanding of biochemistry. So there’s a question closer to my area: what would a programming language for organisms look like?
There are a few things we can answer with relative certainty. We can look at the question from three approaches: top-down, bottom-up, and black-box. What, about programming languages in general, can we say with reasonable certainty will apply to even a biological programming language? And how far can we go from what the result must be (DNA) backwards to what the language must be like? And regardless of what the programming language looks like, what properties must be true about it?
The language will be grounded in type theory. This might (to someone not familiar with PL theory) sound a bit unreasonable, since we still live in a world where most computer programming languages aren’t grounded in type theory, yet. But type theory is the fundamental theory of programming languages (and they’re hard at work extending that to logic and all of mathematics), so it’s not too controversial from that perspective. The only real reason our current languages aren’t grounded in type theory is that they are a mess of historical accidents.
If this seems hard to imagine, the next probable property might help: the ambient monad for a biological language will be totally different from that of computer programming languages. (That is, different from languages for Von Neumann/Turing machines.) Most of our programming languages pervasively allow things like state, mutation, and IO. Even Haskell, with its reputation for being “pure,” pervasively permits non-termination of computations. A biological language will likely share none of these “ambient monads”, though I can make no guesses at the moment as to which ones we might find convenient to use. (I suppose there will be a sort of ambient state for the cell, but it will be different from how we typically think of state in current languages.)
If that’s still hard to believe, let me try it a different way: a type theory that permits no effects is really just another way of writing down data structures. Once you start wanting to describe a complicated enough data structure, you start wanting more and more abstraction, modularity, and composition features until you end up with a programming language. And that should (darn it) be grounded in type theory.
So next, bottom-up: the language must compile down to DNA. Of course. But we can also draw a nice line in the sand between regulatory genes and (I don’t know of a word for “non-regulatory gene” so I’ll call them) IO genes. I don’t know enough biology to know if “real” (i.e. evolved) organisms obey anything even remotely like a nice separation rule between these two kinds of genes (and evolution suggests they almost certainly do not), but it doesn’t matter. Just as a C compiler cannot generate all possible assembly programs, we can live with our bio language only generating genomes that are “well-behaved” in this way.
But this means that IO genes are the “foreign functions” of the language, and 100% of the purpose of our “program” is to generate the regulatory genes. We’ll almost certainly never write a program that somehow compiles down to the genes for constructing “chlorophyll a”. That’s too complicated. Too much chemistry, and the algorithms involved in figuring out a chemical structure like that are complex. (In the “complexity theory” sense.) You don’t want a compiler solving a problem like that; you want to solve it once and for all, study it carefully, and then re-use the result. Happily, this means evolution gives us tons of building blocks right from the start.
The regulatory side is perfectly reasonable, though. We can decide how we’re going to go about promoting and suppressing expression of IO genes, and then generate a set of regulatory genes that does this exactly according to the program. Again, we’re taking advantage of the fact that we don’t need to be able to reproduce all the different kinds of regulation that have evolved. We only need enough that we can regulate gene expression at all.
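To make the division of labor concrete, here is a minimal sketch of what “generate regulatory genes according to the program” could mean, at the most abstract level. Everything in it is an assumption made up for illustration: the `Rule`, `RegulatoryGene`, and `compile_program` names, the idea that a program is a flat list of promote/suppress rules, and the gene name used in the usage note. It emits abstract records rather than anything resembling a DNA sequence.

```python
from dataclasses import dataclass
from enum import Enum


class Effect(Enum):
    PROMOTE = "promote"
    SUPPRESS = "suppress"


@dataclass(frozen=True)
class Rule:
    """One line of the hypothetical source program."""
    signal: str     # name of a condition inside the cell (hypothetical)
    effect: Effect  # what to do to the target when the signal is present
    target: str     # name of a foreign IO gene


@dataclass(frozen=True)
class RegulatoryGene:
    """Abstract stand-in for a generated regulatory gene; not a real sequence."""
    name: str
    responds_to: str
    effect: Effect
    target: str


def compile_program(rules, foreign_genes):
    """Emit one abstract regulatory gene per rule.

    Rules may only target genes declared as foreign IO genes, much as a
    compiler rejects calls to undeclared foreign functions.
    """
    generated = []
    for index, rule in enumerate(rules):
        if rule.target not in foreign_genes:
            raise ValueError(f"unknown foreign gene: {rule.target}")
        generated.append(
            RegulatoryGene(f"reg_{index}", rule.signal, rule.effect, rule.target)
        )
    return generated
```

For example, `compile_program([Rule("light", Effect.PROMOTE, "chlorophyll_a_pathway")], {"chlorophyll_a_pathway"})` yields a single abstract gene named `reg_0`. The compiler never looks inside the foreign gene; it only needs to know the name exists.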
Foreign “IO” genes are what the name suggests: both inputs and outputs of the program. That is, some of these will be pure sensing: they detect something about the environment of the cell and cause a change in gene expression. Meanwhile, others will be pure output: they will cause something physical to happen only when they are expressed, but will cause no (direct) effects on gene expression. Others may be both. But this is not the only sensing that can go on: many functional parts of the cell (for example, stoma) will “sense” purely within their own chemistry, without being directly controlled by gene expression.
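The three kinds of IO gene described above could be written down as a small data type. This is only a sketch: `ForeignGene`, `Role`, and the idea of describing a gene by the set of signals it senses and the set of physical effects it causes are all my own assumptions, not anything taken from biology.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    SENSOR = "sensor"  # pure input: environment changes gene expression
    OUTPUT = "output"  # pure output: expression causes a physical effect
    BOTH = "both"


@dataclass(frozen=True)
class ForeignGene:
    """Hypothetical declaration of a foreign IO gene."""
    name: str
    senses: frozenset   # environmental signals that alter gene expression
    effects: frozenset  # physical effects caused when the gene is expressed

    @property
    def role(self):
        """Classify the gene by which of the two sets is non-empty."""
        if self.senses and self.effects:
            return Role.BOTH
        if self.senses:
            return Role.SENSOR
        if self.effects:
            return Role.OUTPUT
        raise ValueError(f"{self.name} neither senses nor acts")
```

A gene declared with only `senses` filled in is a pure sensor, one with only `effects` is a pure output, and a declaration with neither is rejected, since it would not be an IO gene at all.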
The regulatory genes generated by the compiler will be intra-cell only. Probably. It’s possible to rely on “foreign IO” genes to accomplish communication with the environment, including other cells. And this is likely a good idea, because there are a lot of different ways cells can communicate, so it might be unwise to try to fix a few in stone and bake them into the language.
Metadata will be associated with all foreign genes. We’ll want to be able to simulate our programs in the machine, in order to debug and test them. To do that, we need to be able to abstract far away from the actual chemical machinery of the cell, because otherwise it is totally computationally infeasible. Even if inter-cell communication is part of the core of the language and thus does not need to be part of the metadata for foreign genes, we’ll still want to be able to run statistics on things like oxygen exchange, to make sure no cells will be starved and things like that. Since these are the result of the physical effects of expressed genes (i.e. the IO of the cell), we’ll need information on what those effects will be to simulate them without having to resort to simulating the chemistry.
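As a rough illustration of simulating from metadata instead of chemistry, consider the oxygen example. The sketch below assumes, purely for the sake of argument, that each foreign gene’s metadata records a fixed net oxygen change per simulation step; the `GeneMetadata`, `simulate_oxygen`, and `is_starved` names and all of the numbers are hypothetical. A real model would surely need far more than one number per gene.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneMetadata:
    """Hypothetical summary of a foreign gene's physical effect when expressed."""
    name: str
    oxygen_per_step: float  # net oxygen change per step; negative = consumption


def simulate_oxygen(expressed, metadata, initial_oxygen, steps):
    """Track one cell's oxygen level using only gene metadata.

    `expressed` lists the names of the foreign genes currently expressed.
    Returns the oxygen level after each step, clamped at zero.
    """
    net = sum(metadata[name].oxygen_per_step for name in expressed)
    levels = []
    level = initial_oxygen
    for _ in range(steps):
        level = max(0.0, level + net)
        levels.append(level)
    return levels


def is_starved(levels):
    """A cell counts as starved if its oxygen ever reaches zero."""
    return any(level == 0.0 for level in levels)
```

The point of the design is that the simulator only ever reads the metadata table. Swapping in a different foreign gene means supplying a new `GeneMetadata` entry, not a new chemical model.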
So, finally, if it’s not obvious, I should note that I’m no biologist. This is just interesting. So with these few ideas in mind, the next question is: what’s the first step? If we designed a prototype language of this sort, we’d probably want to follow the work on synthetic genomes. Take the first synthetic genome, separate it into regulatory and “IO” genes as best we can, and then rewrite all the regulatory parts within the language, operating on the IO genes as “foreign functions”. Or at least, do so for a small part of it at first. (After all, a “trivial” program exactly reproducing the organism would consist of all the current genes as IO genes, and no program code at all. So we can start with parts and grow to the whole.)
Next, compile it, put it into a cell, and see if the new-but-not-really-different organism manages to survive and behaves the same. This also happens to be basic science: you’d be verifying your understanding of the regulatory network’s behavior by creating a totally synthetic gene regulatory network.
And then, as the technology to synthesize genomes becomes easier, and the loop between “design genome -> test organism -> measure results” becomes tighter, the scientific opportunities start to explode.