A newbie’s guide to understanding Clang AST construction

I’ve just jumped into reading Clang code a bit, and these are some notes on how it works. Accuracy completely disavowed, I’m obviously no expert on Clang internals! (But, you know, corrections welcome.) I’m mostly interested in C, so I’ll probably skip some stuff important for C++ (and ObjC) here.

A high-level overview

The general structure for compiling a file looks standard, something like “Lex -> Parse -> Sema -> AST -> whatever” where “whatever” might be something like “CodeGen”. These are all names of modules you’ll find under clang/lib/ and the corresponding clang/include/clang/.

Lex and Parse have mostly the obvious responsibilities, with the exception that Parse appears to handle management of scopes (for name lookup). Sema is a module that tries to separate out from Parse the semantic analysis and AST construction code. Sema is pretty much a bunch of functions, called by Parse, that do type checking and AST construction.

That’s right: type checking is done during parsing of the source file, and not performed on the AST. I’m assuming this is an awful requirement imposed by C++ syntax. (And Parse handling scopes is probably the result of C syntax and the typedef hack.)

On the bright side, it does mean that ASTs, once constructed, have all type information already filled out. (For those familiar with the “AST typing problem” typically discussed in functional languages, Clang is essentially resolving this by not having a tree data structure without full type information at all.)

Scope overview

The scoping mechanism is somewhat nifty. To see an example, look for ParseForStatement in Parse/ParseStmt.cpp. To open a scope, it first decides what kind (just continue/break scope, or also including declarations) and then creates a ParserScope object, which does the following:

/// ParseScope - Introduces a new scope for parsing. The kind of
/// scope is determined by ScopeFlags. Objects of this type should
/// be created on the stack to coincide with the position where the
/// parser enters the new scope, and this object's constructor will
/// create that new scope. Similarly, once the object is destroyed
/// the parser will exit the scope.

It’s just a lifetime management object: the scope opens where the constructor runs, and closes either explicitly, or when the destructor is called.

Related to scoping, of course, is looking up identifiers. I’ll avoid the full details, let’s just look at how ActOnIdExpression in Sema/SemaExpr.cpp works. First, look up results are stored in a LookupResult (Sema/Lookup.h and Sema/SemaLookup.cpp), but actual lookups are done by function call e.g. LookupParsedName (Sema/SemaLookup.cpp).

That just calls LookupName, which gets an iterator from an IdResolver which, you know what? Let’s not go further on that. Anyway, yeah, lookups happen, and at least you can see the Scope being passed into LookupParsedName.

Typechecking overview

I won’t look too much at the diagnostic system. It’s neat, if you want to check it out yourself. Handles things like translations (I assume), and emitting a relevant snippet of the original code, to give context to an error.

The first thing to do, when trying to show an example of type checking, is to think of something in C that isn’t type correct. This is a trick in the land of C where if(0["Hello, world!"]); is perfectly okay.

So, let’s go with structs. if(expr_of_type_struct_foo) ought to be an error, so long as that’s not a pointer, anyway. Looking at ActOnIfStmt goes nowhere, unfortunately, so we back up to looking at the control flow for ParseIfStatement and discover it calls ActOnBooleanCondition on the expression first.

This leads to CheckBooleanCondition and we find that it:

  1. transforms the expression with DefaultFunctionArrayLvalueConversion, and
  2. checks isScalarType() on the type of the resulting expression.

This leaves us with looking at the default conversions, but C’s rules are complex and, well:

ExprResult Sema::DefaultFunctionArrayConversion(Expr *E) {
ExprResult Sema::DefaultLvalueConversion(Expr *E) {
ExprResult Sema::DefaultFunctionArrayLvalueConversion(Expr *E) {
ExprResult Sema::UsualUnaryConversions(Expr *E) {
ExprResult Sema::DefaultArgumentPromotion(Expr *E) {
ExprResult Sema::DefaultVariadicArgumentPromotion(Expr *E, VariadicCallType CT,

There they are in Sema/SemaExpr.cpp.

AST overview

Let’s pick function declarations as our example. Here’s a Create function for a FunctionDecl (found in AST/Decl.h):

static FunctionDecl *Create(ASTContext &C, DeclContext *DC,
                            SourceLocation StartLoc,
                            const DeclarationNameInfo &NameInfo,
                            QualType T, TypeSourceInfo *TInfo,
                            StorageClass SC,
                            bool isInlineSpecified,
                            bool hasWrittenPrototype,
                            bool isConstexprSpecified = false);

Here are a few notes about it:

  • ASTContext appears to be relevant to memory management only: it’s essentially a region that owns all AST node memory.

  • DeclContext is how contextual information is handled in the AST. The children have a pointer to their parent. We can find out what’s relevent (edited slightly:)

    tedinski@smorgasbord:~/repos/llvm/tools/clang/include/clang$ egrep -R "class.*:.*public DeclContext" `find . -name "*.h"`
    ./AST/DeclCXX.h:class LinkageSpecDecl : public Decl, public DeclContext {
    ./AST/Decl.h:class TranslationUnitDecl : public Decl, public DeclContext {
    ./AST/Decl.h:class NamespaceDecl : public NamedDecl, public DeclContext, 
    ./AST/Decl.h:class FunctionDecl : public DeclaratorDecl, public DeclContext,
    ./AST/Decl.h:class BlockDecl : public Decl, public DeclContext {
    ./AST/Decl.h:class CapturedDecl : public Decl, public DeclContext {
    ./AST/DeclObjC.h:class ObjCMethodDecl : public NamedDecl, public DeclContext {
    ./AST/DeclObjC.h:class ObjCContainerDecl : public NamedDecl, public DeclContext {

    That gives us a good idea of what our DeclContext means.

  • SourceLocation The actual representation of this is annoyingly complicated, but it identifies a specific character in a specific file, plus extra information, like include call stack, etc. I’m going to ignore the details but infer from the name “StartLoc” that it’s the start of the declarator. I tried to verify this and fell into a rabbit hole.

  • DeclarationNameInfo Before we figure out what this is, let’s back up a second. What’s a FunctionDecl?

    class FunctionDecl : public DeclaratorDecl, public DeclContext,
                         public Redeclarable 
    class DeclaratorDecl : public ValueDecl 
    class ValueDecl : public NamedDecl 
    class NamedDecl : public Decl 

    Okay. What do each of these progressively want?

    Decl(Kind DK, DeclContext *DC, SourceLocation L)
    NamedDecl(Kind DK, DeclContext *DC, SourceLocation L, DeclarationName N)
    DeclaratorDecl(Kind DK, DeclContext *DC, SourceLocation L,
                   DeclarationName N, QualType T, TypeSourceInfo *TInfo,
                   SourceLocation StartL)

    Okay, let’s start with DeclarationName

    DeclarationName(const IdentifierInfo *II)

    IdentifierInfo is gross, and I don’t want to look at it. I hope it’s just an artifact of trying to be super-efficient. But it seem to just be a name, plus some other info of (to me) unknown importance that’s hopefully not relevant to C. On to DeclarationNameInfo:

    struct DeclarationNameInfo {
    private:
      /// Name - The declaration name, also encoding name kind.
      DeclarationName Name;
      /// Loc - The main source location for the declaration name.
      SourceLocation NameLoc;
      /// Info - Further source/type location info for special kinds of names.
      DeclarationNameLoc LocInfo;

    Not bad. There’s an awful lot of operations on this, that all seem to use LocInfo and be for C++ only. So that’s probably not relevant. So it’s pretty much just a name and location. Okay.

  • Going back to the things that go into FunctionDecl, way up there, next up is QualType, but I’m going to save this for last.

  • Then there’s TypeSourceInfo. I just don’t even know. So it’s probably something to do with locating the type part of the declarator (as opposed to the identifier part) in the source files. It’s implementation is an abomination of who knows what, though.

  • Next is StorageClass. From Basic/Specifiers.h. These are none/extern/static/etc.

  • The booleans parameters to a FunctionDecl look mostly syntactic. I’d say the hasWrittenPrototype is semantic though.

Then, there are more pieces the parser fills after construction:

NewFD->setParams(Params);

I’m going to stop there in looking at ASTs. There’s probably a lot of confusion in that little dump of information up there, but let’s return to QualType instead.

Type representation overview

At a conceptual level, this works simply enough, but the focus on efficiency makes everything more difficult (as usual.) Types are rather simple: there is QualType, which is essentially a pair of a Qualifiers and a Type. Qualifiers are things like const, restrict, and volatile. Type has constructors for just about everything: BuiltinType, ComplexType, TypedefType, PointerType, etc. Even ParenType, apparently just to keep all the syntax relevant. Structs are apparently called RecordType.

To see a full list, you can consult AST/TypeNodes.def.

If a type wraps other types, like PointerType for example, then those wrapped types are QualTypes. This is, of course, how things like “const char * const t” get represented.

Type checking then proceed by a truly shocking number of “isX” boolean functions you can find on Type in AST/Type.h. The general structure seems to be that if you need to examine the types in any depth, you will, for example, call isBuiltinType() and if true call castAs<BuiltinType>(). Or, I suppose, call getAs() and check for null.

There’s quite a lot I haven’t explored, but I hope this exercise in tracking down things and grepping through the source has been interesting to someone.

Advertisements
This entry was posted in Compilers and tagged , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s