Abstract syntax tree
An abstract syntax tree (AST) is a structured representation used in computer science to model the hierarchical organisation of source code written in a formal language. It is expressed as a tree data structure in which each node denotes a specific syntactic construct in the program. Unlike the full grammatical detail preserved in concrete syntax trees, the AST presents a simplified and abstracted view of the code, omitting inessential punctuation, grouping symbols, and syntactic noise while retaining the logical and structural relationships essential for analysis and transformation.
ASTs are fundamental to the design and operation of modern compilers, interpreters, program analysis tools, and transformation systems. They serve as a bridge between raw source code and the internal representations used during optimisation, semantic checking, and code generation.
Nature and Purpose of ASTs
An AST focuses on the logical components of a program rather than its surface-level syntax. Structural elements such as parentheses are implicit in the tree hierarchy and therefore not represented as explicit nodes. Control structures, for example, an if statement, may be captured as a single node with branches corresponding to the condition, the true branch, and the false branch.
Because of this abstraction, ASTs differ from parse trees, which are produced directly by parsers based on the context-free grammar of a language. While parse trees reflect every token and grammatical rule, ASTs distil these down to their semantic essence. Once a parse tree is built, subsequent compiler stages annotate or convert it into an AST enriched with type information, source positions, and other metadata that facilitate analysis and error reporting.
ASTs are widely used beyond compilation, including in code editors, refactoring tools, program verification environments, and clone detection systems.
ASTs in Compilers
One of the principal applications of ASTs is in compilers. The AST is generally produced at the end of the syntax analysis phase and becomes the primary intermediate representation for further compilation stages. Its structural clarity and reduced complexity make it suitable for transformations, semantic analysis, and optimisation.
An AST provides several advantages:
- ModifiabilityIt can be extended with annotations and attributes, such as types, scope information, or inferred properties, without altering the original source code.
- SimplificationIt avoids syntactic details such as brackets and semicolons, focusing on meaningful constructs.
- Detailed contextIt often contains source code positions for diagnostics, error reporting, and debugging support.
- DisambiguationProgramming language features that cannot be fully expressed in a context-free grammar—such as context-dependent type usage, operator overloading, or dynamically introduced identifiers—can be handled through the contextual information attached to AST nodes.
As compilation progresses, the AST may undergo verification during semantic analysis. This includes checking type correctness, ensuring identifiers are properly declared, and validating expressions and statements according to the language’s rules. Symbol tables are often generated during this phase, mapping identifiers to declarations and attributes.
After semantic verification, the AST may be translated into an intermediate representation (IR), serving as the basis for machine-dependent or machine-independent optimisations and eventually target code generation.
Design Considerations
The design of an AST is tightly linked to the structure of the programming language and its compiler architecture. Essential design criteria include:
- Preservation of type information and declaration contextEach variable, function, or type must be associated with its location and attributes.
- Explicit ordering of executionThe tree structure must reflect the exact order in which statements and operations occur.
- Correct representation of binary and unary operationsLeft and right operands must be clearly distinguished.
- Storage of identifiers and assigned valuesAssignment nodes must capture both the target and the expression being assigned.
Languages with constructs requiring variable numbers of elements, such as argument lists, necessitate AST node types capable of containing dynamic collections of child nodes. To verify compiler correctness, it must be possible to unparse—i.e., regenerate—readable source code from the AST, replicating its behaviour upon recompilation.
Additional Applications
AST differencingTree differencing identifies the changes between two ASTs by generating an edit script. This is used in version-control systems, refactoring tools, and automated repair systems. Each edit action corresponds to tree operations such as inserting, deleting, or modifying nodes.
Clone detectionBecause ASTs capture abstract program structure, they are effective for identifying code fragments that are syntactically different but semantically similar. AST-based clone detection supports quality assurance, refactoring, and security analysis.
Related Concepts
ASTs are associated with several other structures and concepts:
- ASGs (abstract semantic graphs), which represent semantic rather than syntactic structure.
- Directed acyclic graphs (DAGs), used when common subexpressions or shared constructs must be represented without duplication.
- DOM (Document Object Model), which organises structured documents hierarchically, similar to ASTs for code.
- Concrete syntax trees, which include all syntactic details recognised by the parser.
- SRTs (syntax representation trees) and related structures used in specific language families.