Implementing a Simple Compiler in C: Parsing and Code Generation

Introduction to Building a Compiler in C

Writing a compiler from scratch is one of the most effective ways to deepen your understanding of how high-level programming languages translate into machine-executable code. While modern compilers are complex systems involving many optimization passes, the core principles remain accessible: parsing the source text into a structured representation and then generating target code from that structure. In this article, we explore these two stages in detail using the C language, providing a practical foundation that you can extend into a full compiler implementation. We assume familiarity with C programming and basic data structures like linked lists and trees.

The journey from source code to executable output can be broken into several phases: lexical analysis, syntactic analysis (parsing), semantic analysis, intermediate code generation, optimization, and final code generation. For our simple compiler, we focus on the first two and the final code generation step, omitting semantic analysis and optimization for clarity. By the end of this article, you will have built a working expression compiler that reads arithmetic expressions and outputs instructions for a stack-based virtual machine.

Parsing: From Characters to Syntax Trees

Parsing is the process of analyzing a sequence of tokens (the output of lexical analysis) to determine its grammatical structure according to a formal grammar. The result is typically a parse tree or an abstract syntax tree (AST). For our compiler, we use a recursive descent parser, a top-down approach that directly maps grammar rules to mutually recursive functions in C.

Lexical Analysis: Tokenizing the Input

Before parsing, the source code must be broken into meaningful units called tokens. In a simple arithmetic expression such as 3 + 4 * 2, the tokens are: integer 3, operator +, integer 4, operator *, and integer 2. A lexer (also called a tokenizer) reads the input character by character and groups them into tokens according to predefined patterns. Each token typically carries a type (e.g., NUMBER, PLUS, TIMES) and an optional value (the actual integer for a number).

Implementing a lexer in C is straightforward: iterate over a string, skip whitespace, classify characters using isdigit() or direct comparisons, and produce token structures. For our simple compiler, we define an enumeration for token types and store the token sequence in an array or generate them on the fly using a function getNextToken(). The lexer is the first component your parser will call, consuming the input stream and returning tokens one at a time.

Grammar and Recursive Descent Parsing

A grammar defines the syntax of the language. For arithmetic expressions with addition, subtraction, multiplication, division, and parentheses, a typical grammar is:

expr    -> term (('+' | '-') term)*
term    -> factor (('*' | '/') factor)*
factor  -> NUMBER | '(' expr ')'

This grammar is left-recursive? Actually, the * repetition (Kleene star) avoids left recursion in top-down parsers. In a recursive descent parser, you write a function for each nonterminal: parseExpr(), parseTerm(), parseFactor(). Each function consumes tokens according to the grammar rule and returns an AST node. For example, parseExpr() first calls parseTerm() to get the left operand, then loops while the next token is a plus or minus, parsing right operands and building a binary operator node.

One key advantage of recursive descent is that the code structure mirrors the grammar, making it easy to extend and debug. Error reporting integrates naturally: when an unexpected token appears, you can print a descriptive message indicating expected tokens and the current position in input.

Handling Operator Precedence and Associativity

The grammar above implicitly handles precedence: multiplication and division bind tighter than addition and subtraction because term appears in expr only after term is fully resolved. For associativity, left-associative operators (like +) are handled naturally by the left-to-right loops. For right-associative operators (like exponentiation if you add them), you would adjust the loop to parse the right operand recursively.

Building the Abstract Syntax Tree (AST)

During parsing, we construct an AST that represents the essential structure of the expression without superfluous details like parentheses (which are encoded in the tree shape). For each binary operation, we create a node with fields for the operator kind, left child, and right child. For numbers, we create a leaf node storing the integer value. An AST node in C might look like:

typedef struct ASTNode {
    enum { NUMBER_NODE, BINOP_NODE } type;
    union {
        struct { int value; } number;
        struct { char op; struct ASTNode *left, *right; } binop;
    };
} ASTNode;

This tree is later consumed by the code generator to emit target instructions.

Code Generation: Translating the AST to Target Instructions

Once we have a valid AST, the next step is to generate executable code. For simplicity, we target a stack-based virtual machine that simulates a simple processor with a stack and a few instructions. This approach sidesteps the complexities of register allocation and ISA-specific details while illustrating the core ideas of code generation.

Design of the Virtual Machine Instruction Set

Our VM will support the following instructions:

PUSH n – push integer n onto the stack.
ADD – pop the top two values, add them, push the result.
SUB – pop top two, subtract (second minus top), push result.
MUL – pop top two, multiply, push result.
DIV – pop top two, integer divide (second divided by top), push result.
PRINT – pop the top value and output it (for demonstration).

To generate these instructions, we traverse the AST in post-order (left, right, root). For a NUMBER node, we emit PUSH value. For a BINOP node, we recursively generate code for the left child, then the right child, then emit the appropriate arithmetic instruction (ADD, SUB, MUL, DIV). This ensures operands are pushed onto the stack in the correct order before the operation consumes them.

Example: Generating Code for an AST

Consider the AST for 3 + 4 * 2. After parsing, the AST reflects precedence: the multiplication node becomes the right child of the addition node. Traversing post-order yields: push 3, push 4, push 2, multiply, add, print. The generated instruction sequence would be:

PUSH 3
PUSH 4
PUSH 2
MUL
ADD
PRINT

This program, when executed on our VM, correctly computes 3 + (4*2) = 11.

Integrating with the Parser

Our compiler driver will orchestrate the phases: call the lexer to get tokens, feed them to the parser which builds the AST, then pass the AST to the code generator that prints the instruction sequence. For a real compiler, you would write the instruction sequence to a file or directly execute it in a VM. For our example, we simply output human-readable text.

Complete Example: Building an Expression Compiler in C

Let's put everything together into a working compiler for arithmetic expressions. We'll implement the lexer, parser, AST builder, and code generator in a single C file (or a few modules). Below is a condensed version of the key functions.

Lexer Implementation

#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>

typedef enum { TOK_NUMBER, TOK_PLUS, TOK_MINUS, TOK_MUL, TOK_DIV, TOK_LPAREN, TOK_RPAREN, TOK_END } TokenType;

typedef struct {
    TokenType type;
    int value; // only for TOK_NUMBER
} Token;

Token cur_token;

void advance() {
    // ... read next token from stdin (omitted for brevity)
}

Parser and AST Construction

typedef struct ASTNode ASTNode;
ASTNode* parseExpr();
ASTNode* parseTerm();
ASTNode* parseFactor() {
    if (cur_token.type == TOK_NUMBER) {
        ASTNode* node = malloc(sizeof(ASTNode));
        node->type = NUMBER_NODE;
        node->number.value = cur_token.value;
        advance();
        return node;
    } else if (cur_token.type == TOK_LPAREN) {
        advance();
        ASTNode* node = parseExpr();
        if (cur_token.type != TOK_RPAREN) { error("missing )"); }
        advance();
        return node;
    } else {
        error("unexpected token");
        return NULL;
    }
}
// parseTerm and parseExpr follow the grammar with loops for binary ops.

Code Generator

void generate(ASTNode* node) {
    if (node->type == NUMBER_NODE) {
        printf("PUSH %d\n", node->number.value);
    } else if (node->type == BINOP_NODE) {
        generate(node->binop.left);
        generate(node->binop.right);
        switch (node->binop.op) {
            case '+': printf("ADD\n"); break;
            case '-': printf("SUB\n"); break;
            case '*': printf("MUL\n"); break;
            case '/': printf("DIV\n"); break;
        }
    }
}

With a main function that initializes the lexer, calls parseExpr(), then calls generate() followed by a PRINT instruction, we have a complete simple compiler. The resulting output can be fed to a simple stack machine interpreter (which you can also write in C).

Extending the Compiler

The compiler we built handles only integer arithmetic. Real programming languages require much more: variables, control flow, functions, arrays, and type checking. Here are some natural extensions:

Variables and Assignment: Add a symbol table to store variable names and values. Extend the grammar with statements like id = expr ; and generate code to store and load from memory.
Control Flow: Introduce if, while, and for constructs. This requires conditional jumps in the VM instruction set, e.g., JUMP_IF_ZERO and JUMP.
Functions: Add function definitions and calls. This introduces the need for a call stack, parameters, and return values, significantly increasing complexity.
Type System: Add static type checking (e.g., forbid adding a string to an integer) by extending the AST with type information and performing semantic analysis after parsing.
Optimization: Implement constant folding (evaluating constant expressions at compile time) or peephole optimization on the generated code.

Each of these extensions deepens your understanding of compiler construction and can be added incrementally to the foundational parser and code generator we developed.

Resources and Further Reading

To continue your journey into compiler development, consider these external resources:

Recursive Descent Parser (Wikipedia) – a detailed explanation of the parsing technique used in this article.
ANTLR – a powerful parser generator that automates lexer and parser creation, useful for more complex grammars.
The Dragon Book (Wikipedia) – the classic textbook on compiler design, covering all phases in depth.

Conclusion

Implementing a simple compiler in C demystifies how source code becomes executable. By tackling parsing and code generation from scratch, you gain direct insight into lexical analysis, grammar design, recursive descent parsing, AST construction, and instruction emission. The stack-based virtual machine approach provides a clear, manageable target for learning code generation without the overhead of a real CPU instruction set. With the foundation laid here, you are well-equipped to expand your compiler to support more language features, explore optimization techniques, or even target a real architecture like x86 or ARM. The process is challenging but tremendously rewarding, offering a deeper appreciation for every language you use.