Kaleidoscope Ch 2 : Parser & AST

Chapter 2: Parser & AST

This chapter of the Kaleidoscope tutorial shows how to extend our toy language from raw tokens into structured meaning. We do this by building a recursive‑descent parser and representing the result as an Abstract Syntax Tree (AST).

1. Why an AST?

Tokens alone (def, x, 42, +, etc.) do not capture structure. For example, x + y * 2 is a stream of tokens, but we need to know that multiplication binds tighter than addition. An AST expresses this directly:

   +
  / \
 x   *
    / \
   y   2

This tree form is critical because later stages (IR generation, optimization, interpretation) all rely on walking structured nodes, not raw text.

2. AST Node Classes

Each construct in Kaleidoscope has a dedicated AST node class.

class ExprAST {
public:
  virtual ~ExprAST() = default;
};

// Number literal
class NumberExprAST : public ExprAST {
  double Val;
public:
  NumberExprAST(double Val) : Val(Val) {}
};

// Variable reference
class VariableExprAST : public ExprAST {
  std::string Name;
public:
  VariableExprAST(const std::string &Name) : Name(Name) {}
};

// Binary operation
class BinaryExprAST : public ExprAST {
  char Op;
  std::unique_ptr<ExprAST> LHS, RHS;
public:
  BinaryExprAST(char Op, std::unique_ptr<ExprAST> LHS,
                std::unique_ptr<ExprAST> RHS)
      : Op(Op), LHS(std::move(LHS)), RHS(std::move(RHS)) {}
};

The design is deliberately simple: each node only stores what is logically necessary. For example, a binary operation doesn’t know about precedence—it just stores its operator and operands. Precedence is resolved during parsing, not encoded into the AST.

3. Token Management

The parser uses one‑token lookahead. We keep a global CurTok and advance by asking the lexer for the next token.

static int CurTok;
static int getNextToken() { return CurTok = gettok(); }

Every parsing function assumes CurTok holds the next token to be processed. When it consumes something, it calls getNextToken() to advance. This discipline is key: forgetting to advance is one of the most common parsing bugs.

4. Parsing Primaries

The base cases of our grammar are numbers, parenthesized expressions, and identifiers (which may be variables or function calls).

Numbers

static std::unique_ptr<ExprAST> ParseNumberExpr() {
  auto Result = std::make_unique<NumberExprAST>(NumVal);
  getNextToken(); // consume number
  return std::move(Result);
}

This is the simplest: wrap the numeric value in a NumberExprAST and move on.

Parentheses

static std::unique_ptr<ExprAST> ParseParenExpr() {
  getNextToken(); // eat '('
  auto V = ParseExpression();
  if (!V) return nullptr;

  if (CurTok != ')')
    return LogError("expected ')'");
  getNextToken(); // eat ')'
  return V;
}

Parentheses are important because they explicitly override operator precedence. The parser checks for a matching ) and ensures the enclosed expression is valid.

Identifiers: Variable or Function Call

static std::unique_ptr<ExprAST> ParseIdentifierExpr() {
  std::string IdName = IdentifierStr;
  getNextToken(); // eat identifier

  if (CurTok != '(') {
    // variable
    return std::make_unique<VariableExprAST>(IdName);
  }

  // function call
  getNextToken(); // eat '('
  std::vector<std::unique_ptr<ExprAST>> Args;
  if (CurTok != ')') {
    while (true) {
      if (auto Arg = ParseExpression())
        Args.push_back(std::move(Arg));
      else return nullptr;

      if (CurTok == ')') break;
      if (CurTok != ',')
        return LogError("Expected ')' or ',' in argument list");
      getNextToken(); // eat ','
    }
  }
  getNextToken(); // eat ')'
  return std::make_unique<CallExprAST>(IdName, std::move(Args));
}

This illustrates how the parser peeks ahead: if the identifier is followed by (, it is a call, otherwise just a variable. This trick avoids ambiguity without complicating the grammar.

5. Expressions and Operator Precedence

The heart of the parser is expression parsing. The goal is to ensure a + b * c is parsed as a + (b * c), not (a + b) * c.

Precedence Table

static std::map<char, int> BinopPrecedence;

static int GetTokPrecedence() {
  if (!isascii(CurTok)) return -1;
  int Prec = BinopPrecedence[CurTok];
  if (Prec <= 0) return -1;
  return Prec;
}

The table might assign * = 40, + = 20, - = 20, < = 10. Higher numbers mean stronger binding.

Parsing Expressions

static std::unique_ptr<ExprAST> ParseExpression() {
  auto LHS = ParsePrimary();
  if (!LHS) return nullptr;
  return ParseBinOpRHS(0, std::move(LHS));
}

This function starts with a primary, then hands control to ParseBinOpRHS, which manages binary operations based on precedence.

Parsing RHS with Precedence Climbing

static std::unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec,
                                              std::unique_ptr<ExprAST> LHS) {
  while (true) {
    int TokPrec = GetTokPrecedence();
    if (TokPrec < ExprPrec)
      return LHS;

    int BinOp = CurTok;
    getNextToken(); // eat binop

    auto RHS = ParsePrimary();
    if (!RHS) return nullptr;

    int NextPrec = GetTokPrecedence();
    if (TokPrec < NextPrec) {
      RHS = ParseBinOpRHS(TokPrec+1, std::move(RHS));
      if (!RHS) return nullptr;
    }

    LHS = std::make_unique<BinaryExprAST>(BinOp, std::move(LHS), std::move(RHS));
  }
}

This is a precedence climbing algorithm. It repeatedly parses RHS expressions and groups them correctly. If a later operator has higher precedence, it parses that first before combining.

6. Functions: Prototypes, Definitions, and Externs

Functions are built in two steps: a prototype (name + argument list) and an optional body expression.

class PrototypeAST {
  std::string Name;
  std::vector<std::string> Args;
public:
  PrototypeAST(const std::string &Name, std::vector<std::string> Args)
      : Name(Name), Args(std::move(Args)) {}
};

class FunctionAST {
  std::unique_ptr<PrototypeAST> Proto;
  std::unique_ptr<ExprAST> Body;
public:
  FunctionAST(std::unique_ptr<PrototypeAST> Proto,
              std::unique_ptr<ExprAST> Body)
      : Proto(std::move(Proto)), Body(std::move(Body)) {}
};

Parsing rules:

def foo(x y) x+y → prototype + body → FunctionAST
extern sin(x) → prototype only → PrototypeAST
top‑level expression → anonymous function

This design allows the REPL to treat every top‑level construct uniformly.

7. Driver Loop

The REPL loop checks CurTok and dispatches appropriately:

static void MainLoop() {
  while (true) {
    switch (CurTok) {
    case tok_eof: return;
    case ';': getNextToken(); break; // ignore
    case tok_def: HandleDefinition(); break;
    case tok_extern: HandleExtern(); break;
    default: HandleTopLevelExpression(); break;
    }
  }
}

This makes the language interactive: users can type definitions, externs, or expressions, and the parser responds immediately.

8. Example Walkthroughs

Input
```
def add(x y) x + y*2
```
- Prototype: add(x,y)
- Body: x + (y*2)
- AST shape: FunctionAST(Proto("add",[x,y]), BinaryExprAST('+',Var(x),BinaryExprAST('*',Var(y),Num(2))))
Input
```
extern sin(a)
```
- AST: PrototypeAST("sin",[a])
Input
```
add(3,4)
```
- AST: CallExprAST("add",[Num(3),Num(4)])
Input
```
(1+2)*3
```
- AST: BinaryExprAST('*', BinaryExprAST('+',Num(1),Num(2)), Num(3))

These examples demonstrate how ASTs cleanly capture intended precedence and structure.

9. Common Pitfalls

Forgetting to advance with getNextToken() → parser stalls.
Not validating closing ) → parenexpr fails.
Misclassifying identifiers without peeking '('.
Missing operator precedence entry → operators parsed incorrectly.
Failing to consume a token after error → REPL loop stuck.

10. Why this design works

This design is simple yet powerful:

Each AST node is minimal and maps directly to codegen in the next chapter.
Recursive descent mirrors the grammar directly, making it easy to extend.
Precedence climbing handles all binary operators with one function.
The driver loop provides an interactive, incremental environment.

Summary

Chapter 2 introduces parsing and AST construction.
It defines a grammar, builds node classes, writes parsing functions for each construct, implements operator precedence, and ties it together in a REPL.

With this, Kaleidoscope can now parse definitions, externs, and expressions into structured ASTs, paving the way for LLVM IR code generation in Chapter 3.