Kaleidoscope Ch 2 : Parser & AST
Chapter 2: Parser & AST
This chapter of the Kaleidoscope tutorial shows how to extend our toy language from raw tokens into structured meaning. We do this by building a recursive‑descent parser and representing the result as an Abstract Syntax Tree (AST).
1. Why an AST?
Tokens alone (def, x, 42, +, etc.) do not capture structure. For example, x + y * 2 is a stream of tokens, but we need to know that multiplication binds tighter than addition. An AST expresses this directly:
+
/ \
x *
/ \
y 2
This tree form is critical because later stages (IR generation, optimization, interpretation) all rely on walking structured nodes, not raw text.
2. AST Node Classes
Each construct in Kaleidoscope has a dedicated AST node class.
class ExprAST {
public:
virtual ~ExprAST() = default;
};
// Number literal
class NumberExprAST : public ExprAST {
double Val;
public:
NumberExprAST(double Val) : Val(Val) {}
};
// Variable reference
class VariableExprAST : public ExprAST {
std::string Name;
public:
VariableExprAST(const std::string &Name) : Name(Name) {}
};
// Binary operation
class BinaryExprAST : public ExprAST {
char Op;
std::unique_ptr<ExprAST> LHS, RHS;
public:
BinaryExprAST(char Op, std::unique_ptr<ExprAST> LHS,
std::unique_ptr<ExprAST> RHS)
: Op(Op), LHS(std::move(LHS)), RHS(std::move(RHS)) {}
};
The design is deliberately simple: each node only stores what is logically necessary. For example, a binary operation doesn’t know about precedence—it just stores its operator and operands. Precedence is resolved during parsing, not encoded into the AST.
3. Token Management
The parser uses one‑token lookahead. We keep a global CurTok and advance by asking the lexer for the next token.
static int CurTok;
static int getNextToken() { return CurTok = gettok(); }
Every parsing function assumes CurTok holds the next token to be processed. When it consumes something, it calls getNextToken() to advance. This discipline is key: forgetting to advance is one of the most common parsing bugs.
4. Parsing Primaries
The base cases of our grammar are numbers, parenthesized expressions, and identifiers (which may be variables or function calls).
Numbers
static std::unique_ptr<ExprAST> ParseNumberExpr() {
auto Result = std::make_unique<NumberExprAST>(NumVal);
getNextToken(); // consume number
return std::move(Result);
}
This is the simplest: wrap the numeric value in a NumberExprAST and move on.
Parentheses
static std::unique_ptr<ExprAST> ParseParenExpr() {
getNextToken(); // eat '('
auto V = ParseExpression();
if (!V) return nullptr;
if (CurTok != ')')
return LogError("expected ')'");
getNextToken(); // eat ')'
return V;
}
Parentheses are important because they explicitly override operator precedence. The parser checks for a matching ) and ensures the enclosed expression is valid.
Identifiers: Variable or Function Call
static std::unique_ptr<ExprAST> ParseIdentifierExpr() {
std::string IdName = IdentifierStr;
getNextToken(); // eat identifier
if (CurTok != '(') {
// variable
return std::make_unique<VariableExprAST>(IdName);
}
// function call
getNextToken(); // eat '('
std::vector<std::unique_ptr<ExprAST>> Args;
if (CurTok != ')') {
while (true) {
if (auto Arg = ParseExpression())
Args.push_back(std::move(Arg));
else return nullptr;
if (CurTok == ')') break;
if (CurTok != ',')
return LogError("Expected ')' or ',' in argument list");
getNextToken(); // eat ','
}
}
getNextToken(); // eat ')'
return std::make_unique<CallExprAST>(IdName, std::move(Args));
}
This illustrates how the parser peeks ahead: if the identifier is followed by (, it is a call, otherwise just a variable. This trick avoids ambiguity without complicating the grammar.
5. Expressions and Operator Precedence
The heart of the parser is expression parsing. The goal is to ensure a + b * c is parsed as a + (b * c), not (a + b) * c.
Precedence Table
static std::map<char, int> BinopPrecedence;
static int GetTokPrecedence() {
if (!isascii(CurTok)) return -1;
int Prec = BinopPrecedence[CurTok];
if (Prec <= 0) return -1;
return Prec;
}
The table might assign * = 40, + = 20, - = 20, < = 10. Higher numbers mean stronger binding.
Parsing Expressions
static std::unique_ptr<ExprAST> ParseExpression() {
auto LHS = ParsePrimary();
if (!LHS) return nullptr;
return ParseBinOpRHS(0, std::move(LHS));
}
This function starts with a primary, then hands control to ParseBinOpRHS, which manages binary operations based on precedence.
Parsing RHS with Precedence Climbing
static std::unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec,
std::unique_ptr<ExprAST> LHS) {
while (true) {
int TokPrec = GetTokPrecedence();
if (TokPrec < ExprPrec)
return LHS;
int BinOp = CurTok;
getNextToken(); // eat binop
auto RHS = ParsePrimary();
if (!RHS) return nullptr;
int NextPrec = GetTokPrecedence();
if (TokPrec < NextPrec) {
RHS = ParseBinOpRHS(TokPrec+1, std::move(RHS));
if (!RHS) return nullptr;
}
LHS = std::make_unique<BinaryExprAST>(BinOp, std::move(LHS), std::move(RHS));
}
}
This is a precedence climbing algorithm. It repeatedly parses RHS expressions and groups them correctly. If a later operator has higher precedence, it parses that first before combining.
6. Functions: Prototypes, Definitions, and Externs
Functions are built in two steps: a prototype (name + argument list) and an optional body expression.
class PrototypeAST {
std::string Name;
std::vector<std::string> Args;
public:
PrototypeAST(const std::string &Name, std::vector<std::string> Args)
: Name(Name), Args(std::move(Args)) {}
};
class FunctionAST {
std::unique_ptr<PrototypeAST> Proto;
std::unique_ptr<ExprAST> Body;
public:
FunctionAST(std::unique_ptr<PrototypeAST> Proto,
std::unique_ptr<ExprAST> Body)
: Proto(std::move(Proto)), Body(std::move(Body)) {}
};
Parsing rules:
def foo(x y) x+y→ prototype + body →FunctionASTextern sin(x)→ prototype only →PrototypeAST- top‑level expression → anonymous function
This design allows the REPL to treat every top‑level construct uniformly.
7. Driver Loop
The REPL loop checks CurTok and dispatches appropriately:
static void MainLoop() {
while (true) {
switch (CurTok) {
case tok_eof: return;
case ';': getNextToken(); break; // ignore
case tok_def: HandleDefinition(); break;
case tok_extern: HandleExtern(); break;
default: HandleTopLevelExpression(); break;
}
}
}
This makes the language interactive: users can type definitions, externs, or expressions, and the parser responds immediately.
8. Example Walkthroughs
- Input
def add(x y) x + y*2- Prototype:
add(x,y) - Body:
x + (y*2) - AST shape:
FunctionAST(Proto("add",[x,y]), BinaryExprAST('+',Var(x),BinaryExprAST('*',Var(y),Num(2))))
- Prototype:
- Input
extern sin(a)- AST:
PrototypeAST("sin",[a])
- AST:
- Input
add(3,4)- AST:
CallExprAST("add",[Num(3),Num(4)])
- AST:
- Input
(1+2)*3- AST:
BinaryExprAST('*', BinaryExprAST('+',Num(1),Num(2)), Num(3))
- AST:
These examples demonstrate how ASTs cleanly capture intended precedence and structure.
9. Common Pitfalls
- Forgetting to advance with
getNextToken()→ parser stalls. - Not validating closing
)→ parenexpr fails. - Misclassifying identifiers without peeking
'('. - Missing operator precedence entry → operators parsed incorrectly.
- Failing to consume a token after error → REPL loop stuck.
10. Why this design works
This design is simple yet powerful:
- Each AST node is minimal and maps directly to codegen in the next chapter.
- Recursive descent mirrors the grammar directly, making it easy to extend.
- Precedence climbing handles all binary operators with one function.
- The driver loop provides an interactive, incremental environment.
Summary
Chapter 2 introduces parsing and AST construction.
It defines a grammar, builds node classes, writes parsing functions for each construct, implements operator precedence, and ties it together in a REPL.
With this, Kaleidoscope can now parse definitions, externs, and expressions into structured ASTs, paving the way for LLVM IR code generation in Chapter 3.