Kaleidoscope Ch 3 : LLVM IR
Chapter 3: Generating LLVM IR
In Chapter 3, Kaleidoscope moves from parsed structure (the AST) to executable form by emitting LLVM IR.
1. Big picture: how codegen works
- Each AST node implements a codegen() method:
  - ExprAST nodes return an llvm::Value* (the SSA value that holds the result).
  - PrototypeAST returns an llvm::Function* (the IR-level function declaration).
  - FunctionAST returns an llvm::Function* (the full definition after emitting the body).
- We keep global-ish LLVM singletons:
  - LLVMContext: owns core LLVM data structures
  - Module: container for all the IR we emit (functions, globals, etc.)
  - IRBuilder<>: a helper that inserts instructions into the current basic block
- We maintain a symbol table (map) from variable name → Value* to resolve identifiers inside a function body.
Everything is a double in Kaleidoscope (for now), so types are uniform: llvm::Type::getDoubleTy(*TheContext).
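The codegen() interface from the list above can be sketched as follows. This is a minimal outline rather than the tutorial's full class hierarchy: the constructor signatures and member names are assumptions in the style of a Chapter 2 AST, and the LLVM types are only forward-declared so the shape of the interface stands on its own.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Forward declarations suffice here because we only pass pointers around.
namespace llvm {
class Value;
class Function;
} // namespace llvm

// Base class for every expression node: codegen() yields an SSA value.
class ExprAST {
public:
  virtual ~ExprAST() = default;
  virtual llvm::Value *codegen() = 0;
};

// A function's name and parameter names: codegen() yields a declaration.
class PrototypeAST {
  std::string Name;
  std::vector<std::string> Args;

public:
  PrototypeAST(std::string Name, std::vector<std::string> Args)
      : Name(std::move(Name)), Args(std::move(Args)) {}
  const std::string &getName() const { return Name; }
  llvm::Function *codegen(); // body lives with the other codegen methods
};

// A prototype plus a body expression: codegen() yields the full definition.
class FunctionAST {
  std::unique_ptr<PrototypeAST> Proto;
  std::unique_ptr<ExprAST> Body;

public:
  FunctionAST(std::unique_ptr<PrototypeAST> Proto,
              std::unique_ptr<ExprAST> Body)
      : Proto(std::move(Proto)), Body(std::move(Body)) {}
  llvm::Function *codegen(); // body lives with the other codegen methods
};
```

Concrete expression nodes (numbers, variables, binary operators, calls) derive from ExprAST and override codegen().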
2. LLVM objects we need
static std::unique_ptr<llvm::LLVMContext> TheContext;
static std::unique_ptr<llvm::IRBuilder<>> Builder;
static std::unique_ptr<llvm::Module> TheModule;
static std::map<std::string, llvm::Value*> NamedValues;
static void InitializeModule() {
  TheContext = std::make_unique<llvm::LLVMContext>();
  TheModule = std::make_unique<llvm::Module>("my cool jit", *TheContext);
  Builder = std::make_unique<llvm::IRBuilder<>>(*TheContext);
  NamedValues.clear();
}
- TheModule is the IR container (think: a translation unit).
- Builder inserts instructions at the end of the currently active BasicBlock.
3. Codegen for expressions
3.1 Number literal
llvm::Value *NumberExprAST::codegen() {
  return llvm::ConstantFP::get(*TheContext, llvm::APFloat(Val));
}
This returns a floating-point constant of type double. No instructions are emitted; it’s a constant value.
3.2 Variable reference
llvm::Value *VariableExprAST::codegen() {
  auto *V = NamedValues[Name];
  if (!V) return LogErrorV("Unknown variable name");
  return V; // function argument or previously bound value
}
Variables resolve through the current function’s symbol table. In Chapter 3, variables are just function arguments, so NamedValues is populated from the function’s prototype when we enter the function body.
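One detail of the lookup above is worth a sketch: operator[] on a std::map default-inserts a null entry for any name it does not find, so every failed lookup leaves a key behind. A find-based helper avoids that side effect. The Value struct, the ValueTable alias, and the lookupVariable name below are hypothetical stand-ins, used so the example compiles without LLVM headers.

```cpp
#include <map>
#include <string>

struct Value {}; // stand-in for llvm::Value (assumption for this sketch)

using ValueTable = std::map<std::string, Value *>;

// Returns the bound value, or nullptr if Name is unknown.
// Unlike Table[Name], this never modifies the table.
Value *lookupVariable(const ValueTable &Table, const std::string &Name) {
  auto It = Table.find(Name);
  return It == Table.end() ? nullptr : It->second;
}
```

Because the table is taken by const reference, the compiler also rejects any accidental insertion inside the helper.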
3.3 Binary operations
llvm::Value *BinaryExprAST::codegen() {
  llvm::Value *L = LHS->codegen();
  llvm::Value *R = RHS->codegen();
  if (!L || !R) return nullptr;
  switch (Op) {
  case '+': return Builder->CreateFAdd(L, R, "addtmp");
  case '-': return Builder->CreateFSub(L, R, "subtmp");
  case '*': return Builder->CreateFMul(L, R, "multmp");
  case '<': {
    // 'fcmp ult' yields i1; convert to double 0.0/1.0 to stay in double world
    llvm::Value *Cmp = Builder->CreateFCmpULT(L, R, "cmptmp");
    return Builder->CreateUIToFP(Cmp, llvm::Type::getDoubleTy(*TheContext), "booltmp");
  }
  default:
    return LogErrorV("invalid binary operator");
  }
}
- All arithmetic is in floating point (double).
- Comparisons return a boolean (i1), which we convert to double so the language stays single-typed.
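The '<' case can be modeled in plain C++ to make its semantics concrete. fcmp ult is an unordered comparison: it is true when the left operand is less than the right, or when either operand is NaN. uitofp then maps the i1 result to 0.0 or 1.0. The function name lessThanAsDouble is made up for this illustration and is not part of the tutorial.

```cpp
#include <cmath>

// Plain C++ model of 'fcmp ult' followed by 'uitofp i1 to double'.
double lessThanAsDouble(double L, double R) {
  // "Unordered or less than": true if either operand is NaN, or L < R.
  bool Cmp = std::isnan(L) || std::isnan(R) || L < R;
  // uitofp turns the 1-bit result into 0.0 or 1.0.
  return Cmp ? 1.0 : 0.0;
}
```

Under this model, 1 < 2 evaluates to 1.0, 2 < 1 to 0.0, and any comparison involving NaN to 1.0.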
3.4 Function call
llvm::Value *CallExprAST::codegen() {
  llvm::Function *CalleeF = TheModule->getFunction(Callee);
  if (!CalleeF) return LogErrorV("Unknown function referenced");
  if (CalleeF->arg_size() != Args.size())
    return LogErrorV("Incorrect # arguments passed");
  std::vector<llvm::Value*> ArgsV;
  ArgsV.reserve(Args.size());
  for (auto &Arg : Args) {
    auto *V = Arg->codegen();
    if (!V) return nullptr;
    ArgsV.push_back(V);
  }
  return Builder->CreateCall(CalleeF, ArgsV, "calltmp");
}
We look up the callee by name in the module, codegen each argument, and emit a call instruction.
4. Prototypes and function definitions
4.1 Prototype → llvm::Function*
llvm::Function *PrototypeAST::codegen() {
  std::vector<llvm::Type*> Doubles(Args.size(),
                                   llvm::Type::getDoubleTy(*TheContext));
  auto *FT = llvm::FunctionType::get(llvm::Type::getDoubleTy(*TheContext),
                                     Doubles, /*isVarArg=*/false);
  auto *F = llvm::Function::Create(FT, llvm::Function::ExternalLinkage,
                                   Name, TheModule.get());
  // Name the arguments
  unsigned Idx = 0;
  for (auto &Arg : F->args())
    Arg.setName(Args[Idx++]);
  return F;
}
- Return type and all parameters are double.
- We name each IR argument to match source names (handy for debugging and mapping to NamedValues).
4.2 Function definition → create block, bind args, emit body
llvm::Function *FunctionAST::codegen() {
  // 1) Get or insert the function declaration
  llvm::Function *TheFunction = TheModule->getFunction(Proto->getName());
  if (!TheFunction) TheFunction = Proto->codegen();
  if (!TheFunction) return nullptr;
  // Refuse to emit a second body into a function that already has one
  if (!TheFunction->empty()) {
    LogErrorV("Function cannot be redefined");
    return nullptr;
  }
  // 2) Create a basic block and set insertion point
  llvm::BasicBlock *BB = llvm::BasicBlock::Create(*TheContext, "entry", TheFunction);
  Builder->SetInsertPoint(BB);
  // 3) Bind arguments into the NamedValues table
  NamedValues.clear();
  for (auto &Arg : TheFunction->args())
    NamedValues[std::string(Arg.getName())] = &Arg;
  // 4) Emit the body
  if (llvm::Value *RetVal = Body->codegen()) {
    Builder->CreateRet(RetVal);
    // 5) Validate the generated code for consistency
    llvm::verifyFunction(*TheFunction);
    return TheFunction;
  }
  // 6) Error reading body: remove function
  TheFunction->eraseFromParent();
  return nullptr;
}
- We always create an "entry" block and insert code there.
- On success we generate a ret double instruction; on failure we erase the partial function to keep the module clean.
Note: Optimization passes come in the next chapter in the original tutorial. Here we just ensure the IR verifies.
5. Example: IR you’ll see
Source (Kaleidoscope)
def add(x y) x + y*2
High‑level IR shape (simplified)
define double @add(double %x, double %y) {
entry:
  %multmp = fmul double %y, 2.000000e+00
  %addtmp = fadd double %x, %multmp
  ret double %addtmp
}
- %x and %y are the named IR arguments.
- Temporary names (%multmp, %addtmp) come from the IRBuilder hints we provided.
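Read as straight-line code, the IR above corresponds to the following hand-written C++. It is only an analogy, meant to show that each SSA temporary is assigned exactly once; LLVM does not produce this C++.

```cpp
// C++ analogue of the IR for: def add(x y) x + y*2
double add(double x, double y) {
  const double multmp = y * 2.0;    // %multmp = fmul double %y, 2.0
  const double addtmp = x + multmp; // %addtmp = fadd double %x, %multmp
  return addtmp;                    // ret double %addtmp
}
```

The multiplication is emitted first because it binds tighter than the addition, so the right-hand subtree is fully evaluated before the fadd can use its result.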
6. Symbol resolution & scoping
- In Chapter 3, variables are parameters only; there are no let-style locals yet.
- The current function’s symbol table is just NamedValues, populated from the prototype.
- Later chapters add local variables and the “alloca + mem2reg” pattern.
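The per-function scoping described above amounts to clearing the table and rebinding it from the current function's arguments. The sketch below models that with stand-in types (Argument, ArgumentTable, and bindArguments are hypothetical, chosen so the example needs no LLVM headers) to show that names from one function do not leak into the next.

```cpp
#include <map>
#include <string>
#include <vector>

struct Argument { std::string Name; }; // stand-in for llvm::Argument

using ArgumentTable = std::map<std::string, const Argument *>;

// Drop the previous function's bindings, then bind each argument by name.
// The table stores pointers, so Args must outlive the table's use.
void bindArguments(ArgumentTable &Table, const std::vector<Argument> &Args) {
  Table.clear();
  for (const Argument &A : Args)
    Table[A.Name] = &A;
}
```

Calling bindArguments for a function with parameters x and y, then again for a function with parameter a, leaves only a in the table.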
7. Utility: verifying and dumping the module
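- After generating functions, we can print the entire module: TheModule->print(llvm::errs(), nullptr);
- Always run llvm::verifyFunction(*F); to catch malformed IR early. It returns true when the function is broken, and passing &llvm::errs() as the second argument prints the reason.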
8. Putting it together (minimal driver idea)
- Parse input with the driver from Chapter 2.
- On def: codegen the FunctionAST.
- On extern: codegen the PrototypeAST (declaration only).
- On top-level expression: codegen the anonymous wrapper function and print the IR.
- Keep accumulating into TheModule so functions can call each other.
9. Common pitfalls in codegen
- Mismatched types: everything must be double, so make sure calls and operators all use DoubleTy.
- Unknown variable: if it’s not in NamedValues, you likely forgot to bind a parameter.
- Unterminated block: every basic block in a function must end with a terminator (ret, br, etc.).
- No verification: skipping verifyFunction hides bugs until later.
- Uninitialized state: call InitializeModule() before any codegen so the context, module, and builder exist.
10. Where we go next
- Chapter 4 adds optimization passes and JIT execution.
- Later chapters add conditionals and loops (Chapter 5), user-defined operators (Chapter 6), and mutable variables via alloca + mem2reg (Chapter 7); the codegen() pattern scales naturally.
Summary
Chapter 3 wires your AST to real LLVM IR using IRBuilder, Module, and LLVMContext.
By giving every AST node a codegen() method, the compiler cleanly maps high‑level constructs to SSA form.
With the module verified and printed, you can now see exactly what your language compiles to; actually running it comes once the JIT is added.