Chapter 3: Generating LLVM IR

In Chapter 3, Kaleidoscope moves from parsed structure (the AST) to executable form by emitting LLVM IR.


1. Big picture: how codegen works

  • Each AST node implements a codegen() method:
    • ExprAST nodes return an llvm::Value* (the SSA value that holds the result).
    • PrototypeAST returns an llvm::Function* (the IR-level function declaration).
    • FunctionAST returns an llvm::Function* (the full definition after emitting the body).
  • We keep global-ish LLVM singletons:
    • LLVMContext – owns core LLVM data structures
    • Module – container for all the IR we emit (functions, globals, etc.)
    • IRBuilder<> – a helper that creates instructions and inserts them at the current insertion point of a basic block
  • We maintain a symbol table (map) from variable name → Value* to resolve identifiers inside a function body.

Everything is a double in Kaleidoscope (for now), so types are uniform: Type::getDoubleTy(*TheContext). The dereference is needed because TheContext is held in a std::unique_ptr.
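
To make the codegen() pattern concrete without linking LLVM, here is a minimal sketch in which each node appends textual instructions to a buffer and returns the operand text of its result. Emitter, Expr, Num, Var, and Bin are hypothetical stand-ins for IRBuilder and the AST classes; only '+' and '*' are modeled, and constants are passed in pre-formatted rather than printed the way LLVM would.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for IRBuilder: collects textual instructions.
struct Emitter {
  std::vector<std::string> Lines;
};

// Every expression node returns the operand text of its result.
struct Expr {
  virtual ~Expr() = default;
  virtual std::string codegen(Emitter &E) const = 0;
};

// A literal emits no instruction; it is used directly as an operand.
struct Num : Expr {
  std::string Text; // pre-formatted constant, e.g. "2.000000e+00"
  explicit Num(std::string T) : Text(std::move(T)) {}
  std::string codegen(Emitter &) const override { return Text; }
};

// A variable reference resolves to an already-named value.
struct Var : Expr {
  std::string Name;
  explicit Var(std::string N) : Name(std::move(N)) {}
  std::string codegen(Emitter &) const override { return "%" + Name; }
};

// A binary node emits its operands first, then exactly one instruction.
// Unlike LLVM, this sketch does not uniquify repeated result names.
struct Bin : Expr {
  char Op;
  std::unique_ptr<Expr> LHS, RHS;
  Bin(char O, std::unique_ptr<Expr> L, std::unique_ptr<Expr> R)
      : Op(O), LHS(std::move(L)), RHS(std::move(R)) {
    assert((Op == '+' || Op == '*') && LHS && RHS);
  }
  std::string codegen(Emitter &E) const override {
    std::string L = LHS->codegen(E);
    std::string R = RHS->codegen(E);
    std::string Inst = (Op == '+') ? "fadd" : "fmul";
    std::string Dest = (Op == '+') ? "%addtmp" : "%multmp";
    E.Lines.push_back(Dest + " = " + Inst + " double " + L + ", " + R);
    return Dest;
  }
};

// Builds x + y*2 and returns the emitted instructions in order.
std::vector<std::string> emitExample() {
  auto Mul = std::make_unique<Bin>('*', std::make_unique<Var>("y"),
                                   std::make_unique<Num>("2.000000e+00"));
  Bin Add('+', std::make_unique<Var>("x"), std::move(Mul));
  Emitter E;
  Add.codegen(E);
  return E.Lines;
}
```

Because operands are generated before the instruction that uses them, the multiply is emitted ahead of the add even though the add is the root of the tree.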


2. LLVM objects we need

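// Core headers used by the snippets in this chapter
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"
#include <map>
#include <memory>
#include <string>

static std::unique_ptr<llvm::LLVMContext> TheContext;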
static std::unique_ptr<llvm::IRBuilder<>> Builder;
static std::unique_ptr<llvm::Module> TheModule;
static std::map<std::string, llvm::Value*> NamedValues;

static void InitializeModule() {
  TheContext = std::make_unique<llvm::LLVMContext>();
  TheModule  = std::make_unique<llvm::Module>("my cool jit", *TheContext);
  Builder    = std::make_unique<llvm::IRBuilder<>>(*TheContext);
  NamedValues.clear();
}
  • TheModule is the IR container (think: a translation unit).
  • Builder inserts each new instruction at its current insertion point, which SetInsertPoint(BB) places at the end of a BasicBlock.

3. Codegen for expressions

3.1 Number literal

llvm::Value *NumberExprAST::codegen() {
  return llvm::ConstantFP::get(*TheContext, llvm::APFloat(Val));
}

This returns a floating-point constant of type double. No instructions are emitted; it’s a constant value.

3.2 Variable reference

llvm::Value *VariableExprAST::codegen() {
  auto *V = NamedValues[Name];
  if (!V) return LogErrorV("Unknown variable name");
  return V; // function argument or previously bound value
}

Variables resolve through the current function’s symbol table. In Chapter 3, variables are just function arguments, so NamedValues is populated from the function’s prototype when we enter the function body.
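
One detail of the lookup above is worth seeing in isolation: std::map::operator[] default-inserts a null entry when the name is missing, so every failed lookup grows the table. The sketch below contrasts it with a find()-based lookup. Value is a hypothetical empty stand-in for llvm::Value, and both helper names are made up for illustration.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for llvm::Value; only its address matters here.
struct Value {};

using SymbolTable = std::map<std::string, Value *>;

// Lookup with operator[]: a miss default-inserts a nullptr entry.
Value *lookupInserting(SymbolTable &Table, const std::string &Name) {
  return Table[Name];
}

// Lookup with find(): a miss leaves the table untouched.
Value *lookupReadOnly(const SymbolTable &Table, const std::string &Name) {
  auto It = Table.find(Name);
  return It == Table.end() ? nullptr : It->second;
}
```

Both forms report an unknown name as nullptr, so the operator[] version is harmless in this chapter because the table is cleared at the start of every function. The find() form may be preferable once the table outlives a single function.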

3.3 Binary operations

llvm::Value *BinaryExprAST::codegen() {
  llvm::Value *L = LHS->codegen();
  llvm::Value *R = RHS->codegen();
  if (!L || !R) return nullptr;

  switch (Op) {
    case '+': return Builder->CreateFAdd(L, R, "addtmp");
    case '-': return Builder->CreateFSub(L, R, "subtmp");
    case '*': return Builder->CreateFMul(L, R, "multmp");
    case '<': {
      // 'fcmp ult' (unordered or less than) yields i1; convert to double 0.0/1.0 to stay in double world
      llvm::Value *Cmp = Builder->CreateFCmpULT(L, R, "cmptmp");
      return Builder->CreateUIToFP(Cmp, llvm::Type::getDoubleTy(*TheContext), "booltmp");
    }
    default:
      return LogErrorV("invalid binary operator");
  }
}
  • All arithmetic is in floating point (double); any other operator character falls through to the error case.
  • The comparison produces a one-bit boolean (i1), which uitofp converts to 0.0 or 1.0 so the language keeps a single type.
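
The choice of the unordered predicate matters once NaN is involved. The following sketch models the two IR operations in plain C++ so the behavior can be checked directly; the function names are hypothetical, and fcmpOLT is included only for contrast since the codegen above does not use it.

```cpp
#include <cmath>

// Models 'fcmp ult': true if either operand is NaN or L < R.
bool fcmpULT(double L, double R) {
  return std::isnan(L) || std::isnan(R) || L < R;
}

// Models 'fcmp olt': true only if neither operand is NaN and L < R.
bool fcmpOLT(double L, double R) { return L < R; }

// Models 'uitofp' from i1 to double: false becomes 0.0, true becomes 1.0.
double boolToDouble(bool B) { return B ? 1.0 : 0.0; }

// The value Kaleidoscope's '<' produces under the codegen in this section.
double kaleidoscopeLess(double L, double R) {
  return boolToDouble(fcmpULT(L, R));
}
```

With this model, a comparison against NaN evaluates to 1.0, whereas the ordered predicate would give 0.0.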

3.4 Function call

llvm::Value *CallExprAST::codegen() {
  llvm::Function *CalleeF = TheModule->getFunction(Callee);
  if (!CalleeF) return LogErrorV("Unknown function referenced");

  if (CalleeF->arg_size() != Args.size())
    return LogErrorV("Incorrect # arguments passed");

  std::vector<llvm::Value*> ArgsV;
  ArgsV.reserve(Args.size());
  for (auto &Arg : Args) {
    auto *V = Arg->codegen();
    if (!V) return nullptr;
    ArgsV.push_back(V);
  }

  return Builder->CreateCall(CalleeF, ArgsV, "calltmp");
}

We look up the callee by name in the module, codegen each argument, and emit a call instruction.
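
The two checks that precede the call can be sketched on their own. Here the module is reduced to a hypothetical name-to-arity map, and checkCall is an illustrative helper rather than anything LLVM provides; the diagnostics are the same strings used in the code above.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for the module's function list: name -> parameter count.
using FunctionTable = std::map<std::string, std::size_t>;

// Returns an empty string if the call is well-formed, else the diagnostic.
std::string checkCall(const FunctionTable &Module, const std::string &Callee,
                      std::size_t NumArgs) {
  auto It = Module.find(Callee);
  if (It == Module.end())
    return "Unknown function referenced";
  if (It->second != NumArgs)
    return "Incorrect # arguments passed";
  return "";
}
```

Since every value is a double, comparing counts is all the type checking a call needs at this stage.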


4. Prototypes and function definitions

4.1 Prototype → llvm::Function*

llvm::Function *PrototypeAST::codegen() {
  std::vector<llvm::Type*> Doubles(Args.size(),
                                   llvm::Type::getDoubleTy(*TheContext));
  auto *FT = llvm::FunctionType::get(llvm::Type::getDoubleTy(*TheContext),
                                     Doubles, /*isVarArg=*/false);
  auto *F = llvm::Function::Create(FT, llvm::Function::ExternalLinkage,
                                   Name, TheModule.get());

  // Name the arguments
  unsigned Idx = 0;
  for (auto &Arg : F->args())
    Arg.setName(Args[Idx++]);

  return F;
}
  • Return type and all parameters are double.
  • We name each IR argument to match source names (handy for debugging and mapping to NamedValues).

4.2 Function definition → create block, bind args, emit body

llvm::Function *FunctionAST::codegen() {
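  // 1) Reuse an earlier 'extern' declaration if present, else create one
  llvm::Function *TheFunction = TheModule->getFunction(Proto->getName());
  if (!TheFunction) TheFunction = Proto->codegen();
  if (!TheFunction) return nullptr;
  // Reject a second body for a function that is already defined
  if (!TheFunction->empty()) {
    LogErrorV("Function cannot be redefined.");
    return nullptr;
  }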

  // 2) Create a basic block and set insertion point
  llvm::BasicBlock *BB = llvm::BasicBlock::Create(*TheContext, "entry", TheFunction);
  Builder->SetInsertPoint(BB);

  // 3) Bind arguments into the NamedValues table
  NamedValues.clear();
  for (auto &Arg : TheFunction->args())
    NamedValues[std::string(Arg.getName())] = &Arg;

  // 4) Emit the body
  if (llvm::Value *RetVal = Body->codegen()) {
    Builder->CreateRet(RetVal);
    // 5) Validate the generated code for consistency
    llvm::verifyFunction(*TheFunction);
    return TheFunction;
  }

  // 6) Error reading body: remove function
  TheFunction->eraseFromParent();
  return nullptr;
}
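  • We always create an "entry" block and insert code there.
  • On success we emit a ret double instruction; on failure we erase the partially built function so the module stays clean.
  • Caveat: when an earlier extern already declared the function, that declaration's argument names are the ones bound in NamedValues, so a body written against different parameter names fails with "Unknown variable name".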

Note: Optimization passes come in the next chapter in the original tutorial. Here we just ensure the IR verifies.


5. Example: IR you’ll see

Source (Kaleidoscope)

def add(x y) x + y*2

High‑level IR shape (simplified)

define double @add(double %x, double %y) {
entry:
  %multmp = fmul double %y, 2.000000e+00
  %addtmp = fadd double %x, %multmp
  ret double %addtmp
}
  • %x and %y are the named IR arguments.
  • Temporary names (%multmp, %addtmp) come from the IRBuilder hints we provided.
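
Reading the IR is easier with a line-by-line equivalent next to it. The function below is a hand-written C++ mirror of the three instructions, not something LLVM generates; each local corresponds to one SSA value.

```cpp
// Straight-line C++ mirroring the IR for 'def add(x y) x + y*2'.
double add(double x, double y) {
  double multmp = y * 2.0;    // %multmp = fmul double %y, 2.000000e+00
  double addtmp = x + multmp; // %addtmp = fadd double %x, %multmp
  return addtmp;              // ret double %addtmp
}
```

For example, add(3.0, 4.0) computes 4.0 * 2.0 = 8.0 and then 3.0 + 8.0 = 11.0.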

6. Symbol resolution & scoping

  • In Chapter 3, variables are parameters only; there are no let‑style locals yet.
  • The current function’s symbol table is just NamedValues, populated from the prototype.
  • Later chapters add local variables and the “alloca + mem2reg” pattern.

7. Utility: verifying and dumping the module

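  • After generating functions, we can print the entire module:
    TheModule->print(llvm::errs(), nullptr);

  • Always run:
    llvm::verifyFunction(*F);
    to catch malformed IR early. verifyFunction returns true when it finds a problem, and passing &llvm::errs() as its second argument prints the diagnostics.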


8. Putting it together (minimal driver idea)

  • Parse input with the driver from Chapter 2.
  • On def: codegen the FunctionAST.
  • On extern: codegen the PrototypeAST (declaration only).
  • On top‑level expression: codegen the anonymous wrapper function and print the IR.
  • Keep accumulating into TheModule so functions can call each other.
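
The dispatch in the list above can be sketched as a small classifier. This is a simplification under stated assumptions: the real driver switches on lexer tokens, whereas this hypothetical classifyTopLevel helper looks only at the first whitespace-delimited word of a line.

```cpp
#include <sstream>
#include <string>

enum class TopLevelKind { Definition, Extern, Expression, Empty };

// Classifies a line by its first whitespace-delimited token.
TopLevelKind classifyTopLevel(const std::string &Line) {
  std::istringstream In(Line);
  std::string First;
  if (!(In >> First))
    return TopLevelKind::Empty; // blank or whitespace-only input
  if (First == "def")
    return TopLevelKind::Definition; // codegen the FunctionAST
  if (First == "extern")
    return TopLevelKind::Extern; // codegen the PrototypeAST only
  return TopLevelKind::Expression; // wrap in an anonymous function
}
```

Each kind then maps to one of the codegen entry points described earlier in this chapter.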

9. Common pitfalls in codegen

  • Mismatched types: everything must be double → make sure calls and operators all use DoubleTy.
  • Unknown variable: if it’s not in NamedValues, you likely forgot to bind a parameter.
  • Unterminated block: every basic block in a function must end with a terminator (ret, br, etc.).
  • No verification: skipping verifyFunction hides bugs until later.
  • Forgetting to initialize the Module/Builder before codegen.
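
The unterminated-block rule is the one the verifier most often reports, so here is a minimal sketch of the check on a hypothetical textual model of a block. It recognizes only ret and br, although LLVM has other terminators, and it is no substitute for llvm::verifyFunction.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical textual model: a block is a list of instruction strings.
using Block = std::vector<std::string>;

// True if the instruction text starts with a modeled terminator opcode.
bool isTerminator(const std::string &Inst) {
  return Inst.rfind("ret ", 0) == 0 || Inst.rfind("br ", 0) == 0;
}

// A block passes if it is non-empty, ends in a terminator,
// and has no terminator before its last instruction.
bool blockIsTerminated(const Block &BB) {
  if (BB.empty())
    return false;
  for (std::size_t I = 0; I + 1 < BB.size(); ++I)
    if (isTerminator(BB[I]))
      return false;
  return isTerminator(BB.back());
}
```

In FunctionAST::codegen the rule is satisfied by the single CreateRet call that follows the body.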

10. Where we go next

  • Chapter 4 adds optimization passes and JIT execution.
  • Later chapters add conditionals and loops (Chapter 5), user‑defined operators (Chapter 6), and mutable variables via alloca + mem2reg (Chapter 7); the codegen() pattern scales naturally.

Summary

Chapter 3 wires your AST to real LLVM IR using IRBuilder, Module, and LLVMContext.
By giving every AST node a codegen() method, the compiler cleanly maps high‑level constructs to SSA form.
With the verifier and module printing in place, you can now “see” what your language lowers to; actually executing it comes with the JIT in Chapter 4.
