Chapter 1: Introduction and the Lexer

Chapter 1 introduces the Kaleidoscope language, its features, and the overall plan for building the compiler.


Language Features

Kaleidoscope supports:

  • Floating-point numbers (double)
  • Variables
  • Arithmetic (+ - * /)
  • Functions and function calls

It is intentionally minimal but expressive enough to test compiler techniques.
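
To give a feel for the language, here is a tiny illustrative snippet that uses only the features above (the function name average is made up for this example):

# Define a function of two parameters (parameters are separated by spaces).
def average(x y)
  (x + y) * 0.5;

# Call it like an ordinary function.
average(4.0, 10.0);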


First Step: The Lexer

The very first coding step is the lexer, which translates raw characters into tokens.
A token is just a categorized unit like an identifier, number, or keyword.

Example snippet:

// Known token kinds get negative codes; any unknown character is
// returned as its ASCII value in [0-255].
enum Token {
  tok_eof = -1,
  tok_def = -2,        // the 'def' keyword
  tok_extern = -3,     // the 'extern' keyword
  tok_identifier = -4, // identifiers such as variable names
  tok_number = -5      // numeric literals
};

👉 These constants tell the lexer how to label each piece of text it reads.
If it sees foo, it returns tok_identifier; if it sees 42.0, it returns tok_number.
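
As a rough sketch of how these labels are consumed (this loop is not from the chapter; it assumes the Token enum above plus the IdentifierStr/NumVal globals and the gettok() routine outlined in the next two sections):

#include <cstdio>
#include <string>

extern std::string IdentifierStr; // set by the lexer when it returns tok_identifier
extern double NumVal;             // set by the lexer when it returns tok_number
int gettok();                     // the lexer routine built up in this chapter

int main() {
  // Pull tokens one at a time until the lexer signals end of input.
  while (true) {
    int Tok = gettok();
    if (Tok == tok_eof)
      break;
    if (Tok == tok_identifier)
      std::printf("identifier: %s\n", IdentifierStr.c_str());
    else if (Tok == tok_number)
      std::printf("number: %f\n", NumVal);
    else
      std::printf("token code: %d\n", Tok);
  }
  return 0;
}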


Handling Keywords

Inside the lexer, keywords like def and extern are recognized specially:

// Check the word just read (held in IdentifierStr) against the known keywords.
if (IdentifierStr == "def") return tok_def;
if (IdentifierStr == "extern") return tok_extern;

👉 This means when the user types def foo(x), the lexer doesn’t treat "def" as a normal identifier but as a function definition keyword.
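
For context, those two checks sit at the end of the identifier branch of the lexer's gettok() routine, which first accumulates the whole word into IdentifierStr. A sketch of that branch, assuming the tutorial's scheme of keeping the next unprocessed character in LastChar and reading ahead with getchar():

if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
  IdentifierStr = LastChar;
  while (isalnum((LastChar = getchar())))
    IdentifierStr += LastChar;

  if (IdentifierStr == "def") return tok_def;
  if (IdentifierStr == "extern") return tok_extern;
  return tok_identifier;
}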


Numbers

The lexer also collects sequences of digits (and the '.' character) as numbers:

// Convert the collected text (e.g. "42.0") into a double.
NumVal = strtod(NumStr.c_str(), nullptr);
return tok_number;

👉 Here, the collected text is converted into a floating-point value (double) and stored in NumVal.
The lexer can now distinguish identifiers like x from numbers like 3.14.
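
To put the conversion in context: inside gettok(), the number branch first gathers the characters of the literal into a local NumStr, then hands them to strtod. A sketch, assuming the same LastChar scheme and a global double NumVal that holds the value of the last number lexed:

if (isdigit(LastChar) || LastChar == '.') { // number: [0-9.]+
  std::string NumStr;
  do {
    NumStr += LastChar;
    LastChar = getchar();
  } while (isdigit(LastChar) || LastChar == '.');

  // strtod parses the collected text as a double (e.g. "42.0" -> 42.0).
  NumVal = strtod(NumStr.c_str(), nullptr);
  return tok_number;
}

Note that this check is deliberately naive: it will happily gobble up input like 1.23.45.67, leaving strtod to parse only the leading 1.23.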


Key Takeaways

  • Chapter 1 defines what the language is and how we’ll build it.
  • The lexer is the first building block, responsible for turning raw input into tokens.
  • These tokens are the foundation for the parser and AST in later chapters.
