Kaleidoscope Ch 1: Lexer

Chapter 1: Introduction and the Lexer

Chapter 1 introduces the language features and the overall plan for building the compiler.

Language Features

Kaleidoscope supports:

Floating-point numbers (double, the only data type)
Variables
Arithmetic (+ - * /)
Functions and function calls

It is intentionally minimal but expressive enough to test compiler techniques.

First Step: The Lexer

The first coding step is the lexer, which groups raw input characters into tokens.
A token is a categorized unit of text, such as an identifier, a number, or a keyword.

Example snippet:

enum Token {
  tok_eof = -1,
  tok_def = -2,
  tok_extern = -3,
  // Token for identifiers like variable names
  tok_identifier = -4,
  // Token for numeric literals
  tok_number = -5,
};

👉 These constants are the labels the lexer attaches to pieces of text.
If the lexer sees foo, it returns tok_identifier and keeps the name in IdentifierStr. If it sees 42.0, it returns tok_number and keeps the value in NumVal.
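
As a rough illustration of this labeling, the sketch below classifies a single, already-isolated piece of text by its first character. classifyLexeme is a hypothetical helper written for this note, not a function from the chapter, and returning any other character as its own character code is an assumption of the sketch (it is why the token constants are negative). Keywords are ignored here, so def would still come back as tok_identifier.

```cpp
#include <cctype>
#include <string>

// Token codes. Negative values leave 0..255 free for single characters
// such as '(' or '+', which this sketch returns as their own code.
enum Token {
  tok_eof = -1,
  tok_def = -2,
  tok_extern = -3,
  tok_identifier = -4,
  tok_number = -5,
};

// Hypothetical helper: label one lexeme by looking at its first character.
int classifyLexeme(const std::string &lexeme) {
  if (lexeme.empty())
    return tok_eof;
  unsigned char first = static_cast<unsigned char>(lexeme[0]);
  if (std::isalpha(first))
    return tok_identifier; // e.g. foo
  if (std::isdigit(first) || first == '.')
    return tok_number; // e.g. 42.0
  return first; // anything else, e.g. '+', is returned as its character code
}
```

A real lexer makes this decision while it reads the input character by character, but the decision itself is the same.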

Handling Keywords

Once a complete identifier has been read into IdentifierStr, the lexer checks whether it is one of the keywords def or extern:

if (IdentifierStr == "def") return tok_def;
if (IdentifierStr == "extern") return tok_extern;

👉 So when the user types def foo(x), the lexer returns tok_def for def rather than tok_identifier, which lets the parser recognize the start of a function definition.
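
The keyword check can be sketched in context as follows. This is a minimal version under stated assumptions: lexIdentifier is a hypothetical helper, and it reads from a std::string with an explicit position so it can be run in isolation; the chapter's actual input handling is not shown in these notes. Only the tokens needed for the example are declared.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

enum Token { tok_def = -2, tok_extern = -3, tok_identifier = -4 };

// Hypothetical helper: reads a run of letters and digits from src starting
// at pos, stores the text in IdentifierStr, and advances pos past it.
// Assumes src[pos] is a letter.
int lexIdentifier(const std::string &src, std::size_t &pos,
                  std::string &IdentifierStr) {
  IdentifierStr.clear();
  while (pos < src.size() &&
         std::isalnum(static_cast<unsigned char>(src[pos])))
    IdentifierStr += src[pos++];

  // Keywords are checked only after the whole word has been read,
  // so "define" is an identifier, not "def" followed by "ine".
  if (IdentifierStr == "def")
    return tok_def;
  if (IdentifierStr == "extern")
    return tok_extern;
  return tok_identifier;
}
```

Comparing the finished word against a short list of keywords keeps the scanning loop simple, at the cost of one string comparison per keyword.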

Numbers

The lexer also collects a run of digits and . characters into NumStr and converts it to a number:

NumVal = strtod(NumStr.c_str(), nullptr);
return tok_number;

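👉 Here, the collected text is converted into a floating-point value (double). No validation happens at this step: an input such as 1.2.3 is collected whole, and strtod converts only the leading 1.2.
The compiler can now distinguish between identifiers like x and numbers like 3.14.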
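
For reference, here is one way the number path might look as a standalone function. lexNumber is a hypothetical helper, and reading from a std::string with an explicit position is an assumption made so the snippet can be tested on its own; NumStr and NumVal mirror the names used above.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>

enum Token { tok_number = -5 };

// Hypothetical helper: collects a run of digits and '.' from src starting
// at pos, converts it with strtod, stores the result in NumVal, and
// advances pos past the collected characters.
int lexNumber(const std::string &src, std::size_t &pos, double &NumVal) {
  std::string NumStr;
  while (pos < src.size() &&
         (std::isdigit(static_cast<unsigned char>(src[pos])) ||
          src[pos] == '.'))
    NumStr += src[pos++];

  // strtod parses the longest valid numeric prefix. Passing nullptr discards
  // the end pointer, so leftover characters are not reported.
  NumVal = std::strtod(NumStr.c_str(), nullptr);
  return tok_number;
}
```

A stricter variant could pass a real end pointer to strtod and report an error when it does not reach the end of NumStr.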

Key Takeaways

Chapter 1 defines what the language is and how we’ll build it.
The lexer is the first building block, responsible for turning raw input into tokens.
These tokens are the foundation for the parser and AST in later chapters.