Kaleidoscope Ch 1: Lexer
Chapter 1: Introduction and the Lexer
Chapter 1 introduces the language features and the overall plan for building the compiler.
Language Features
Kaleidoscope supports:
- Floating-point numbers (double)
- Variables
- Arithmetic (+ - * /)
- Functions and function calls
It is intentionally minimal but expressive enough to test compiler techniques.
First Step: The Lexer
The very first coding step is the lexer, which translates raw
characters into tokens.
A token is just a categorized unit like an identifier, number, or
keyword.
Example snippet:
// Excerpt of the token enum
enum Token {
  // Token for identifiers like variable names
  tok_identifier = -4,
  // Token for numeric literals
  tok_number = -5,
};
👉 These constants tell the compiler how to label pieces of text.
If the lexer sees foo, it returns tok_identifier. If it sees 42.0,
it returns tok_number.
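To make the labeling concrete, here is a minimal sketch that classifies a single piece of text by its first character. The helper name classifyLexeme is hypothetical, and the values for tok_eof, tok_def, and tok_extern are assumptions (only tok_identifier and tok_number are given above). Returning the raw character code for anything unrecognized is likewise an assumed convention. Keywords are deliberately ignored here.

```cpp
#include <cctype>
#include <string>

// Named tokens are negative so they cannot collide with the non-negative
// character codes returned for unrecognized characters such as '+'.
enum Token {
  tok_eof = -1,         // assumed value
  tok_def = -2,         // assumed value
  tok_extern = -3,      // assumed value
  tok_identifier = -4,
  tok_number = -5,
};

// Hypothetical helper: label one whitespace-free piece of text by looking
// at its first character only.
int classifyLexeme(const std::string &lexeme) {
  if (lexeme.empty())
    return tok_eof;
  unsigned char first = static_cast<unsigned char>(lexeme[0]);
  if (std::isalpha(first))
    return tok_identifier;  // e.g. "foo"
  if (std::isdigit(first) || first == '.')
    return tok_number;      // e.g. "42.0"
  return first;             // e.g. "+" comes back as the character code of '+'
}
```

With this sketch, classifyLexeme("foo") yields tok_identifier and classifyLexeme("42.0") yields tok_number, matching the description above.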
Handling Keywords
Inside the lexer, keywords like def and extern are recognized
specially:
if (IdentifierStr == "def") return tok_def;
if (IdentifierStr == "extern") return tok_extern;
👉 This means that when the user types def foo(x), the lexer treats def as the
function-definition keyword rather than as an ordinary identifier.
Numbers
The lexer also collects a run of digits and . characters into NumStr, then converts that string to a number:
NumVal = strtod(NumStr.c_str(), nullptr);
return tok_number;
👉 Here, the string of digits is converted into a floating-point value
(double).
The compiler can now distinguish between identifiers like x and
numbers like 3.14.
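One thing worth seeing in code is what happens with odd input such as 1.2.3. The sketch below assumes the input is held in a std::string; readNumber and isWellFormedNumber are hypothetical names. The second helper is an optional stricter check that is not part of the snippet above.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>

enum Token { tok_number = -5 };

// Hypothetical helper: collects digits and '.' starting at src[pos],
// converts them with strtod, and advances pos past the collected text.
int readNumber(const std::string &src, std::size_t &pos, double &NumVal) {
  std::string NumStr;
  while (pos < src.size() &&
         (std::isdigit(static_cast<unsigned char>(src[pos])) ||
          src[pos] == '.'))
    NumStr += src[pos++];
  // strtod converts the longest valid prefix and ignores the rest,
  // so "1.2.3" is consumed in full but converted as 1.2.
  NumVal = std::strtod(NumStr.c_str(), nullptr);
  return tok_number;
}

// Optional stricter check (an addition, not from the chapter): true only if
// strtod consumed every character of NumStr.
bool isWellFormedNumber(const std::string &NumStr) {
  if (NumStr.empty())
    return false;
  char *end = nullptr;
  std::strtod(NumStr.c_str(), &end);
  return end == NumStr.c_str() + NumStr.size();
}
```

Passing nullptr as the second argument of strtod, as the snippet above does, discards the information about where parsing stopped. Passing a real pointer is one way a lexer could detect and report malformed numbers.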
Key Takeaways
- Chapter 1 defines what the language is and how we’ll build it.
- The lexer is the first building block, responsible for turning raw input into tokens.
- These tokens are the foundation for the parser and AST in later chapters.