Note on References
This blog series is based on the official MLC.ai book, which provides a comprehensive introduction to machine learning compilation.
The content here is not an original textbook — rather, it is my personal study notes and summaries of the MLC.ai chapters.
I restructured the material with additional explanations, code highlights, and commentary to make it easier for myself (and hopefully others) to follow.
If you want the authoritative source, please check the original MLC.ai book.

Chapter 1: Introduction to Machine Learning Compilation (MLC)

This chapter defines machine learning compilation (MLC) and lays out the end‑to‑end path from a model’s development form (framework code + weights) to its deployment form (an optimized module that runs efficiently on a specific device). It also motivates why compilers are needed for ML and sketches the stack we will build in later chapters.


1. What is ML Compilation?

Definition. Machine learning compilation (MLC) is the process of transforming and optimizing model execution from its development form (e.g., PyTorch/JAX/TF graphs and weights) to its deployment form (a lean, hardware‑specific artifact that runs efficiently).

  • Development form: model code written in a DL framework + trained parameters.
  • Deployment form: an artifact (library/binary/IR module) optimized for a given target (CPU, GPU, mobile SoC, browser WebGPU, microcontroller).
  • The goal is to minimize latency and memory footprint while maximizing portability and making full use of available accelerators.

2. Why do we need ML compilers?

  • Heterogeneous hardware: CPUs, NVIDIA/AMD/Intel GPUs, Apple Silicon, NPUs, phone DSPs, browsers (WebGPU). One model, many targets.
  • Performance portability: avoid rewriting kernels for each backend; let the compiler map high‑level ops to efficient code.
  • Optimization space is huge: operator fusion, layout transforms, tiling, vectorization, threading, memory planning, quantization, etc. A compiler can search this space systematically.
  • Universal deployment: the same stack can place models on devices ranging from mobile phones to browsers (via WebGPU).

3. The MLC / TVM stack at a glance

A typical pipeline (names vary by project) looks like this:

  1. Frontends: ingest models from PyTorch, TensorFlow, JAX (via export).
  2. Graph‑level IR (e.g., Relay / Relax): apply graph rewrites like fusion, constant folding, shape specialization.
  3. Tensor‑level IR (TIR): lower ops to loop nests and schedules (tiling, unrolling, vectorization, thread/block binding); a tiny TVMScript sketch follows this list.
  4. Auto‑tuning / Meta‑Schedule: search scheduling choices to find the best kernel variants.
  5. Runtime & Deployment: generate target‑specific modules and run with a lightweight runtime on CPU/GPU/accelerators; or bundle for WebGPU/mobile.
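
To make step 3 a bit more concrete, here is a tiny, hedged preview of what tensor‑level IR looks like when written in TVMScript. The exact syntax varies across TVM versions, and the element‑wise add below is just an illustrative kernel, not something taken from a real model:

import tvm
from tvm.script import tir as T

@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def vector_add(A: T.Buffer((1024,), "float32"),
                   B: T.Buffer((1024,), "float32"),
                   C: T.Buffer((1024,), "float32")):
        # Explicit loops over elements; the named block carries the
        # iteration-domain information that later scheduling passes use.
        for i in range(1024):
            with T.block("C"):
                vi = T.axis.spatial(1024, i)
                C[vi] = A[vi] + B[vi]

The point: at this level the computation is explicit loops and buffers, which is exactly the granularity at which tiling, vectorization, and thread binding (step 3) and auto‑tuning (step 4) operate.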

4. A mental model: development → deployment

Think of compilation as progressive refinement:

Framework Model (PyTorch/JAX)
        │   export
        ▼
 High‑Level Graph IR   (fuse ops, fold consts, layout/shape transforms)
        │   lower
        ▼
 Tensor IR (loops + schedules)  (tile, vectorize, parallelize, memory plan)
        │   tune / codegen
        ▼
 Target Artifact (CPU/GPU/WebGPU/Mobile) + Minimal Runtime

Each stage narrows choices and introduces concrete decisions (layouts, tiling sizes, thread/block mapping) until we have efficient, runnable code on the target.
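
As a tiny illustration of the very first arrow (“export”), here is what capturing a PyTorch model’s development form can look like with torch.export in PyTorch 2.x. SimpleNet and the input shape are made up purely for illustration:

import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(128, 10)

    def forward(self, x):
        return torch.relu(self.linear(x))

# torch.export captures a framework-level graph plus the trained weights;
# this "development form" is what an MLC frontend later ingests.
example_inputs = (torch.randn(1, 128),)
exported = torch.export.export(SimpleNet().eval(), example_inputs)
print(exported)  # an ExportedProgram: graph + parameters

Everything after this point (graph IR, tensor IR, codegen, runtime) happens inside the compiler stack.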


5. Tiny, practical code glimpses

Even in the intro, it’s helpful to see what “end‑to‑end” feels like. Here are illustrative, TVM‑style snippets of the kind you’ll encounter in later chapters and tools.

5.1 Compile a traced model (sketch)

# Given: a framework-exported model (e.g., via torch.export or ONNX)
import tvm
from tvm import relax

# 1) Import the model graph into Relax (pseudo-API for illustration;
#    the real importers live in TVM's frontend modules and vary by version)
mod = import_from_pytorch_or_onnx("model.pt")

# 2) Graph-level transforms (operator fusion, constant folding, ...)
mod = relax.transform.FuseOps()(mod)
mod = relax.transform.FoldConstant()(mod)

# 3) Lower to TIR and build for a target (here: CUDA)
ex = relax.build(mod, target="cuda")

# 4) Save or deploy ex (a runtime module)
ex.export_library("model_cuda.so")

What this shows: a model comes in, goes through graph passes, then builds into a deployable module for a backend. (APIs differ across versions—the point is the shape of the pipeline.)
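
To close the loop, here is an equally hedged sketch of the deployment side: loading the exported library and running it through the Relax virtual machine. The file name, the entry‑function name ("main"), and the input shape are assumptions carried over from the sketch above, and the runtime API also differs across versions:

import numpy as np
import tvm
from tvm import relax

# Load the compiled artifact and wrap it in the Relax virtual machine
dev = tvm.cuda(0)
lib = tvm.runtime.load_module("model_cuda.so")
vm = relax.VirtualMachine(lib, dev)

# Place the input on the target device and call the entry function
x = tvm.nd.array(np.random.rand(1, 3, 224, 224).astype("float32"), dev)
out = vm["main"](x)
print(out.numpy().shape)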


6. Optimization themes you will see throughout the book

  • Operator fusion: cut intermediate memory traffic and launch overhead.
  • Layout & shape specialization: choose NHWC vs NCHW, block layouts, static shapes.
  • Loop scheduling: tile, unroll, vectorize, bind to threads/warps/cores (a small schedule sketch follows this list).
  • Memory planning: reuse buffers, pre‑allocate workspace, reduce peak memory.
  • Quantization: reduce precision (e.g., int8, fp16) for better throughput with acceptable accuracy.
  • Auto‑tuning / Meta‑Schedule: search “good” schedules per op/subgraph.
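
To give loop scheduling a concrete face, here is a hedged sketch using TVM’s te/tir APIs on a toy element‑wise add. The kernel and the split factor 128 are arbitrary choices for illustration; real schedules are usually found by auto‑tuning:

import tvm
from tvm import te

# Describe the computation once (a toy element-wise add) ...
A = te.placeholder((1024,), name="A")
B = te.placeholder((1024,), name="B")
C = te.compute((1024,), lambda i: A[i] + B[i], name="C")

# ... then transform the loop nest without changing what it computes.
sch = tvm.tir.Schedule(te.create_prim_func([A, B, C]))
(i,) = sch.get_loops(sch.get_block("C"))
i_outer, i_inner = sch.split(i, factors=[None, 128])  # tiling
sch.parallel(i_outer)    # map the outer loop to CPU threads
sch.vectorize(i_inner)   # map the inner loop to SIMD lanes
print(sch.mod.script())  # inspect the transformed TIR

Auto‑tuning (Meta‑Schedule) essentially automates exactly these kinds of choices, searching over split factors and bindings per operator or subgraph.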

7. Universal deployment: CPU, GPU, Mobile, Web

MLC’s philosophy is Python‑first development + universal deployment—compile once, run everywhere: servers, laptops, phones, and browsers. Examples:

  • Apache TVM compiles pre‑trained models to deployable modules across backends.
  • MLC LLM / WebLLM bring LLMs to mobile and WebGPU with the same toolchain ideas.

8. Takeaways

  • MLC is about turning framework models into optimized, device‑ready artifacts.
  • The stack spans graph IR → tensor IR → codegen → runtime.
  • Compilers help us navigate a vast optimization space and target heterogeneous hardware without hand‑crafting kernels for each device.
  • The same ideas power modern projects like TVM, MLC‑LLM, and WebLLM, enabling on‑device and in‑browser AI.

Summary

This introduction frames the rest of the book: starting from model graphs in popular frameworks, we lower them to IRs, transform and tune them, and deploy the result on any target—from servers to WebGPU in browsers. Subsequent chapters dive into each layer and its optimizations.
