minseo choi · cs @ johns hopkins · mlsys researcher

I make ML models run fast.

GPU kernels · ML compilers · LLM inference systems

github ↗ linkedin ↗ resume.pdf ↗

minseo@jhu — zsh

minseo@jhu:~$ whoami

Minseo Choi — CS @ Johns Hopkins · MLSys researcher

minseo@jhu:~$ tail -n 3 research.log

[run] JHU Medicine · six-model FP8 pipeline on 2×B200 → 2.89× decode

[run] JHU DSAI · suffix decoding → 1.43× throughput

[ok ] 119 posts · 9 projects · FlashAttention-2 from scratch


$ scroll --down experience · projects · writing · contact

$ cat experience.log

may 2026 — now

Research Assistant · JHU Data Science & AI Institute — BLAB

Suffix decoding on a decode-bound data pipeline — 1.43× throughput

Profiled decode-bound rewriting workloads in a large-scale data-generation pipeline, found a high copy fraction (outputs largely repeat token spans already in context), and deployed draft-free suffix decoding for 1.43× throughput with no extra training or VRAM.

suffix decoding · speculative decoding · GPU profiling
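
To make the idea concrete, here is a minimal sketch of draft-free proposal by suffix matching. It is not the deployed system: the function names `propose_draft` and `accept_prefix` and the parameters `max_match` and `max_draft` are hypothetical, and production suffix decoding typically uses a suffix tree with frequency statistics rather than this linear scan. The sketch only shows why a high copy fraction helps: when recent tokens repeat an earlier span, the tokens that followed that span are a cheap draft for the target model to verify.

```python
def propose_draft(tokens, max_match=8, max_draft=4):
    """Propose draft tokens by matching the current suffix against earlier context."""
    n = len(tokens)
    # Try the longest suffix first so the proposed continuation is most specific.
    for m in range(min(max_match, n - 1), 0, -1):
        suffix = tokens[n - m:]
        # Scan right to left so the most recent earlier occurrence wins.
        for start in range(n - m - 1, -1, -1):
            if tokens[start:start + m] == suffix:
                # Tokens that followed the earlier occurrence become the draft.
                return tokens[start + m:start + m + max_draft]
    return []  # no repeated suffix: fall back to ordinary decoding


def accept_prefix(draft, verified):
    """Keep the leading draft tokens that agree with the target model's tokens."""
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    return accepted


context = [1, 2, 3, 4, 5, 9, 2, 3]        # the suffix [2, 3] appeared earlier
draft = propose_draft(context)             # continuation after the earlier [2, 3]
kept = accept_prefix(draft, [4, 5, 7, 7])  # pretend the target model disagrees at step 3
```

No draft model is involved, which is consistent with the "no extra training or VRAM" claim: the only state is the token history itself.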

feb 2026 — now

Research Assistant · Johns Hopkins Medicine — PALS Lab

Six-model FP8 safety pipeline on 2×B200 — 2.89× decode throughput

A 70B generator and five critic agents served with TensorRT-LLM and Triton Inference Server. Fixed orchestration-bound GPU underutilization via generator isolation, async BLS (Triton's Business Logic Scripting), and KV-cache block reuse, reaching ~3,100 conversations/hour at ~3.3 s P95 latency. Now extending to heterogeneous SGLang + TensorRT-LLM serving. In collaboration with NVIDIA Safety and JHTV; ICML 2027 submission planned.

TensorRT-LLM · Triton Inference Server · SGLang · FP8 · B200

jan 2026 — now

Course Assistant · JHU — Computer Systems Fundamentals

Cache behavior, assembly, and performance debugging in C/C++

oct 2023 — apr 2025

Drill Instructor & Senior Squad Leader · Republic of Korea Army

Trained and mentored 2,000+ recruits

$ ls projects/ --featured

flash-attention FlashAttention-2 from scratch in Triton/CUDA — ~2× faster, ~5× less peak memory vs naive attention

Reimplementation of FlashAttention-2 with tiled online-softmax kernels in both Triton and CUDA, eliminating O(N²) score materialization — ~2× speedup and ~5× peak-memory reduction over naive PyTorch attention at 4K sequence length.

src ↗ notes →
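
The tiled online-softmax trick mentioned above can be sketched in plain Python for a single query vector. This is an illustration under simplifying assumptions, not the project's Triton or CUDA kernel: the names `naive_attention` and `tiled_attention` and the `tile` parameter are hypothetical, and real kernels process blocks of queries in on-chip memory. The point is that a running maximum and running sum let each key/value tile be folded in without ever holding the full score row.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Materialize every score, then apply softmax and the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    """Process keys/values tile by tile, keeping only running statistics."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [dot(q, k) * scale for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first tile).
        correction = math.exp(running_max - new_max)
        probs = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(probs)
        acc = [a * correction + sum(p * v[j] for p, v in zip(probs, v_tile))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
reference = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, tile=2)
```

Both functions should agree up to floating-point rounding; the tiled version only ever holds `tile` scores at a time, which is where the memory saving comes from.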

medical-triton Fused Triton kernels for CT/MRI pipelines — ~9.7× faster than unfused PyTorch ops

GPU-accelerated CT/MRI enhancement pipeline: DICOM preprocessing, DL denoising, and post-processing. The pre/post chains (dtype conversion, windowing, normalization, clipping) run as fused Triton kernels — ~9.7× speedup by eliminating per-op kernel launches and intermediate HBM round-trips.

src ↗
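
As a rough illustration of what fusing the pre/post chain means, the sketch below contrasts an op-by-op version with a single-pass version in plain Python. It is not the project's Triton code: `unfused_window`, `fused_window`, and the `center`/`width` window parameters are hypothetical names chosen for this example. On a GPU, each intermediate list in the unfused version would correspond to a separate kernel launch plus a write and read of global memory.

```python
def unfused_window(pixels, center, width):
    """One pass (and one intermediate buffer) per operation."""
    lower = center - width / 2
    as_float = [float(p) for p in pixels]              # dtype conversion
    shifted = [p - lower for p in as_float]            # windowing
    normalized = [p / width for p in shifted]          # normalization
    return [min(1.0, max(0.0, p)) for p in normalized]  # clipping


def fused_window(pixels, center, width):
    """Same arithmetic in a single pass with no intermediate buffers."""
    lower = center - width / 2
    return [min(1.0, max(0.0, (float(p) - lower) / width)) for p in pixels]


raw = [-1000, 0, 40, 80, 400]   # example intensity values
out = fused_window(raw, center=40, width=80)
```

The two functions perform the same operations in the same order per element, so their outputs match exactly; only the number of passes over the data differs.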

kaleidoscope A small functional language, from lexer to LLVM JIT

A functional language built stage by stage on LLVM — lexer, parser, AST, IR codegen, optimization passes, and JIT execution — extended with IR/CFG visualization tooling as a foundation for ML compiler work.

src ↗ notes →

toy-ml-compiler End-to-end MLIR pipeline — IR parsing, pass execution, toolchain integration

An experimental ML compiler built on MLIR — parsing tensor programs into IR, running custom passes, and integrating the full toolchain, applying lessons from the Kaleidoscope build.

src ↗ notes →

all 9 projects →

$ tail writing.log

2026-03 FlashAttention-4: When Tensor Cores Got Too Fast for Everything Else #hardware

2026-02 Spark, Cerebras, and the Future of Low-Latency AI Inference #hardware

2026-02 MLIR Is Not Just Another IR #compiler

2026-01 vLLM and PagedAttention: Why KV Cache Management Matters #inference

mlsys index (30) → · ml index (89) → · archive →

$ nvidia-smi --query-focus

gpu: CUDA · Triton · Mojo · memory hierarchy · kernel fusion
serving: TensorRT-LLM · vLLM · SGLang · Triton Inference Server · KV-cache reuse
compilers: MLIR · LLVM · TableGen · SSA/CFG · optimization passes
infra: C/C++ · Python · PyTorch · SLURM · Prometheus/Grafana · Linux

$ ping minseo

Looking for MLSys internships and research opportunities in GPU performance, ML compilers, or inference infrastructure. If that's your team, reach out.

linkedin ↗ github ↗ resume.pdf ↗