Deep dives into papers, research ideas, and machine learning concepts
A deep dive into L1 and L2 regularization with four intuitive explanations: single-coefficient optimization, sparse vs. dense vectors, the geometric interpretation, and a probabilistic view with Gaussian and Laplacian priors. Understand why Lasso performs variable selection.
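As a quick taste of the Lasso-selects-variables point, here is a minimal sketch (not taken from the post) that fits scikit-learn's Lasso and Ridge on synthetic data where only a few features matter; the feature counts and alpha values are illustrative.

```python
# Minimal sketch (not from the post): compare how L1 (Lasso) and L2 (Ridge)
# regularization treat coefficients on a synthetic regression problem where
# only a few features are truly informative.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_samples, n_features, n_informative = 200, 20, 3

X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:n_informative] = [3.0, -2.0, 1.5]        # only 3 features matter
y = X @ true_coef + 0.1 * rng.normal(size=n_samples)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: alpha * ||w||_1
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty: alpha * ||w||_2^2

# L1 drives irrelevant coefficients exactly to zero (variable selection);
# L2 only shrinks them toward zero.
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))
```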
A comprehensive guide to tokenization in LLMs. Learn why UTF-8 encoding isn't enough, how Byte Pair Encoding (BPE) works with practical Python implementations, and why tokenization is the "problematically important" root of many LLM limitations.
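For context, the core BPE training loop is small enough to sketch in a few lines. The snippet below is a toy illustration, not the post's implementation: it starts from raw UTF-8 bytes and repeatedly merges the most frequent adjacent pair into a new token id.

```python
# Toy sketch of the core BPE training loop (not the post's code):
# repeatedly find the most frequent adjacent pair of token ids and merge it
# into a new id, starting from raw UTF-8 bytes.
from collections import Counter

def get_pair_counts(ids):
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "low lower lowest low low"
ids = list(text.encode("utf-8"))         # start from raw bytes (0..255)
merges = {}
num_merges = 10

for step in range(num_merges):
    counts = get_pair_counts(ids)
    if not counts:
        break
    pair = counts.most_common(1)[0][0]   # most frequent adjacent pair
    new_id = 256 + step                  # new token ids start after the byte range
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id

print(f"{len(text.encode('utf-8'))} bytes -> {len(ids)} tokens after {len(merges)} merges")
```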
A deep dive into S-LoRA's system design for efficiently serving thousands of LoRA adapters. Exploring unified paging, heterogeneous batching with custom CUDA kernels, and novel tensor parallelism strategies for multi-GPU inference.
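As a rough illustration of what heterogeneous batching has to compute, the sketch below applies per-request LoRA deltas on top of a shared base weight in plain PyTorch; it stands in for, and does not reproduce, S-LoRA's custom CUDA kernels, and all names and shapes are made up.

```python
# Simplified sketch (illustrative, not S-LoRA's kernels): the base weight is
# shared across the batch, while each request applies its own LoRA adapter
# delta (B_i @ A_i) x, selected by an adapter index.
import torch

d_model, rank, n_adapters, batch = 64, 8, 4, 6

W = torch.randn(d_model, d_model)                  # shared base weight
A = 0.01 * torch.randn(n_adapters, rank, d_model)  # per-adapter LoRA A matrices
B = 0.01 * torch.randn(n_adapters, d_model, rank)  # per-adapter LoRA B matrices

x = torch.randn(batch, d_model)                    # one token per request
adapter_ids = torch.tensor([0, 2, 2, 1, 3, 0])     # which adapter each request uses

base_out = x @ W.T                                 # one dense GEMM for the whole batch

# Gather each request's adapter and apply the low-rank update with bmm.
A_b = A[adapter_ids]                               # (batch, rank, d_model)
B_b = B[adapter_ids]                               # (batch, d_model, rank)
delta = torch.bmm(B_b, torch.bmm(A_b, x.unsqueeze(-1))).squeeze(-1)

y = base_out + delta
print(y.shape)                                     # (batch, d_model)
```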
What register_buffer and register_parameter do in PyTorch, how they differ, and when to use each.
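A small illustrative module (not from the post) showing the practical difference: both end up in the state_dict and move with the module, but only the parameter is trainable.

```python
# Minimal example: a buffer is saved in state_dict and moves with the module,
# but receives no gradients; a parameter is trained by the optimizer.
import torch
import torch.nn as nn

class RunningScale(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Trainable weight: shows up in module.parameters() and gets gradients.
        self.register_parameter("scale", nn.Parameter(torch.ones(dim)))
        # Non-trainable state: saved/loaded and moved with .to()/.cuda(),
        # but excluded from the optimizer.
        self.register_buffer("running_mean", torch.zeros(dim))

    def forward(self, x):
        if self.training:
            self.running_mean = 0.9 * self.running_mean + 0.1 * x.mean(dim=0)
        return self.scale * (x - self.running_mean)

m = RunningScale(4)
print([name for name, _ in m.named_parameters()])  # ['scale']
print([name for name, _ in m.named_buffers()])     # ['running_mean']
print(m.state_dict().keys())                       # both appear here
```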
A comprehensive deep dive into neural network pruning fundamentals. From Optimal Brain Damage to SparseGPT, exploring the mathematics, algorithms, and practical techniques for compressing LLMs by 2-4x while maintaining performance.
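For orientation, the simplest baseline in this space is one-shot magnitude pruning; the sketch below prunes a single Linear layer to a target sparsity. It is not OBD or SparseGPT, which use second-order and reconstruction-aware criteria.

```python
# Minimal sketch of the simplest pruning baseline: one-shot layer-wise
# magnitude pruning to a target sparsity (not OBD or SparseGPT).
import torch
import torch.nn as nn

def magnitude_prune_(linear: nn.Linear, sparsity: float = 0.5) -> None:
    """Zero out the smallest-magnitude weights of a Linear layer in place."""
    w = linear.weight.data
    k = int(w.numel() * sparsity)              # number of weights to remove
    if k == 0:
        return
    threshold = w.abs().flatten().kthvalue(k).values
    mask = w.abs() > threshold
    w.mul_(mask)                               # keep only weights above the threshold

layer = nn.Linear(512, 512)
magnitude_prune_(layer, sparsity=0.5)
density = (layer.weight != 0).float().mean().item()
print(f"remaining density: {density:.2f}")     # ~0.50
```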
Our SpookyBench reveals a fundamental limitation: while humans achieve 98% accuracy on temporal patterns, state-of-the-art VLMs score 0%. Here's why and what it means for video understanding.
GBLM-Pruner shows that gradients can guide pruning decisions without expensive weight updates. We explore the theory and practice behind this approach and its implications for LLM compression.
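A hedged sketch of the general idea, gradient-informed saliency without weight updates, is below. The exact scoring and normalization are the paper's; this snippet only shows the overall shape: accumulate gradient magnitudes on calibration data, combine them with weight magnitudes, and prune the lowest scores.

```python
# Hedged sketch of a gradient-informed pruning score in the spirit of
# gradient-based pruning: no weight updates, only gradient statistics from
# a few calibration batches. The real method's exact formula differs.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(32, 32)
loss_fn = nn.MSELoss()

# Accumulate |grad| over a handful of calibration samples (no optimizer step).
grad_accum = torch.zeros_like(model.weight)
for _ in range(8):
    x = torch.randn(16, 32)
    y = torch.randn(16, 32)
    model.zero_grad()
    loss_fn(model(x), y).backward()
    grad_accum += model.weight.grad.abs()

# Saliency: weight magnitude scaled by accumulated gradient magnitude.
score = model.weight.data.abs() * grad_accum

# One-shot prune: drop the 50% of weights with the lowest saliency.
k = score.numel() // 2
threshold = score.flatten().kthvalue(k).values
mask = score > threshold
model.weight.data.mul_(mask)
print(f"density after pruning: {(model.weight != 0).float().mean():.2f}")
```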
Reflections from implementing heterogeneous inference for Hymba and Jamba on AMD Ryzen AI SoCs. How hardware constraints shape algorithmic decisions and vice versa.
How we unified LoRA, VPT, Adapters, and SSF under a single mathematical framework and used evolutionary search to find optimal per-layer structures with zero inference overhead.
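To illustrate why zero inference overhead is possible at all, the sketch below (not the post's actual formulation) shows that LoRA and SSF updates both fold back into a plain weight and bias after tuning, so the deployed layer is still a single matmul.

```python
# Illustrative sketch (not the post's formulation): LoRA and SSF can both be
# merged back into a plain (W, b) linear layer after tuning, which is what
# makes zero inference overhead possible.
import torch

torch.manual_seed(0)
d_in, d_out, rank = 16, 16, 4
W = torch.randn(d_out, d_in)                  # frozen base weight
b = torch.randn(d_out)                        # frozen base bias
x = torch.randn(d_in)

# LoRA: low-rank additive update, merged as W' = W + B @ A.
A = 0.01 * torch.randn(rank, d_in)
B = 0.01 * torch.randn(d_out, rank)
y_lora_runtime = W @ x + B @ (A @ x) + b      # base path plus low-rank bypass
y_lora_merged = (W + B @ A) @ x + b           # single matmul after merging
assert torch.allclose(y_lora_runtime, y_lora_merged, atol=1e-5)

# SSF: per-channel scale-and-shift, merged as W' = diag(gamma) W, b' = gamma*b + beta.
gamma = 1.0 + 0.01 * torch.randn(d_out)
beta = 0.01 * torch.randn(d_out)
y_ssf_runtime = gamma * (W @ x + b) + beta
y_ssf_merged = (gamma.unsqueeze(1) * W) @ x + (gamma * b + beta)
assert torch.allclose(y_ssf_runtime, y_ssf_merged, atol=1e-5)

print("LoRA and SSF both collapse to a single linear layer at inference time")
```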