Deep dives into papers, research ideas, and machine learning concepts
A deep dive into L1 and L2 regularization with four intuitive explanations: single-coefficient optimization, sparse vs. dense vectors, the geometric interpretation, and a probabilistic view with Gaussian and Laplacian priors. Understand why Lasso performs variable selection.
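As a quick taste of the Lasso-selects-variables point, here is a minimal sketch (not taken from the post) that fits scikit-learn's Lasso and Ridge on synthetic data where only a few features matter; the feature counts and alpha values are illustrative.

```python
# Minimal sketch (not from the post): compare how L1 (Lasso) and L2 (Ridge)
# regularization treat coefficients on a synthetic regression problem where
# only a few features are truly informative.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_samples, n_features, n_informative = 200, 20, 3

X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:n_informative] = [3.0, -2.0, 1.5]        # only 3 features matter
y = X @ true_coef + 0.1 * rng.normal(size=n_samples)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: alpha * ||w||_1
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty: alpha * ||w||_2^2

# L1 drives irrelevant coefficients exactly to zero (variable selection);
# L2 only shrinks them toward zero.
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))
```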
A comprehensive guide to tokenization in LLMs. Learn why UTF-8 encoding isn't enough, how Byte Pair Encoding (BPE) works with practical Python implementations, and why tokenization is the "problematically important" root of many LLM limitations.
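For context, the core BPE training loop is small enough to sketch in a few lines. The snippet below is a toy illustration, not the post's implementation: it starts from raw UTF-8 bytes and repeatedly merges the most frequent adjacent pair into a new token id.

```python
# Toy sketch of the core BPE training loop (not the post's code):
# repeatedly find the most frequent adjacent pair of token ids and merge it
# into a new id, starting from raw UTF-8 bytes.
from collections import Counter

def get_pair_counts(ids):
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "low lower lowest low low"
ids = list(text.encode("utf-8"))         # start from raw bytes (0..255)
merges = {}
num_merges = 10

for step in range(num_merges):
    counts = get_pair_counts(ids)
    if not counts:
        break
    pair = counts.most_common(1)[0][0]   # most frequent adjacent pair
    new_id = 256 + step                  # new token ids start after the byte range
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id

print(f"{len(text.encode('utf-8'))} bytes -> {len(ids)} tokens after {len(merges)} merges")
```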
A deep dive into S-LoRA's system design for efficiently serving thousands of LoRA adapters. Exploring unified paging, heterogeneous batching with custom CUDA kernels, and novel tensor parallelism strategies for multi-GPU inference.
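As a rough illustration of what heterogeneous batching has to compute, the sketch below applies per-request LoRA deltas on top of a shared base weight in plain PyTorch; it stands in for, and does not reproduce, S-LoRA's custom CUDA kernels, and all names and shapes are made up.

```python
# Simplified sketch (illustrative, not S-LoRA's kernels): the base weight is
# shared across the batch, while each request applies its own LoRA adapter
# delta (B_i @ A_i) x, selected by an adapter index.
import torch

d_model, rank, n_adapters, batch = 64, 8, 4, 6

W = torch.randn(d_model, d_model)                  # shared base weight
A = 0.01 * torch.randn(n_adapters, rank, d_model)  # per-adapter LoRA A matrices
B = 0.01 * torch.randn(n_adapters, d_model, rank)  # per-adapter LoRA B matrices

x = torch.randn(batch, d_model)                    # one token per request
adapter_ids = torch.tensor([0, 2, 2, 1, 3, 0])     # which adapter each request uses

base_out = x @ W.T                                 # one dense GEMM for the whole batch

# Gather each request's adapter and apply the low-rank update with bmm.
A_b = A[adapter_ids]                               # (batch, rank, d_model)
B_b = B[adapter_ids]                               # (batch, d_model, rank)
delta = torch.bmm(B_b, torch.bmm(A_b, x.unsqueeze(-1))).squeeze(-1)

y = base_out + delta
print(y.shape)                                     # (batch, d_model)
```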
What register_buffer and register_parameter do in PyTorch, how they differ, and when to use each.
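A small illustrative module (not from the post) showing the practical difference: both end up in the state_dict and move with the module, but only the parameter is trainable.

```python
# Minimal example: a buffer is saved in state_dict and moves with the module,
# but receives no gradients; a parameter is trained by the optimizer.
import torch
import torch.nn as nn

class RunningScale(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Trainable weight: shows up in module.parameters() and gets gradients.
        self.register_parameter("scale", nn.Parameter(torch.ones(dim)))
        # Non-trainable state: saved/loaded and moved with .to()/.cuda(),
        # but excluded from the optimizer.
        self.register_buffer("running_mean", torch.zeros(dim))

    def forward(self, x):
        if self.training:
            self.running_mean = 0.9 * self.running_mean + 0.1 * x.mean(dim=0)
        return self.scale * (x - self.running_mean)

m = RunningScale(4)
print([name for name, _ in m.named_parameters()])  # ['scale']
print([name for name, _ in m.named_buffers()])     # ['running_mean']
print(m.state_dict().keys())                       # both appear here
```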
A comprehensive deep dive into neural network pruning fundamentals. From Optimal Brain Damage to SparseGPT, exploring the mathematics, algorithms, and practical techniques for compressing LLMs by 2-4x while maintaining performance.
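For orientation, the simplest baseline in this space is one-shot magnitude pruning; the sketch below prunes a single Linear layer to a target sparsity. It is not OBD or SparseGPT, which use second-order and reconstruction-aware criteria.

```python
# Minimal sketch of the simplest pruning baseline: one-shot layer-wise
# magnitude pruning to a target sparsity (not OBD or SparseGPT).
import torch
import torch.nn as nn

def magnitude_prune_(linear: nn.Linear, sparsity: float = 0.5) -> None:
    """Zero out the smallest-magnitude weights of a Linear layer in place."""
    w = linear.weight.data
    k = int(w.numel() * sparsity)              # number of weights to remove
    if k == 0:
        return
    threshold = w.abs().flatten().kthvalue(k).values
    mask = w.abs() > threshold
    w.mul_(mask)                               # keep only weights above the threshold

layer = nn.Linear(512, 512)
magnitude_prune_(layer, sparsity=0.5)
density = (layer.weight != 0).float().mean().item()
print(f"remaining density: {density:.2f}")     # ~0.50
```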
Our SpookyBench reveals a fundamental limitation: while humans achieve 98% accuracy on temporal patterns, state-of-the-art VLMs score 0%. Here's why and what it means for video understanding.
GBLM-Pruner shows that gradients can guide pruning decisions without expensive weight updates. We explore the theory and practice behind this approach and its implications for LLM compression.
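A hedged sketch of the general idea, gradient-informed saliency without weight updates, is below. The exact scoring and normalization are the paper's; this snippet only shows the overall shape: accumulate gradient magnitudes on calibration data, combine them with weight magnitudes, and prune the lowest scores.

```python
# Hedged sketch of a gradient-informed pruning score in the spirit of
# gradient-based pruning: no weight updates, only gradient statistics from
# a few calibration batches. The real method's exact formula differs.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(32, 32)
loss_fn = nn.MSELoss()

# Accumulate |grad| over a handful of calibration samples (no optimizer step).
grad_accum = torch.zeros_like(model.weight)
for _ in range(8):
    x = torch.randn(16, 32)
    y = torch.randn(16, 32)
    model.zero_grad()
    loss_fn(model(x), y).backward()
    grad_accum += model.weight.grad.abs()

# Saliency: weight magnitude scaled by accumulated gradient magnitude.
score = model.weight.data.abs() * grad_accum

# One-shot prune: drop the 50% of weights with the lowest saliency.
k = score.numel() // 2
threshold = score.flatten().kthvalue(k).values
mask = score > threshold
model.weight.data.mul_(mask)
print(f"density after pruning: {(model.weight != 0).float().mean():.2f}")
```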
Reflections from implementing heterogeneous inference for Hymba and Jamba on AMD Ryzen AI SoCs. How hardware constraints shape algorithmic decisions and vice versa.
How we unified LoRA, VPT, Adapters, and SSF under a single mathematical framework and used evolutionary search to find optimal per-layer structures with zero inference overhead.
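To illustrate why zero inference overhead is possible at all, the sketch below (not the post's actual formulation) shows that LoRA and SSF updates both fold back into a plain weight and bias after tuning, so the deployed layer is still a single matmul.

```python
# Illustrative sketch (not the post's formulation): LoRA and SSF can both be
# merged back into a plain (W, b) linear layer after tuning, which is what
# makes zero inference overhead possible.
import torch

torch.manual_seed(0)
d_in, d_out, rank = 16, 16, 4
W = torch.randn(d_out, d_in)                  # frozen base weight
b = torch.randn(d_out)                        # frozen base bias
x = torch.randn(d_in)

# LoRA: low-rank additive update, merged as W' = W + B @ A.
A = 0.01 * torch.randn(rank, d_in)
B = 0.01 * torch.randn(d_out, rank)
y_lora_runtime = W @ x + B @ (A @ x) + b      # base path plus low-rank bypass
y_lora_merged = (W + B @ A) @ x + b           # single matmul after merging
assert torch.allclose(y_lora_runtime, y_lora_merged, atol=1e-5)

# SSF: per-channel scale-and-shift, merged as W' = diag(gamma) W, b' = gamma*b + beta.
gamma = 1.0 + 0.01 * torch.randn(d_out)
beta = 0.01 * torch.randn(d_out)
y_ssf_runtime = gamma * (W @ x + b) + beta
y_ssf_merged = (gamma.unsqueeze(1) * W) @ x + (gamma * b + beta)
assert torch.allclose(y_ssf_runtime, y_ssf_merged, atol=1e-5)

print("LoRA and SSF both collapse to a single linear layer at inference time")
```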