Writing on AI, LLMs, and systems

Short-form notes on the architectures, trade-offs, and economics behind production AI — mostly shared on LinkedIn.

LLM, Inference

How Gemma 4’s Built-In Draft Models Change Speculative Decoding

Gemma 4 ships Multi-Token Prediction and paired draft models as a first-class feature — removing the friction of finding, matching, and deploying separate drafters.

Read on LinkedIn

Systems Engineering, LLM

LLM Inferencing is a Systems Engineering Challenge

Why serving LLMs at scale is a split-brain problem: prefill is compute-bound, decoding is memory-bandwidth-bound, and the KV cache is the hidden enemy.

Read on LinkedIn

Transformers, Architecture

LLM Evolution: From Attention to Memory Management

A 2017–2026 arc from Transformer self-attention through FlashAttention, MLA, CSA/HCA, and sparse retrieval, arguing that attention is becoming memory management.

Read on LinkedIn

DeepSeek, Long Context

DeepSeek V4: Hierarchical Attention for Efficient Long-Context Retrieval

DeepSeek V4 tackles the economics of million-token context through CSA, HCA, and a token → compressed → global memory hierarchy.

Read on LinkedIn