Field Notes / LinkedIn

Writing on AI, LLMs, and systems

Short-form notes on the architectures, trade-offs, and economics behind production AI — mostly shared on LinkedIn.

LLM, Inference

How Gemma 4’s Built-In Draft Models Change Speculative Decoding

Gemma 4 ships Multi-Token Prediction and paired draft models as a first-class feature — removing the friction of finding, matching, and deploying separate drafters.

Read on LinkedIn

Systems Engineering, LLM

LLM Inferencing is a Systems Engineering Challenge

Why serving LLMs at scale is a split-brain problem: prefill is compute-bound, decoding is memory-bandwidth-bound, and the KV cache is the hidden enemy.

Read on LinkedIn

Transformers, Architecture

LLM Evolution: From Attention to Memory Management

A 2017–2026 arc from Transformer self-attention through FlashAttention, MLA, CSA/HCA, and sparse retrieval: attention is becoming memory management.

Read on LinkedIn

DeepSeek, Long Context

DeepSeek V4: Hierarchical Attention for Efficient Long-Context Retrieval

DeepSeek V4 tackles the economics of million-token context through CSA, HCA, and a token → compressed → global memory hierarchy.

Read on LinkedIn