Ten posts that rebuild the core idea behind every modern language model: that the meaning of a word can be captured by its position in a geometric space. The series starts from raw text and ends with a working Skip-Gram training loop, and each piece of machinery is motivated by the problem it solves. No leaps, no black boxes.
Each part builds on the last, so read them in order. The sidebar on each page tracks where you are.
The series starts from "a computer sees bytes, not meaning." Ten posts later, you've built every piece of the Skip-Gram algorithm and seen why each component exists. You've watched the core loop put the distributional hypothesis to work in real time: dot product, softmax, cross-entropy, gradient, update.
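As a reminder of how those five steps fit together, here is a minimal sketch of one pass through the loop. It assumes NumPy, a full softmax over a toy vocabulary, and plain SGD. The names `skipgram_step`, `W_in`, `W_out` and the learning rate are illustrative choices, not necessarily the ones used in the posts.

```python
import numpy as np

def skipgram_step(W_in, W_out, center, context, lr=0.1):
    """One full-softmax Skip-Gram update. Returns the loss before the update."""
    v = W_in[center].copy()                 # vector for the centre word
    scores = W_out @ v                      # dot product with every output vector
    scores -= scores.max()                  # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax
    loss = -np.log(probs[context])          # cross-entropy against the true context word

    grad_scores = probs.copy()
    grad_scores[context] -= 1.0             # gradient of the loss w.r.t. the scores
    grad_v = W_out.T @ grad_scores          # gradient w.r.t. the centre vector
    grad_W_out = np.outer(grad_scores, v)   # gradient w.r.t. the output vectors

    W_out -= lr * grad_W_out                # update (in place)
    W_in[center] -= lr * grad_v
    return loss

# Toy setup (assumed sizes): 5-word vocabulary, 3-dimensional vectors.
rng = np.random.default_rng(0)
V, D = 5, 3
W_in = rng.normal(scale=0.1, size=(V, D))
W_out = rng.normal(scale=0.1, size=(V, D))

# Repeat the same (centre, context) pair and watch the loss fall.
losses = [skipgram_step(W_in, W_out, center=1, context=3) for _ in range(100)]
print(losses[0], losses[-1])
```

Training on a single pair is only a sanity check: the loss should fall as the two vectors align. A real run would iterate over every (centre, context) pair drawn from the corpus.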
That same loop scales up. Widen the context window and you get the setup for transformers. Score with dot products between learned projections of the vectors, rather than between the raw vectors themselves, and you get attention. Stack many layers and you get the depth of modern LLMs. But the fundamental move is the one you've just seen: represent as vectors, score via geometry, normalise to probability, compare to truth, update.
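To make the attention comparison concrete, here is a rough sketch of single-head scaled dot-product attention, the standard form used in transformers. It is not something the series builds. The function names, matrix names and sizes are assumptions for illustration, and the projection matrices are random where a real model would learn them.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over a sequence of vectors X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # represent as (projected) vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # score via geometry
    weights = softmax(scores, axis=-1)         # normalise to probability
    return weights @ V, weights                # mix the values by those weights

# Toy input (assumed sizes): 4 tokens, 8-dimensional vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = attention(X, W_q, W_k, W_v)
print(out.shape, weights.sum(axis=1))
```

The first three steps of the Skip-Gram loop reappear unchanged in shape: vectors, a geometric score, a softmax. The comparison to truth and the update happen further downstream, once the model's final prediction is scored against the data.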
Everything modern NLP can do, it does because this loop, run hard enough, discovers structure nobody explicitly taught it.