Blog Series

Before Transformers

How embeddings learn meaning, from first principles.

Ten posts that rebuild the core idea behind every modern language model: that the meaning of a word can be captured by its position in a geometric space. Starting from raw text and ending with a working Skip-Gram training loop, each piece of machinery is motivated by the problem it solves. No leaps, no black boxes.

Complete · 10 parts

What this series is

A careful construction of word embeddings from scratch. Every equation is derived from a question you could have asked yourself. Every interactive is something you can play with to build intuition.

Who it's for

Readers who've heard of transformers and want to understand the machinery underneath. Engineers, students, curious people. Comfort with high-school algebra is enough. You don't need to already know any ML.

The question this series answers

How does a machine that only understands numbers come to know that "dog" and "cat" are similar, without anyone ever telling it so?

The ten posts

Each part builds on the last, so read them in order. The sidebar on each page tracks where you are.

Act I · Why text alone can't carry meaning

Part 1 · Representation

Text Was a Problem Long Before AI

From telegraph codes to Unicode. How text got turned into numbers at all, and why encoding solved storage but not meaning.

Why it matters: Every later idea rests on the fact that a number like 97 (ASCII "a") carries no semantic information. Language never actually got through the door.

Part 2 · Similarity

From Characters to Concepts

Geometry solved similarity for images. Can it work for text? Why one-hot vectors put every word equidistant from every other, and why that's useless for learning.

Why it matters: This post names the problem that the rest of the series solves. We need a space where geometry reflects meaning.
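
A minimal sketch of the one-hot problem, assuming a hypothetical four-word vocabulary (the words and helper names here are illustrative, not taken from the post). It checks that every pair of distinct words sits at the same distance, so the geometry says nothing about which words are related.

```python
import math

# Hypothetical toy vocabulary; any set of distinct words behaves the same way.
vocab = ["dog", "cat", "car", "banana"]

def one_hot(word):
    """Return a vector with a single 1 at the word's index."""
    return [1.0 if w == word else 0.0 for w in vocab]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Distance between every pair of distinct words.
distances = {
    (u, v): euclidean(one_hot(u), one_hot(v))
    for u in vocab for v in vocab if u < v
}
for pair, d in sorted(distances.items()):
    print(pair, round(d, 4))
```

Every distance comes out as the square root of 2 and every dot product between distinct words is 0, so "dog" is exactly as far from "cat" as it is from "banana".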

Act II · From usage patterns to a geometric space

Part 3 · Distributional hypothesis

From Words to Spaces

A tiny corpus shows how meaning emerges from counting who appears next to whom. The co-occurrence matrix beautifully captures structure, and immediately reveals why it can't scale.

Why it matters: This is the core idea. Words with similar contexts mean similar things. Everything that follows is about making this observation practical.
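
The counting idea can be sketched in a few lines. The corpus, the window size of 1, and the function names below are assumptions made for illustration; the post's own corpus may differ.

```python
from collections import Counter

# Hypothetical toy corpus, chosen so that "dog" and "cat" share contexts.
corpus = [
    "the dog chased the ball",
    "the cat chased the mouse",
]
WINDOW = 1  # assumed window size: one neighbour on each side

def cooccurrence(sentences, window):
    """Count (centre, context) pairs within `window` positions of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, centre in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(centre, tokens[j])] += 1
    return counts

counts = cooccurrence(corpus, WINDOW)
vocab = sorted({w for s in corpus for w in s.split()})

def row(word):
    """The word's row of the V x V table: its count against every context."""
    return [counts[(word, ctx)] for ctx in vocab]

print(vocab)
print("dog", row("dog"))
print("cat", row("cat"))
```

In this corpus "dog" and "cat" never appear in the same sentence, yet their rows are identical because both sit between "the" and "chased". The table also has `len(vocab) ** 2` cells, most of them zero, which is the scaling problem in miniature.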

Part 4 · Two matrices

From Counting to Learning

Replace the explicit V×V table with two compact matrices. Every word gets two roles: centre (the word being looked at) and context (the word being a neighbour). Factorisation, not memorisation.

Why it matters: The two-matrix split is what makes everything else possible. Scores become dot products. Updates become gradient descent on two separate tables.
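
A rough sketch of the two-matrix setup, assuming a hypothetical vocabulary, an embedding size `D` of 2, and made-up names `W_centre` and `W_context`. It illustrates the shape of the idea and is not the series' own code.

```python
import random

random.seed(0)

vocab = ["dog", "cat", "chased", "the"]  # hypothetical vocabulary
V, D = len(vocab), 2                      # D is an assumed embedding size
index = {w: i for i, w in enumerate(vocab)}

def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_centre = random_matrix(V, D)   # one row per word in its centre role
W_context = random_matrix(V, D)  # one row per word in its context role

def score(centre, context):
    """Compatibility of a (centre, context) pair: a dot product of two rows."""
    v = W_centre[index[centre]]
    u = W_context[index[context]]
    return sum(a * b for a, b in zip(v, u))

def param_counts(vocab_size, dim):
    """(parameters in the two matrices, cells in the explicit V x V table)."""
    return 2 * vocab_size * dim, vocab_size * vocab_size

print(score("dog", "chased"))
print(param_counts(50_000, 100))  # illustrative sizes, not from the post
```

With the illustrative sizes of 50,000 words and 100 dimensions, the two matrices hold 10 million numbers where the explicit table would need 2.5 billion cells.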

Act III · Scoring, then probability

Part 5 · Dot product

What the Dot Product Really Measures

The dot product as a compatibility signal between centre and context. Why similar words cluster even though they never directly interact. They just get pulled toward the same contexts.

Why it matters: The key pedagogical moment. Similarity is an emergent side effect, not a direct training goal. This reframes how every later result should be read.

Part 6 · Softmax

Why Raw Dot Products Aren't Enough

Scores are unbounded, sign-ambiguous, and isolated. Exponentiation fixes the sign. Dividing by Z forces contexts to compete for probability mass. Out falls softmax.

Why it matters: Softmax isn't a convention. It's the smallest transformation that turns raw scores into something that can be compared to real data.
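
The two steps, exponentiate and then divide by Z, fit in a short function. The scores below are hypothetical, and the max-subtraction is a standard numerical-stability trick added here; it leaves the result mathematically unchanged.

```python
import math

def softmax(scores):
    """Exponentiate (making every value positive), then divide by Z."""
    m = max(scores)                           # subtracting the max avoids overflow
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)                             # the normaliser
    return [e / Z for e in exps]

scores = [2.0, -1.0, 0.5]   # hypothetical raw dot products for three contexts
probs = softmax(scores)
print([round(p, 3) for p in probs])
print(sum(probs))
```

Negative scores still receive positive probability, the ordering of the scores is preserved, and raising one score necessarily takes probability mass away from the others.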

Part 7 · P_data vs P_model

Two Distributions, One Goal

The corpus has a true distribution. The model has beliefs. Training means aligning one with the other. The data distribution doesn't need to be built. It's implicit in how often pairs show up.

Why it matters: Frames training as distribution matching. Repetition becomes the sampling mechanism. Nothing needs to be counted explicitly.
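
To make "implicit in how often pairs show up" concrete, here is a small sketch that builds the data distribution explicitly, purely for illustration. The pair list and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical (centre, context) pairs, in the order a training loop might see them.
pairs = [("dog", "chased"), ("dog", "the"), ("dog", "chased"), ("dog", "barked")]

def empirical_distribution(pairs, centre):
    """P_data(context | centre), tabulated here only to show what it looks like."""
    counts = Counter(ctx for c, ctx in pairs if c == centre)
    total = sum(counts.values())
    return {ctx: n / total for ctx, n in counts.items()}

p_data = empirical_distribution(pairs, "dog")
print(p_data)
```

A training loop never needs this table. Simply iterating over `pairs` visits ("dog", "chased") twice as often as ("dog", "the"), so the more frequent pair contributes twice as many updates.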

Act IV · Measuring error, making it a signal

Part 8 · Cross-entropy

Why −log P?

Why 1−p, (1−p)², and 1/p all fail as error measures. −log p uniquely combines a mild penalty for near-right answers with an unbounded penalty for confidently wrong ones. Cross-entropy emerges as average surprise.

Why it matters: Demystifies the log. It's not a convention. Given what we need from a loss function, it's the only smooth choice that works.
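
A quick way to compare the four candidates is to evaluate them at a few probabilities. The probe values below are arbitrary choices for illustration, and the reading that follows is one interpretation rather than the post's full argument.

```python
import math

# The four candidate error measures named above, as functions of p.
candidates = {
    "1 - p":     lambda p: 1 - p,
    "(1 - p)^2": lambda p: (1 - p) ** 2,
    "1 / p":     lambda p: 1 / p,
    "-log p":    lambda p: -math.log(p),
}

# p is the probability the model gave to the correct answer.
for p in (0.99, 0.5, 0.01, 1e-6):
    row = "  ".join(f"{name}={f(p):.4g}" for name, f in candidates.items())
    print(f"p={p:<8g} {row}")
```

The first two measures never exceed 1, so a confidently wrong model is barely punished more than an unsure one. 1/p does blow up, but it still reports an error of 1 when the model is perfectly right. −log p is zero at p = 1 and grows without bound as p approaches 0.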

Part 9 · The gradient

Creating the Learning Signal

A loss value doesn't teach. A gradient does. Through the chain rule on softmax and cross-entropy, the learning signal for every score collapses to a remarkably simple expression: p_k − q_k.

Why it matters: The central payoff of the series. The beautiful cancellation between softmax and cross-entropy is why they're always paired, and it's why learning works even when the model is catastrophically wrong.
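
The p_k − q_k result can be checked numerically without doing the algebra. This sketch compares the formula with a finite-difference estimate of the gradient; the scores, the one-hot target, and the step size `eps` are assumptions chosen for the example.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]

def cross_entropy(scores, q):
    """Loss -sum_k q_k log p_k, with p = softmax(scores)."""
    p = softmax(scores)
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p) if qk > 0)

scores = [1.2, -0.3, 0.8]   # hypothetical raw scores
q = [0.0, 1.0, 0.0]         # one-hot target: the observed context is index 1

# The claimed gradient of the loss with respect to each score.
analytic = [pk - qk for pk, qk in zip(softmax(scores), q)]

# Central finite differences: nudge one score at a time and watch the loss.
eps = 1e-6
numeric = []
for k in range(len(scores)):
    up = list(scores)
    down = list(scores)
    up[k] += eps
    down[k] -= eps
    numeric.append((cross_entropy(up, q) - cross_entropy(down, q)) / (2 * eps))

print([round(g, 6) for g in analytic])
print([round(g, 6) for g in numeric])
```

The two lists agree to within the precision of the finite-difference estimate, and the entries sum to zero: whatever is pushed up on the observed context is pulled down from the others.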

Part 10 · Vectors moving

From Signal to Geometry

The gradient p_k − q_k becomes an actual shove on the embedding vectors. Each context vector slides along the direction of the centre vector. The centre moves as a weighted vote of all contexts. Across many examples, dog and cat end up close without ever meeting.

Why it matters: Closes the loop. Part 3 claimed meaning becomes geometry. Part 10 shows it actually happening, live, in a 2D visualisation you can watch.
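
One update step can be sketched as follows, assuming a full-softmax Skip-Gram step with plain gradient descent. The 2D vectors, the learning rate, and the function name `sgd_step` are all hypothetical choices for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 2D embeddings: one centre vector and three context vectors.
v = [0.1, -0.2]
U = [[0.3, 0.1], [-0.2, 0.4], [0.0, -0.3]]
target = 1   # index of the context actually observed next to the centre
LR = 0.5     # assumed learning rate

def sgd_step(v, U, target, lr):
    p = softmax([dot(u, v) for u in U])
    signal = [pk - (1.0 if k == target else 0.0) for k, pk in enumerate(p)]
    # Centre gradient: a signal-weighted sum ("vote") of all context vectors.
    grad_v = [sum(signal[k] * U[k][d] for k in range(len(U))) for d in range(len(v))]
    # Each context moves along the old centre vector, scaled by its own signal.
    new_U = [[u[d] - lr * signal[k] * v[d] for d in range(len(v))]
             for k, u in enumerate(U)]
    new_v = [v[d] - lr * grad_v[d] for d in range(len(v))]
    return new_v, new_U

p_before = softmax([dot(u, v) for u in U])[target]
for _ in range(20):
    v, U = sgd_step(v, U, target, LR)
p_after = softmax([dot(u, v) for u in U])[target]

print(round(p_before, 3), "->", round(p_after, 3))
```

Repeating the step raises the probability of the observed context: its vector and the centre vector are pulled toward each other, while the other contexts are pushed away in proportion to how much probability they wrongly held.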

What this series gave you

A working mental model of word embeddings, assembled from its own questions.

Starting from "a computer sees bytes, not meaning," ten posts later you've built every piece of the Skip-Gram algorithm. You've seen why each component exists. You've watched the core loop (dot product, softmax, cross-entropy, gradient, update) turn the distributional hypothesis into geometry in real time.

That same loop scales up. Widen the context window and you get the setup for transformers. Make the score computation smarter than a dot product and you get attention. Stack many layers and you get the depth of modern LLMs. But the fundamental move is the one you've just seen: represent as vectors, score via geometry, normalise to probability, compare to truth, update.

Everything modern NLP can do, it does because this loop, run hard enough, discovers structure nobody explicitly taught it.

If you're new here

Start with Part 1 and read in order. Each post takes 15–20 minutes. The interactives reward pausing and playing. There's no quiz; the goal is intuition you can feel.

Where to go after

The natural next step is the original Transformer paper, "Attention Is All You Need" (Vaswani et al., 2017), followed by a careful read of Jay Alammar's "The Illustrated Transformer". With the foundation you've built, the leap to self-attention is shorter than it looks.