Before Transformers: How Embeddings Learn Meaning

From Signal to Geometry

How a gradient on a score becomes an actual movement in embedding space.

Where we stand

Last post gave us the cleanest result in the series: \(\partial L / \partial s_k = p_k - q_k\). The learning signal for every score. Positive for classes the model overestimated. Negative for the true class it underestimated. Simple to state, clean to compute.

But scores aren't the thing we actually store. The model's memory lives in two matrices, \(W\) and \(C\), that hold the centre and context embeddings. Scores are derived. When training says "move score \(s_k\) down by this much," something has to happen to the vectors that produced \(s_k\). This post is about what that something is.

A gradient on a score is a note to the vectors: reshape yourselves so this score comes out different next time.

By the end, the full Skip-Gram training loop will be concrete. Not as an algorithm to memorise, but as a geometric process: vectors nudging each other through thousands of interactions until the shape of the space starts to encode the shape of language.

One step of gradient descent on a score

Gradient descent has one rule. Take whatever quantity the loss depends on and subtract the loss's gradient with respect to it, scaled by a small step size \(\eta\):

\(s_k \leftarrow s_k - \eta \cdot \dfrac{\partial L}{\partial s_k}\)

Substitute our learning signal:

\(s_k \leftarrow s_k - \eta (p_k - q_k)\)

Read the two cases. For the true class \(y\), we have \(q_y = 1\) and \(p_y \le 1\), so \((p_y - 1) \le 0\) and the minus sign flips it to positive. The true score goes up by \(\eta(1 - p_y)\). Big underconfidence gives a big push.

For any wrong class \(k \ne y\), \(q_k = 0\), so the update is \(s_k \leftarrow s_k - \eta p_k\). The score goes down in proportion to how much probability the model wasted on that class. Unlikely classes barely move. Confidently-wrong classes get slapped hard.
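
To make the two cases concrete, here is a minimal sketch of one score-level update in plain Python. The four scores, the true-class index and the step size are made-up values for illustration, not anything from earlier in the series.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, -1.0]   # hypothetical scores s_k for four classes
true_class = 1                   # hypothetical index y of the true class
eta = 0.5                        # hypothetical step size

p = softmax(scores)
q = [1.0 if k == true_class else 0.0 for k in range(len(scores))]

# Learning signal p_k - q_k, then s_k <- s_k - eta * (p_k - q_k)
signal = [pk - qk for pk, qk in zip(p, q)]
new_scores = [s - eta * g for s, g in zip(scores, signal)]

for k, (s, g, s_new) in enumerate(zip(scores, signal, new_scores)):
    print(f"k={k}  p={p[k]:.3f}  signal={g:+.3f}  score {s:+.2f} -> {s_new:+.3f}")
```

The true class is the only score that rises. Among the wrong classes, the one holding the most probability falls furthest.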

Everything here is familiar from last post. The new question is different. Nothing in this update tells \(W\) or \(C\) what to do. Scores are imaginary. They only exist because we computed them. So we need to translate a desired change in a score into a change in the actual parameters.

The score is a dot product. That fact is the bridge.

Given a centre word, let \(W\) be its embedding (one row of the centre matrix; from here on, \(W\) means that single row, not the whole matrix). For each class \(k\), let \(C_k\) be the corresponding context vector (one row of the context matrix). The score is the dot product of those two:

\(s_k = W \cdot C_k^\top\)

So \(s_k\) isn't a stored number. It's the dot product of two \(d\)-dimensional vectors. If we want \(s_k\) to change by \(\Delta s_k = -\eta (p_k - q_k)\), we need to ask: which small moves to \(W\) and \(C_k\) would produce that change in the dot product?

Work out the linearised change. If \(W\) moves by \(\Delta W\) and \(C_k\) moves by \(\Delta C_k\), the score changes by:

\(\Delta s_k \approx \Delta W \cdot C_k^\top + W \cdot \Delta C_k^\top\)

Two things are worth noticing here. First, changing \(C_k\) only affects one score: \(s_k\) itself. The other context vectors have their own independent rows. Second, changing \(W\) affects every score at once, because \(W\) sits inside every dot product. That's the whole coupling. One centre vector, one shared chance to influence the entire distribution.
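
A quick numeric check of the linearised change, as a sketch: the vectors and the small moves below are arbitrary illustrative values. The exact change in the dot product equals the linearised change plus a second-order term \(\Delta W \cdot \Delta C_k^\top\), which is the part the approximation drops.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

W = [0.5, -1.0, 2.0]        # hypothetical centre vector
C_k = [1.5, 0.5, -0.5]      # hypothetical context vector
dW = [0.01, -0.02, 0.005]   # small hypothetical move of W
dC = [-0.01, 0.03, 0.02]    # small hypothetical move of C_k

W_moved = [w + d for w, d in zip(W, dW)]
C_moved = [c + d for c, d in zip(C_k, dC)]

exact = dot(W_moved, C_moved) - dot(W, C_k)   # true change in s_k
linear = dot(dW, C_k) + dot(W, dC)            # linearised change
dropped = dot(dW, dC)                         # second-order term

print(f"exact={exact:.6f}  linear={linear:.6f}  dropped={dropped:.6f}")
```

The smaller the moves, the smaller the dropped term becomes relative to the linear one, which is why the approximation is safe for a small step size.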

Updating the context vector

Isolate the context side first. Hold \(W\) fixed and ask: what \(\Delta C_k\) produces the desired \(\Delta s_k\)?

\(\Delta s_k \approx W \cdot \Delta C_k^\top\)

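The cleanest choice is \(\Delta C_k\) pointing along \(W\), scaled by the learning signal. This is exactly what the chain rule gives: \(\partial s_k / \partial C_k = W\), so \(\partial L / \partial C_k = (p_k - q_k) \cdot W\). The score then changes by \(-\eta (p_k - q_k) \lVert W \rVert^2\), which has the same sign as the change we wanted, scaled by the squared length of \(W\). One gradient-descent step is: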

\(\Delta C_k = -\eta (p_k - q_k) \cdot W\)

Which gives the update rule:

\(\boxed{C_k \leftarrow C_k - \eta (p_k - q_k) \cdot W}\)

Read the geometry. For the true class (\(q_k = 1\), so \(p_k - q_k\) is negative), the minus flips it: \(C_y\) moves toward \(W\). The centre word pulls its true context partner closer. For wrong classes (\(q_k = 0\), so \(p_k - q_k\) is positive), \(C_k\) moves away from \(W\). The centre pushes its false contexts out.

Every centre word, on every training example, pulls one context vector toward itself and shoves the rest away. The gap decides how hard.
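
The context-vector rule can be sketched in a few lines of plain Python. The centre vector, the three context vectors, the true-class index and the step size are all illustrative assumptions. Because \(W\) is held fixed here, each score changes by \(-\eta (p_k - q_k) \lVert W \rVert^2\), so the true score rises and every wrong score falls.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

W = [1.0, 0.5]                               # hypothetical centre vector
C = [[0.2, 0.9], [0.8, -0.1], [-0.5, 0.4]]   # hypothetical context vectors, one per class
true_class = 0                               # hypothetical true class y
eta = 0.1                                    # hypothetical step size

scores = [dot(W, c) for c in C]
p = softmax(scores)
q = [1.0 if k == true_class else 0.0 for k in range(len(C))]

# C_k <- C_k - eta * (p_k - q_k) * W, applied to every class k with W held fixed
C_new = [
    [c_i - eta * (p[k] - q[k]) * w_i for c_i, w_i in zip(C[k], W)]
    for k in range(len(C))
]
new_scores = [dot(W, c) for c in C_new]

for k in range(len(C)):
    print(f"k={k}  p={p[k]:.3f}  score {scores[k]:+.3f} -> {new_scores[k]:+.3f}")
```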

Interactive figure: One context vector, one update step

The orange vector is \(W\), the centre word's embedding. The blue vector is \(C_k\), one context word's embedding. Each update slides \(C_k\) along the direction of \(W\) by exactly \(-\eta (p_k - q_k) \cdot W\): a pull when \(k\) is the true class, a push when it is wrong. A dashed arrow shows the correction applied, and repeated updates make the vector migrate.

Updating the centre vector

Now the harder side. The centre vector \(W\) appears in every single score, so a change in \(W\) affects every \(s_k\) at once. We can't satisfy one score's desire in isolation. We have to balance every score's desired change simultaneously.

Fortunately the chain rule handles this gracefully. The total gradient of the loss with respect to \(W\) is the sum of contributions from every score:

\(\dfrac{\partial L}{\partial W} = \displaystyle\sum_{k=1}^{K} \dfrac{\partial L}{\partial s_k} \cdot \dfrac{\partial s_k}{\partial W} = \displaystyle\sum_{k=1}^{K} (p_k - q_k) \cdot C_k\)

Each score's gradient multiplies the context vector it was computed against. Plug into gradient descent:

\(\boxed{W \leftarrow W - \eta \displaystyle\sum_{k=1}^{K} (p_k - q_k) \cdot C_k}\)

The expression says something beautiful. The centre vector moves in a direction formed by combining every context vector, each weighted by its gap \((p_k - q_k)\). Wrong classes (positive gap, since \(q_k = 0\) and \(p_k > 0\)) push \(W\) away from their direction, because of the leading minus sign in gradient descent. The true class (negative gap, \(p_y - 1 < 0\)) pulls \(W\) toward \(C_y\). The result is a single movement: away from the context vectors the model was mistakenly drawn to, toward the one it should have recognised.

The centre vector's move is a vote. Every wrong context contributes a small nudge away. The true context contributes a large pull toward. The sum is the step.
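
One way to gain confidence in the gradient formula is a finite-difference check. The sketch below assumes the loss for a single example is the one-hot cross-entropy \(L = -\log p_y\), and compares the analytic gradient \(\sum_k (p_k - q_k) \cdot C_k\) with a numerical estimate. All vectors and constants are illustrative.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(W, C, y):
    """Cross-entropy for one example with a one-hot target: -log p_y."""
    p = softmax([dot(W, c) for c in C])
    return -math.log(p[y])

W = [1.0, 0.5]                               # hypothetical centre vector
C = [[0.2, 0.9], [0.8, -0.1], [-0.5, 0.4]]   # hypothetical context vectors
y = 0                                        # hypothetical true class
eta = 0.1                                    # hypothetical step size

p = softmax([dot(W, c) for c in C])
q = [1.0 if k == y else 0.0 for k in range(len(C))]

# Analytic gradient: sum over k of (p_k - q_k) * C_k
analytic = [
    sum((p[k] - q[k]) * C[k][i] for k in range(len(C)))
    for i in range(len(W))
]

# Numerical gradient by central differences, one coordinate at a time
eps = 1e-6
numeric = []
for i in range(len(W)):
    W_plus, W_minus = list(W), list(W)
    W_plus[i] += eps
    W_minus[i] -= eps
    numeric.append((loss(W_plus, C, y) - loss(W_minus, C, y)) / (2 * eps))

# One gradient-descent step on W alone (context vectors held fixed)
W_new = [w - eta * g for w, g in zip(W, analytic)]

print("analytic:", analytic)
print("numeric: ", numeric)
print("loss before:", loss(W, C, y), "after:", loss(W_new, C, y))
```

Because \(W\) is shared by every score, a small step on \(W\) lowers the loss overall, but an individual wrong score is not guaranteed to fall.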

Interactive figure: Every context casts a vote on where W should go

Orange is \(W\). The coloured arrows are four context vectors \(C_k\). The dashed grey arrow is the total correction \(\sum_k (p_k - q_k) \cdot C_k\), and each context contributes in proportion to its gap. A bar panel shows the individual contributions to \(\Delta W\): a large green bar for the true class, which pulls \(W\) toward it, and small red bars for the wrong classes, which push it away. The update moves \(W\) by \(-\eta\) times the correction, so the arrow being formed is the actual step \(W\) takes.

Both updates, running together

The two rules together describe one step of Skip-Gram training:

\(C_k \leftarrow C_k - \eta (p_k - q_k) \cdot W \qquad (\text{for every } k)\)

\(W \leftarrow W - \eta \displaystyle\sum_{k=1}^{K} (p_k - q_k) \cdot C_k\)

One training example does all of this. The centre gets one step, and all \(K\) context vectors get one step each, with both rules evaluated on the vectors as they were before the step. Run over many examples, and the updates accumulate. Pairs that occur often in the corpus get many reinforcing nudges. Pairs that never occur get none. Uneven repetition becomes uneven geometry.
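
Put together, one training example might be handled like this. The function name `train_step` and all the numbers are hypothetical. The sketch computes both updates from the pre-step values of \(W\) and \(C\), which is what a single gradient step on all parameters at once requires.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(W, C, y, eta):
    """One step for centre vector W, context rows C and true class y.

    Returns (new_W, new_C, loss_before_step). Inputs are left unmodified.
    """
    p = softmax([dot(W, c) for c in C])
    gaps = [p[k] - (1.0 if k == y else 0.0) for k in range(len(C))]

    # W <- W - eta * sum_k (p_k - q_k) * C_k, using the old C
    new_W = [
        W[i] - eta * sum(gaps[k] * C[k][i] for k in range(len(C)))
        for i in range(len(W))
    ]
    # C_k <- C_k - eta * (p_k - q_k) * W, using the old W
    new_C = [
        [C[k][i] - eta * gaps[k] * W[i] for i in range(len(W))]
        for k in range(len(C))
    ]
    return new_W, new_C, -math.log(p[y])

W = [1.0, 0.5]                               # hypothetical centre vector
C = [[0.2, 0.9], [0.8, -0.1], [-0.5, 0.4]]   # hypothetical context vectors
losses = []
for _ in range(5):                           # repeat the same example a few times
    W, C, step_loss = train_step(W, C, y=0, eta=0.1)
    losses.append(step_loss)

print([round(value, 4) for value in losses])
```

Repeating the same example is only a sanity check that the loss on it falls. Real training cycles through many different centre and context pairs.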

Remember what this looks like from Part 7. In the toy corpus "I love my dog / I love my cat / You love my dog / You love my cat", "dog" and "cat" never appear next to each other. Yet after enough training, their centre vectors end up close. Why? Because whenever "my" is the centre and "dog" is the true context, \(C_{\text{dog}}\) gets pulled toward \(W_{\text{my}}\). On a different sentence, \(C_{\text{cat}}\) gets pulled toward the same \(W_{\text{my}}\). The centre side mirrors this: when "dog" or "cat" is the centre, the only neighbour in a window of 1 is "my", so \(W_{\text{dog}}\) and \(W_{\text{cat}}\) are both pulled toward \(C_{\text{my}}\). Two vectors, both pulled toward the same third, end up near each other. The distributional hypothesis, in action.

Interactive figure: Let the whole corpus run. Watch dog and cat find each other.

Corpus: I love my dog · I love my cat · You love my dog · You love my cat
(window = 1, so each centre looks at its immediate neighbours on either side)

Six words from our toy corpus, embedded in 2D for visibility. At epoch zero everything is random. Two panels track cosine similarity between centre vectors as training runs. Within-group similarity (dog↔cat, I↔You) covers words that share contexts, and should climb toward +1 as those shared contexts pull them together. Cross-group similarity covers words from different roles, and should stay low or even go negative as different syntactic roles push words apart. In 2D the space is too cramped for everything to be mutually orthogonal, so unrelated words end up actively anti-aligned. In the higher dimensions real embeddings use, they'd simply be near-perpendicular. Either way, dog and cat end up close, and they got there without ever co-occurring.
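
The whole loop fits in a short script. What follows is a minimal sketch, not the code behind the figures in this series: the step size, epoch count, initialisation scale and random seed are all arbitrary choices, and it uses the full softmax over the six-word vocabulary.

```python
import math
import random

corpus = ["I love my dog", "I love my cat", "You love my dog", "You love my cat"]
window = 1
dim = 2        # 2D, matching the figure
eta = 0.1      # hypothetical step size
epochs = 300   # hypothetical number of passes over the corpus

sentences = [s.split() for s in corpus]
vocab = sorted({w for sent in sentences for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Every (centre, context) index pair that falls inside the window
pairs = []
for sent in sentences:
    for i, centre in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                pairs.append((idx[centre], idx[sent[j]]))

rng = random.Random(0)   # fixed seed, arbitrary
W = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in vocab]   # centre vectors
C = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in vocab]   # context vectors

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

losses = []
for epoch in range(epochs):
    total = 0.0
    for c, y in pairs:
        w_old = W[c][:]
        p = softmax([dot(w_old, C[k]) for k in range(len(vocab))])
        total -= math.log(p[y])
        grad_w = [0.0] * dim
        for k in range(len(vocab)):
            gap = p[k] - (1.0 if k == y else 0.0)
            for i in range(dim):
                grad_w[i] += gap * C[k][i]       # read C_k before it is updated
                C[k][i] -= eta * gap * w_old[i]  # C_k <- C_k - eta * gap * W
        W[c] = [w - eta * g for w, g in zip(w_old, grad_w)]
    losses.append(total / len(pairs))

sim_dog_cat = cosine(W[idx["dog"]], W[idx["cat"]])
sim_dog_my = cosine(W[idx["dog"]], W[idx["my"]])
print(f"loss {losses[0]:.3f} -> {losses[-1]:.3f}")
print(f"dog~cat {sim_dog_cat:+.3f}   dog~my {sim_dog_my:+.3f}")
```

If the argument above holds, `sim_dog_cat` should come out close to +1, even though "dog" and "cat" never appear in the same window. The average loss cannot reach zero on this corpus, because "love" and "my" each have more than one valid neighbour.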

What we've actually built

Step back and look at the whole arc.

Ten posts ago, we had text. Sequences of characters with no numerical structure. We needed a way to turn them into something a machine could learn from.

Tokenisation cut the text into units. One-hot encoding gave each unit a numerical address. But addresses don't carry meaning. So we moved from identity vectors to embedding vectors, where geometric relationships could stand in for semantic ones.

Then came the question of how to choose those embedding vectors. The distributional hypothesis said: a word is defined by its company. We turned this into a concrete objective. Two vectors per word (centre and context). A dot product between them measured compatibility. Softmax turned compatibility into probability. Cross-entropy measured disagreement between the model's probabilities and the corpus. The gradient of that cross-entropy reduced, beautifully, to \(p_k - q_k\). And now we've seen that gradient become an actual shove on the vectors themselves.

Everything that's fancy in modern language models grows out of this loop. Transformers are more expressive, attention is more flexible, scale is bigger. But the fundamental move is the same: represent things as vectors, compute a score via some geometric operation, normalise to a probability, compare to the truth, and update.

We started with text as a stranger. We end with geometry that knows what words mean because it remembers what they saw.

Where we go from here

This series ends here, but the story doesn't. Skip-Gram embeddings were the entry point. Once you have a way of training geometric representations from text, the path opens up in every direction.

Context windows widen, and attention mechanisms emerge to handle them. Single-layer models stack into deep networks. Position information gets folded in. Training data scales from toy corpora to the entire internet. The core loop stays the same: dot products, softmax, cross-entropy, gradient, update. Everything later is an elaboration.

If you've followed all ten posts, you've seen every one of the load-bearing ideas. From here, the rest is engineering, scale, and the beautiful discovery that this basic loop, run hard enough, learns things nobody explicitly taught it.

Thanks for reading.
