Before Transformers: How Embeddings Learn Meaning

Creating the Learning Signal

From a single scalar of surprise to a precise push on every score.

A number is not enough

Last post we landed on a loss: \(L = -\log p_y\). It says how surprised the model should be that the truth was \(y\), given how little probability it assigned to that outcome. Beautiful quantity. Mathematically clean. Practically useless on its own.

Say the loss on some example is 3.2. The model is wrong. Fine. But what does the model do with that number? Which knob should it turn? Should every parameter move a little? Move a lot? In which direction? The scalar 3.2 has no address on it.

A loss measures pain. Learning requires knowing where the pain is coming from.

Learning needs blame. The model has to trace the loss back through its own computations and figure out which internal quantities were responsible. Once it knows, it can nudge them in the direction that hurts less.

Where the blame actually goes

Inside the model, probabilities are not free-standing variables. They come from scores, via softmax:

\(p_j = \dfrac{e^{s_j}}{\sum_m e^{s_m}}\)

And scores come from embeddings, via dot products. The computation chains:

embeddings → scores \(s_k\) → probabilities \(p_j\) → loss \(L\)

The model doesn't control \(p_j\) directly. It controls \(s_k\). So the useful question isn't "how does \(L\) react when \(p_j\) changes?" but "how does \(L\) react when \(s_k\) changes?" Since scores affect loss only through probabilities, the chain rule tells us to add up every path:

\(\dfrac{\partial L}{\partial s_k} = \displaystyle\sum_{j=1}^{K} \dfrac{\partial L}{\partial p_j} \cdot \dfrac{\partial p_j}{\partial s_k}\)

The sum isn't bureaucratic. Every probability depends on every score, because the softmax denominator is shared. Change one score and you've changed the entire distribution. That's what the sum is counting.
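A quick numerical check makes the coupling concrete. The sketch below is a minimal pure-Python softmax; the four score values and the size of the bump are arbitrary illustrative choices, not numbers from this series.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, -1.0]  # illustrative values only
before = softmax(scores)

bumped = list(scores)
bumped[0] += 0.5                # raise only the first score
after = softmax(bumped)

for j, (b, a) in enumerate(zip(before, after)):
    print(f"p_{j}: {b:.4f} -> {a:.4f}  (change {a - b:+.4f})")
```

Only one score was touched, yet every line of the printout should show a change: the bumped class gains and all the others lose.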

First piece: how the loss feels changes in probability

Start with \(\partial L / \partial p_j\). The loss is \(L = -\log p_y\), where \(y\) is the true class. It depends on exactly one probability, \(p_y\), and not on the others.

Differentiate:

\(\dfrac{\partial L}{\partial p_y} = -\dfrac{1}{p_y}, \qquad \dfrac{\partial L}{\partial p_j} = 0 \ \text{ for all } j \ne y\)

Two things worth noticing. When \(p_y\) is large, the gradient is small. The model is already close to right, so don't push hard. When \(p_y\) is tiny, the gradient is huge. The model was confidently wrong, and the loss wants a strong correction. The sign is negative because we're measuring how \(L\) responds when \(p_y\) goes up, and increasing \(p_y\) decreases the surprise.

The loss listens to exactly one probability. Every other entry is ignored. This will matter in a minute.
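As a sanity check on \(-1/p_y\), the sketch below compares it with a central finite difference of \(-\log p_y\). The probe probabilities are illustrative; \(e^{-3.2}\) is included because it is the probability that corresponds to the loss of 3.2 used earlier.

```python
import math

def loss(p_y):
    """Surprise at the true class: L = -log p_y."""
    return -math.log(p_y)

def dloss_dp(p_y):
    """Analytic derivative of the loss with respect to p_y."""
    return -1.0 / p_y

checks = []
for p_y in (0.9, 0.5, math.exp(-3.2), 0.001):
    eps = 1e-4 * p_y  # step scaled to p_y so the probe stays inside (0, 1)
    numeric = (loss(p_y + eps) - loss(p_y - eps)) / (2 * eps)
    checks.append((p_y, dloss_dp(p_y), numeric))
    print(f"p_y={p_y:.4f}  analytic={dloss_dp(p_y):+.3f}  numeric={numeric:+.3f}")
```

The magnitude grows as \(p_y\) shrinks, which is the "huge gradient when confidently wrong" behaviour described above.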

Second piece: how one score shakes every probability

Now the harder derivative, \(\partial p_j / \partial s_k\). Bump score \(s_k\) up by a hair. What happens to \(p_j\)?

Two effects fight for control. If \(j = k\), the numerator \(e^{s_j}\) grows along with the denominator. If \(j \ne k\), only the denominator grows. In both cases, the shared denominator couples everything together.

Working through the algebra (product rule on \(e^{s_j} / Z\), with \(Z = \sum_m e^{s_m}\)) gives a compact answer:

\(\dfrac{\partial p_j}{\partial s_k} = p_j \left(\delta_{jk} - p_k\right)\)

Here \(\delta_{jk}\) is 1 if \(j = k\) and 0 otherwise. Two cases live inside this formula.

When \(j = k\): \(\partial p_k / \partial s_k = p_k(1 - p_k)\). Positive but bounded. Raising a score raises its own probability, with diminishing returns as \(p_k\) approaches 1.

When \(j \ne k\): \(\partial p_j / \partial s_k = -p_j p_k\). Negative. Raising one score steals mass from every other class in proportion to how much mass they each had.

Changing one score never changes one probability. It redistributes them all.
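The formula \(p_j(\delta_{jk} - p_k)\) can be checked against brute-force finite differences. This is a minimal sketch with hypothetical helper names (`softmax_jacobian`, `numeric_jacobian`) and illustrative scores; it is not code from the series.

```python
import math

def softmax(scores):
    shift = max(scores)  # numerical stability; does not change the output
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(probs):
    """J[k][j] = dp_j/ds_k = p_j * (delta_jk - p_k)."""
    n = len(probs)
    return [[probs[j] * ((1.0 if j == k else 0.0) - probs[k]) for j in range(n)]
            for k in range(n)]

def numeric_jacobian(scores, eps=1e-6):
    """Same matrix, estimated by nudging each score up and down by eps."""
    n = len(scores)
    rows = []
    for k in range(n):
        up, down = list(scores), list(scores)
        up[k] += eps
        down[k] -= eps
        p_up, p_down = softmax(up), softmax(down)
        rows.append([(p_up[j] - p_down[j]) / (2 * eps) for j in range(n)])
    return rows

scores = [1.0, 0.0, -0.5]  # illustrative values only
analytic = softmax_jacobian(softmax(scores))
numeric = numeric_jacobian(scores)

max_gap = max(abs(a - b)
              for row_a, row_b in zip(analytic, numeric)
              for a, b in zip(row_a, row_b))
print(f"largest disagreement: {max_gap:.2e}")
```

The diagonal entries come out positive and the off-diagonal entries negative, matching the two cases above.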

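[Interactive figure: One score changes every probability at once. Sliders adjust the raw scores; a 5×5 grid shows ∂p_j/∂s_k, with one row per score being nudged and one column per probability affected (p₁ to p₅).]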

The matrix shows ∂p_j/∂s_k for a five-class softmax. Each row is one score being nudged upward by a tiny amount. Each column is the probability that responds. Diagonal cells (outlined) are positive: raising a score helps its own probability. Off-diagonal cells are negative: every other probability takes a hit. Move the sliders to change the underlying distribution. The "balanced" preset is where the dynamics are strongest. Every bump redistributes real mass. The "sharp" preset shows softmax saturation: when one class already dominates, nothing much can move, because there's little mass left to redistribute. Notice that every row sums to exactly zero. Raising a score creates no probability, only shuffles it.
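The same two observations, zero row sums and saturation, can be reproduced without the widget. In the sketch below the "balanced" and "sharp" score vectors are assumptions chosen to mimic the presets; the actual preset values are not given in the text.

```python
import math

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def jacobian(probs):
    """Row k holds dp_j/ds_k for every j."""
    n = len(probs)
    return [[probs[j] * ((1.0 if j == k else 0.0) - probs[k]) for j in range(n)]
            for k in range(n)]

# Assumed stand-ins for the two presets described in the caption.
presets = {
    "balanced": [0.0, 0.0, 0.0, 0.0, 0.0],
    "sharp": [8.0, 0.0, 0.0, 0.0, 0.0],
}

row_sums = {}
largest_entry = {}
for name, scores in presets.items():
    J = jacobian(softmax(scores))
    row_sums[name] = [sum(row) for row in J]
    largest_entry[name] = max(abs(x) for row in J for x in row)
    print(f"{name}: largest |entry| = {largest_entry[name]:.5f}, "
          f"worst row sum = {max(abs(s) for s in row_sums[name]):.1e}")
```

Under these assumed presets, the sharp distribution's largest entry should come out far smaller than the balanced one's, and every row sum should be zero up to floating-point rounding.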

The collapse

Two pieces in hand. Substitute them into the chain rule:

\(\dfrac{\partial L}{\partial s_k} = \displaystyle\sum_{j=1}^{K} \left(-\dfrac{q_j}{p_j}\right) \cdot p_j (\delta_{jk} - p_k)\)

Here I've written \(\partial L / \partial p_j = -q_j / p_j\), which generalises \(-1/p_y\): under a one-hot target, \(q_j = 1\) when \(j = y\) and 0 otherwise. Notice the \(p_j\) in the numerator of the softmax derivative and the \(p_j\) in the denominator of the loss derivative. They cancel.

\(\dfrac{\partial L}{\partial s_k} = \displaystyle\sum_{j=1}^{K} \left(-q_j\right)(\delta_{jk} - p_k)\)

Now split the sum into two:

\(\dfrac{\partial L}{\partial s_k} = -\displaystyle\sum_{j=1}^{K} q_j \delta_{jk} + p_k \sum_{j=1}^{K} q_j\)

Each piece simplifies cleanly. The first sum has \(\delta_{jk}\) selecting exactly one term, \(j = k\), giving \(-q_k\). The second sum is \(p_k\) times the sum of a probability distribution, which is 1. So:

\(\boxed{\dfrac{\partial L}{\partial s_k} = p_k - q_k}\)

The learning signal is just the gap: what the model believes about class k, minus what the truth says about class k.

That's the whole gradient. No logarithms left over. No reciprocals about to explode. Just a difference between two probabilities, one per class.
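To see the collapse happen numerically, the sketch below computes the gradient three ways: the collapsed formula \(p_k - q_k\), the full chain-rule sum over \(j\), and central finite differences on the loss itself. The scores and the true class are illustrative assumptions.

```python
import math

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, y):
    """L = -log p_y for a one-hot target on class y."""
    return -math.log(softmax(scores)[y])

scores = [0.3, -1.2, 2.0, 0.1]  # illustrative values only
y = 1                           # illustrative true class
K = len(scores)

p = softmax(scores)
q = [1.0 if k == y else 0.0 for k in range(K)]

# Route 1: the collapsed formula.
gap = [p[k] - q[k] for k in range(K)]

# Route 2: the chain rule, summing over every path j.
chain = []
for k in range(K):
    total = 0.0
    for j in range(K):
        dL_dp = -q[j] / p[j]
        dp_ds = p[j] * ((1.0 if j == k else 0.0) - p[k])
        total += dL_dp * dp_ds
    chain.append(total)

# Route 3: central finite differences on the loss.
eps = 1e-6
numeric = []
for k in range(K):
    up, down = list(scores), list(scores)
    up[k] += eps
    down[k] -= eps
    numeric.append((cross_entropy(up, y) - cross_entropy(down, y)) / (2 * eps))

for k in range(K):
    print(f"k={k}: gap={gap[k]:+.6f}  chain={chain[k]:+.6f}  numeric={numeric[k]:+.6f}")
```

All three columns should agree to within rounding, with a negative entry on the true class and positive entries everywhere else.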

The cancellation is not an accident. The \(p_j\) that appeared in the loss derivative (from differentiating \(-\log p_y\)) is exactly the one that appeared in the softmax derivative (from the product rule on \(e^{s_j}/Z\)). Cross-entropy and softmax fit together so cleanly because the logarithm in the loss undoes the exponential in the softmax. Other pairings tend to leave uglier algebra and worse gradients: squared error on softmax outputs, for example, keeps the \(p_j(\delta_{jk} - p_k)\) factor, so the signal on the true score shrinks toward zero precisely when the model is confidently wrong.

Reading the signal

Once \(\partial L / \partial s_k = p_k - q_k\) is on the page, the whole training dynamic becomes easy to read.

For the true class \(k = y\): the target says \(q_y = 1\), so the gradient is \(p_y - 1\). Since \(p_y \le 1\), this is never positive, and it is strictly negative unless the model is already perfect. Gradient descent subtracts the gradient scaled by the learning rate \(\eta\), so it adds \(\eta(1 - p_y)\) to \(s_y\). The true score goes up. The more underconfident the model was, the bigger the push.

For every wrong class \(k \ne y\): the target says \(q_k = 0\), so the gradient is just \(p_k\). Gradient descent subtracts this, pulling \(s_k\) down by \(\eta\) times the probability the model had wasted on that wrong class. Classes the model already deemed unlikely barely move. Classes the model was dangerously confident about take a hard hit.
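To make both cases concrete, take a hypothetical three-class example (the numbers are invented for illustration) in which the true class is 2:

\(p = (0.7,\ 0.2,\ 0.1), \qquad q = (0,\ 1,\ 0), \qquad p - q = (0.7,\ -0.8,\ 0.1)\)

With learning rate \(\eta\), the score updates are \(-\eta\,(p - q) = (-0.7\eta,\ +0.8\eta,\ -0.1\eta)\). The true score rises by \(0.8\eta\), the overconfident wrong class drops by \(0.7\eta\), and the minor wrong class barely moves.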

Pull the truth up by how far it fell short. Push every wrong answer down by how much it stole.

The updates balance. The amount added to the true score exactly matches the total removed from the wrong scores, because \(\sum_k (p_k - q_k) = 1 - 1 = 0\). Softmax keeps total probability at 1 regardless, so no probability mass is created or destroyed. Training just moves it to where it belongs.
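Here is a single update written out, as a sketch that treats the scores themselves as the things being updated (the same simplification this post makes). The scores, the true class, and the learning rate `eta` are illustrative assumptions.

```python
import math

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

eta = 0.5                  # assumed learning rate
scores = [1.0, 2.5, 0.2]   # illustrative scores; the model favours class 1
y = 0                      # illustrative true class

p = softmax(scores)
grad = [p[k] - (1.0 if k == y else 0.0) for k in range(len(scores))]

# Gradient descent subtracts eta times the gradient from each score.
updates = [-eta * g for g in grad]
new_scores = [s + u for s, u in zip(scores, updates)]
new_p = softmax(new_scores)

for k in range(len(scores)):
    print(f"class {k}: score {scores[k]:+.3f} -> {new_scores[k]:+.3f}, "
          f"prob {p[k]:.3f} -> {new_p[k]:.3f}")
print(f"sum of score updates: {sum(updates):+.2e}")
```

The true score should move up, the wrong scores down, and the updates should sum to zero up to rounding.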

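[Interactive figure: Watch the push-pull land. Controls set the true class and the raw scores (logits). A bar chart shows the learning signal p_k − q_k for each score: bars to the right mean push down (wrong class), bars to the left mean push up (true class). A step counter and the current loss are displayed alongside.]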

Pick a true class, set your initial scores, and hit "Apply one step." Every score moves by −η × (p_k − q_k). The true class gets a positive kick, and every wrong class gets a proportional pullback. The probabilities on the left redistribute in real time. Run multiple steps or hit converge to watch the distribution slide toward a one-hot over the true class. Loss decreases monotonically. Nothing external coordinates this. Every score is just responding to its own personal gap from the truth.
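The repeated-step behaviour can be imitated in a few lines. This sketch updates the scores directly, as the widget is described as doing; the starting scores, true class, learning rate, and step count are assumptions rather than the widget's actual settings.

```python
import math

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

eta = 1.0                        # assumed learning rate
scores = [0.0, 2.0, 1.0, -1.0]   # illustrative start; the model favours class 1
y = 0                            # illustrative true class
steps = 50

probs_start = softmax(scores)
losses = [-math.log(probs_start[y])]

for _ in range(steps):
    p = softmax(scores)
    # Each score responds only to its own gap p_k - q_k.
    scores = [s - eta * (p[k] - (1.0 if k == y else 0.0))
              for k, s in enumerate(scores)]
    losses.append(-math.log(softmax(scores)[y]))

probs_end = softmax(scores)
print(f"p_y: {probs_start[y]:.3f} -> {probs_end[y]:.3f}")
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

In this setup the recorded loss should fall at every step while the probability on the true class climbs toward 1.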

Why small probabilities don't kill the learning

There's a subtle worry lurking. The softmax-derivative formula said \(\partial p_k / \partial s_k = p_k(1 - p_k)\). When \(p_k\) is tiny, this is also tiny. That sounds bad. If the model assigned \(p_y = 0.001\) to the truth, the softmax derivative says the true class barely responds to score changes. Learning should stall.

Except it doesn't. Because the loss gradient is \(-1/p_y\), which blows up exactly as fast as the softmax gradient shrinks. Watch the product:

\(\dfrac{\partial L}{\partial s_y} = \dfrac{\partial L}{\partial p_y} \cdot \dfrac{\partial p_y}{\partial s_y} = \left(-\dfrac{1}{p_y}\right) \cdot p_y(1 - p_y) = p_y - 1\)

The dangerous factor \(1/p_y\) cancels against the shrinking factor \(p_y\). What's left is bounded between −1 and 0. When the model is catastrophically wrong (\(p_y \to 0\)), the gradient on the true score approaches \(-1\), the strongest correction possible. When the model is already right (\(p_y \to 1\)), the gradient goes to 0, as it should. No exploding gradient, and no stall while the model is still wrong.
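The cancellation is easy to tabulate. The sketch below evaluates the two misbehaving factors and their product at a handful of illustrative probabilities.

```python
rows = []
for p in (0.001, 0.01, 0.1, 0.5, 0.9, 0.99):
    softmax_sens = p * (1 - p)          # dp_y/ds_y: tiny near 0 and near 1
    loss_sens = -1 / p                  # dL/dp_y: blows up near 0
    product = loss_sens * softmax_sens  # dL/ds_y: should equal p - 1
    rows.append((p, softmax_sens, loss_sens, product))
    print(f"p={p:<6} dp/ds={softmax_sens:.6f}  dL/dp={loss_sens:>10.2f}  dL/ds={product:+.4f}")
```

However extreme the first two columns get, the last column should stay between −1 and 0.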

[Interactive figure: Two curves that misbehave. One that doesn't. A slider sets the probability on the true class (initially 0.50). Legend: ∂p/∂s = p(1 − p) saturates near 0 and 1; ∂L/∂p = −1/p blows up near 0; ∂L/∂s = p − 1 stays bounded.]

Three functions on the same axis. The blue curve is the softmax's sensitivity to its own score: tiny near 0 and near 1, largest at p = 0.5. The red curve is the loss's sensitivity to probability: it explodes as p shrinks. Each one, alone, would produce pathological learning. Their product is the orange curve, and it's tame: a straight line from −1 at the left edge to 0 at the right. Paired together, cross-entropy and softmax cancel each other's worst tendencies. Drag the slider and watch the red curve tower while the orange one holds steady.

Putting it together

We started with a scalar loss and one obvious question: how does each score need to change? The chain rule gave us a two-piece expression. The first piece said the loss only cares about the true class. The second piece said each score shakes every probability at once. Substituting them both into the chain rule, something collapsed. The probabilities in numerator and denominator cancelled. The sum over classes reduced to two terms, and both simplified. What was left was \(p_k - q_k\): a gap between what the model believes and what the truth requires.

That expression is more than a convenience. It is the entire meaning of a gradient step in a softmax-cross-entropy model. Every coordinate of every score update is just "how far off am I on this class, and in which direction." Nothing fancier is going on.

Training a softmax classifier is, mathematically, the process of closing a gap. One probability at a time.

Where we go next

We now know how to update scores. But scores aren't parameters. They're computed from embeddings, via dot products of centre and context vectors. The gradient \(p_k - q_k\) tells us how each score should move. The remaining question is how that desired score movement translates into a change in the embedding matrices W and C. The next post walks through that one carefully: how a push on the score becomes a push on the geometry of the space.
