Before Transformers: How Embeddings Learn Meaning

Why −log P?

From measuring a single prediction to measuring the distance between two distributions.

Where we stand

We have two distributions now. \(P_{\text{data}}\) is encoded implicitly by the corpus, revealed through repeated sampling of centre-context pairs. \(P_{\text{model}}\) comes from vector geometry, via a softmax over dot products. Training means reshaping the geometry until these two distributions agree.

Agreement needs a number. We need a single quantity that says: this model's belief disagrees with the data by this much. A scalar we can watch go down as training progresses. Something we can differentiate to get a direction for the vectors to move.

We have a pair of distributions to compare. We need a way to score the comparison.

The answer lives inside one innocuous-looking expression: \(-\log P_{\text{model}}(x)\). This post is about where that minus sign and that logarithm come from, and why no other choice would do the job.

What a good penalty function should look like

Fix a single training example. The corpus has handed us a real pair like (my, dog). The model has produced a probability for the correct context: \(p = P_{\text{model}}(\text{dog} \mid \text{my})\). This one number expresses the model's belief about this specific event.

We want to convert \(p\) into a penalty \(f(p)\). The properties we want are easy to state.

When the model is right and confident, with \(p\) close to 1, the penalty should be near zero. When the model is right but hesitant, the penalty should be modest. When the model is confidently wrong about the truth, with \(p\) near zero, the penalty should be severe. The penalty has to be smooth, because gradient descent needs continuous feedback, not step functions. And critically, it should treat different regions of \(p\) asymmetrically. Going from 0.9 to 0.8 shouldn't hurt much. Going from 0.1 to 0.001 should hurt a lot more, even though the absolute change in \(p\) is slightly smaller (0.099 versus 0.1).

That last requirement is the one most candidates fail.

Why the obvious candidates don't work

The simplest idea is \(f(p) = 1 - p\). When the model is right, penalty zero. When the model assigns zero probability to the truth, penalty one. Feels reasonable. But look at what it says about confident nonsense. If the model assigns \(p = 0.1\) to the true word, the penalty is 0.9. If it assigns \(p = 10^{-6}\), the penalty is 0.999999. The difference between "bad" and "catastrophically bad" is a rounding error. The function saturates. Language models that hallucinate with high confidence need to feel the sting of that confidence, and this function can't deliver it.

Squaring it and using \(f(p) = (1-p)^2\) reshapes the curve but doesn't solve the ceiling. Because \(1-p\) lies between 0 and 1, squaring only makes each penalty smaller, and the result is still bounded above by 1. Confidently wrong and mildly wrong still look similar.

Going the other way with \(f(p) = 1/p\) blows up near zero, which is the right instinct. But now \(p = 1\) gives penalty 1 instead of zero. And its slope, \(-1/p^2\), explodes too fast as \(p\) shrinks, so gradients become unstable. Useful energy, wrong shape.

What we need is a function that vanishes at \(p=1\), grows without bound as \(p\) approaches zero, and does so smoothly. One function fits.

Figure (interactive in the original): Four candidates. Only one has the right shape. A slider sets the probability the model assigned to the true word (shown at 0.30), and each candidate penalty is evaluated at that point.

The first three candidates are bounded or badly behaved. Only the fourth has the right shape: quiet when the model is correct, merciless when the model is confidently wrong, smooth in between. That fourth curve is \(-\log p\).
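To see the four shapes side by side without the slider, here is a minimal sketch that evaluates each candidate at a handful of probabilities. The function name `penalties` and the chosen sample points are illustrative assumptions, not part of the original post.

```python
import math

def penalties(p):
    """Return the four candidate penalties for a probability p in (0, 1]."""
    return {
        "1 - p": 1.0 - p,
        "(1 - p)^2": (1.0 - p) ** 2,
        "1 / p": 1.0 / p,
        # -log p written as log(1/p), so p = 1 yields 0.0 rather than -0.0
        "-log p": math.log(1.0 / p),
    }

# Sample points are arbitrary: from certain-and-right down to confidently wrong.
for p in (1.0, 0.9, 0.5, 0.1, 1e-3, 1e-6):
    row = penalties(p)
    cells = "  ".join(f"{name} = {value:<10.4g}" for name, value in row.items())
    print(f"p = {p:<8g} {cells}")
```

Reading down the columns, the first two never exceed 1, the third never drops below 1, and only the last is zero at \(p = 1\) while still growing without bound as \(p\) shrinks.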

Enter −log p

The function \(f(p) = -\log p\) does exactly what we need. Using the natural logarithm: at \(p = 1\) it's zero, at \(p = 0.5\) it's about 0.69, at \(p = 0.1\) it's about 2.3, and at \(p = 0.01\) it's about 4.6. As \(p\) approaches zero it goes to infinity, smoothly, without any nasty numerical cliffs.
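The asymmetry requirement from earlier can be checked with a few lines of arithmetic. This is a small sketch, assuming the natural logarithm; the helper name `surprise` is ours.

```python
import math

def surprise(p):
    """Penalty -log p (natural log) for a probability p in (0, 1]."""
    return -math.log(p)

# The values quoted in the text, rounded to two decimals.
for p in (0.5, 0.1, 0.01):
    print(f"-log({p}) = {surprise(p):.2f}")

# Two drops in p of nearly equal size, with very different extra penalty.
mild = surprise(0.8) - surprise(0.9)      # slipping from 0.9 to 0.8
severe = surprise(0.001) - surprise(0.1)  # slipping from 0.1 to 0.001
print(f"0.9 -> 0.8   adds {mild:.3f}")
print(f"0.1 -> 0.001 adds {severe:.3f}")
```

The first drop adds \(\log(0.9/0.8)\), roughly 0.12. The second adds \(\log 100\), roughly 4.6. What matters to \(-\log p\) is the ratio between the probabilities, not the gap between them.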

Information theory has a name for this quantity. It's the surprise (also called surprisal or self-information) of an event with probability \(p\). An event that was almost certain is unsurprising when it happens. An event that was almost impossible is shocking when it happens. \(-\log p\) quantifies that shock.

\(-\log p\) is the surprise of the truth, given what the model believed.

For a single training example with true context \(w_o\) given centre \(w_i\), the loss is:

\(\ell(w_i, w_o) = -\log P_{\text{model}}(w_o \mid w_i)\)

Small when the model had already assigned high probability to the truth. Large when the model thought the truth was unlikely. Infinite in the limit where the model assigned zero probability, which is exactly the behaviour we want. A model that says "this could never happen" and then sees it happen is maximally wrong.
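As a rough sketch of how this per-example loss could be computed, the snippet below builds \(P_{\text{model}}\) from a softmax over dot products and then takes \(-\log\) of the probability assigned to the observed context word. The vectors, the four-word vocabulary, and the split into separate centre and context tables are all made-up assumptions for illustration.

```python
import math

# Hypothetical 3-dimensional vectors, chosen only for illustration.
centre_vectors = {"my": [0.9, 0.1, -0.3]}
context_vectors = {
    "dog": [1.2, 0.0, -0.5],
    "cat": [0.8, 0.3, -0.2],
    "the": [-0.4, 0.9, 0.1],
    "ran": [0.1, -0.7, 0.6],
}

def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def model_distribution(centre):
    """Softmax over dot products between the centre vector and each context vector."""
    scores = {w: dot(centre_vectors[centre], v) for w, v in context_vectors.items()}
    top = max(scores.values())  # subtract the max before exponentiating, for stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

def example_loss(centre, context):
    """Per-example loss: -log P_model(context | centre)."""
    return -math.log(model_distribution(centre)[context])

probs = model_distribution("my")
print({w: round(p, 3) for w, p in probs.items()})
print("loss for (my, dog):", round(example_loss("my", "dog"), 3))
print("loss for (my, ran):", round(example_loss("my", "ran"), 3))
```

With these toy vectors, "dog" has the largest dot product with "my", so the pair (my, dog) earns a smaller loss than (my, ran).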

From one example to a distribution comparison

One example is a single data point. The corpus hands us many of them. Average the surprise over all training pairs:

\(\displaystyle \frac{1}{N} \sum_{n=1}^{N} -\log P_{\text{model}}(w_o^{(n)} \mid w_i^{(n)})\)

Recall the fact from the last post: the training pairs are samples from \(P_{\text{data}}\). Frequent pairs appear often, rare pairs appear rarely. So the average above isn't arbitrary. As the number of pairs grows, it approaches the expected surprise, weighted by how reality actually happens:

\(\mathbb{E}_{(w_i, w_o) \sim P_{\text{data}}}\left[-\log P_{\text{model}}(w_o \mid w_i)\right]\)

This quantity has a name: the cross-entropy of \(P_{\text{model}}\) relative to \(P_{\text{data}}\), written \(H(P_{\text{data}}, P_{\text{model}})\). It measures the average shock the model experiences as the corpus plays out.

Cross-entropy is the average surprise of reality, when judged by the model's beliefs.

When \(P_{\text{model}}\) and \(P_{\text{data}}\) agree, the model is surprised no more than the data's own randomness forces, and cross-entropy sits at its minimum, the entropy of \(P_{\text{data}}\). When they disagree, the model is constantly shocked and cross-entropy is high. The single scalar we wanted has appeared, not by decree, but by averaging a well-motivated per-example loss over the data.
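The link between the average over training pairs and the expectation can be checked numerically. The sketch below is a simplification: it fixes a single centre word and uses two invented distributions over four context words, then compares the exact cross-entropy with the average surprise over samples drawn from the data distribution.

```python
import math
import random

# Hypothetical distributions over context words for one fixed centre word.
p_data = {"dog": 0.5, "cat": 0.3, "car": 0.15, "idea": 0.05}
p_model = {"dog": 0.4, "cat": 0.3, "car": 0.2, "idea": 0.1}

def cross_entropy(q, p):
    """Expected surprise -sum_k q_k log p_k of outcomes drawn from q, judged by p."""
    return -sum(q[w] * math.log(p[w]) for w in q if q[w] > 0)

def average_surprise(samples, p):
    """Mean of -log p(w) over a list of observed words."""
    return sum(-math.log(p[w]) for w in samples) / len(samples)

rng = random.Random(0)  # fixed seed so the run is repeatable
words = list(p_data)
weights = [p_data[w] for w in words]
samples = rng.choices(words, weights=weights, k=100_000)

exact = cross_entropy(p_data, p_model)
estimate = average_surprise(samples, p_model)
print(f"exact cross-entropy : {exact:.4f}")
print(f"average over samples: {estimate:.4f}")
```

The two numbers should agree closely, and the gap shrinks as the number of samples grows. Nothing in `average_surprise` mentions `p_data`: the weighting comes entirely from how often each word shows up in the samples.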

Figure (interactive in the original): Two models trying to match the same truth. Panels: Model A (uniformly confused), Model B (almost correct), and the ground truth \(P_{\text{data}}(\cdot \mid \text{my})\). Sliders change each model's belief.

Three panels. Orange at the bottom is what the corpus actually shows. Purple and blue are two candidate models. Each candidate gets its own cross-entropy score, which responds as its beliefs change. A model that puts its probability mass where the data does gets a low score. A model that spreads itself thin or points in the wrong direction gets a high one. You can only lower a model's loss by making its beliefs look more like \(P_{\text{data}}\).
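A static version of the same comparison can be sketched in a few lines. The numbers below are invented stand-ins for the ground truth, Model A, and Model B. The last line also scores the data distribution against itself, which gives the entropy of the data and is the lowest score any model can reach.

```python
import math

def cross_entropy(q, p):
    """H(q, p) = -sum_k q_k log p_k, skipping outcomes the truth never produces."""
    return -sum(q[w] * math.log(p[w]) for w in q if q[w] > 0)

# Hypothetical ground truth for the centre word "my".
p_data = {"dog": 0.6, "cat": 0.25, "car": 0.1, "idea": 0.05}

model_a = {w: 0.25 for w in p_data}                              # uniformly confused
model_b = {"dog": 0.55, "cat": 0.25, "car": 0.12, "idea": 0.08}  # almost correct

loss_a = cross_entropy(p_data, model_a)
loss_b = cross_entropy(p_data, model_b)
floor = cross_entropy(p_data, p_data)  # entropy of p_data

print(f"Model A: {loss_a:.4f}")
print(f"Model B: {loss_b:.4f}")
print(f"Floor  : {floor:.4f}")
```

Model B lands just above the floor, while the uniform Model A scores \(\log 4\) no matter what the data looks like.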

From cross-entropy to a familiar single-term expression

Cross-entropy in its general form is a sum over all possible outcomes:

\(\displaystyle H(q, p) = -\sum_{k=1}^{K} q_k \log p_k\)

Here \(q\) is the true distribution and \(p\) is the model's. For each possible class \(k\), we weight the surprise \(-\log p_k\) by how much probability the truth assigns to that class.

In classification, including word prediction, the truth for a single observed example is sharp. If the context word is "dog", the truth says dog happened, everything else didn't. This makes \(q\) a one-hot vector: \(q_y = 1\) for the true class \(y\), and \(q_k = 0\) for every other \(k\).

Substitute this one-hot \(q\) into the sum:

\(H(q, p) = -\left[0 \cdot \log p_1 + \cdots + 1 \cdot \log p_y + \cdots + 0 \cdot \log p_K\right]\)

Every term except one gets multiplied by zero. The whole sum collapses to a single term:

\(H(q, p) = -\log p_y\)

The sum doesn't disappear because the other classes don't matter. It disappears because the truth, on this example, pointed at exactly one of them.
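The collapse is easy to confirm numerically. In this sketch the vocabulary and the model probabilities are made up, and the model probabilities are assumed to be strictly positive so that every logarithm is finite.

```python
import math

def cross_entropy(q, p):
    """General form: H(q, p) = -sum_k q_k * log p_k, over lists of equal length."""
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p))

vocab = ["dog", "cat", "car", "idea"]  # hypothetical four-word vocabulary
p = [0.7, 0.2, 0.08, 0.02]             # hypothetical model probabilities
y = vocab.index("dog")                 # the word that actually occurred
q = [1.0 if k == y else 0.0 for k in range(len(vocab))]  # one-hot truth

full_sum = cross_entropy(q, p)   # all K terms of the general formula
single_term = -math.log(p[y])    # the one term that survives

print(f"full sum   : {full_sum:.6f}")
print(f"single term: {single_term:.6f}")
```

Changing `p[1]`, `p[2]`, or `p[3]` while holding `p[y]` fixed leaves both numbers untouched, because those entries are multiplied by zero in the sum.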

Figure (interactive in the original): Watch the sum collapse. The reader picks which word actually occurred, and each row is marked either as the surviving term or as a term killed by \(q_k = 0\).

Each row is one term of the cross-entropy sum: \(-q_k \log p_k\). Pick which word actually occurred. Every row where \(q_k = 0\) contributes zero regardless of what the model thinks. The row where \(q_k = 1\) is the only one that survives. What's left is a single term: the model's negative log-probability for the thing that happened. No classification shortcut. No special rule. Just the general cross-entropy formula applied to a distribution that's certain about what it saw.

This is why \(-\log p_y\) shows up everywhere classification appears. It isn't a separate loss function. It's what cross-entropy always reduces to when the truth is deterministic, which is exactly the situation we're in, example by example, as training pairs stream out of the corpus.

Putting it together

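We wanted a scalar score for how wrong a model's probability is. Natural candidates like \(1-p\), \((1-p)^2\), and \(1/p\) all failed: the first two cap the penalty at 1, and the third never reaches zero and blows up too abruptly. The function \(-\log p\) vanishes at \(p = 1\) and grows without bound as \(p\) approaches zero, with no numerical cliffs.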

For a single example, this gave a per-example loss equal to the model's surprise at seeing the truth. Averaging that surprise over all training pairs, weighted by how frequently the corpus actually produced each pair, gave us cross-entropy: the average surprise of reality under the model's beliefs.

And because the truth on any single example is deterministic, the general cross-entropy formula collapses to exactly what we started with: \(-\log p_y\). The loss we'll actually use is the loss we were building toward all along.

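\(-\log P\) isn't a trick or a convention. It's the function that measures disagreement between a deterministic truth and a soft belief in the way learning needs: zero when the model is certain and right, unbounded when it is certain and wrong, and smooth in between.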

Where we go next

Cross-entropy gives us a number. Training needs more than a number. To actually improve, the model has to know which scores to change and by how much. The next post is about turning this scalar loss into a direction: how a single number, \(-\log p_y\), produces a push on every one of the model's raw scores, and why the resulting learning signal turns out to be the simplest possible expression: \(p_k - q_k\).
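As a preview, the claim that the learning signal is \(p_k - q_k\) can be sanity-checked numerically. The sketch below assumes the model's probabilities come from a softmax over raw scores and compares \(p_k - q_k\) with a finite-difference estimate of how the loss changes when each score is nudged. The scores are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(scores, y):
    """-log p_y, where p = softmax(scores)."""
    return -math.log(softmax(scores)[y])

scores = [2.0, 0.5, -1.0, 0.1]  # hypothetical raw scores
y = 0                           # index of the word that occurred
p = softmax(scores)
q = [1.0 if k == y else 0.0 for k in range(len(scores))]
analytic = [pk - qk for pk, qk in zip(p, q)]

# Central finite differences: nudge one score at a time and watch the loss.
eps = 1e-6
numeric = []
for k in range(len(scores)):
    up = list(scores)
    down = list(scores)
    up[k] += eps
    down[k] -= eps
    numeric.append((loss(up, y) - loss(down, y)) / (2 * eps))

for k, (a, n) in enumerate(zip(analytic, numeric)):
    print(f"score {k}: p_k - q_k = {a:+.6f}   finite difference = {n:+.6f}")
```

The two columns should match to several decimal places. The true word's score gets a negative entry, so gradient descent raises it, and every other score gets a positive entry, so gradient descent lowers it.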
