Before Transformers: How Embeddings Learn Meaning

Two Distributions, One Goal

How repeated experience turns random probabilities into meaningful pressure.

The gap between form and substance

Last time we turned raw dot products into a probability distribution over context words. Mathematically the distribution is well formed: every entry is positive, the entries sum to one, and together they express relative preference. But every probability we've computed so far used vectors that were chosen arbitrarily. The maths worked on placeholders.

A distribution can be valid in form and completely wrong in substance. It can express preferences that have nothing to do with language. At this stage, if the model believes something, it believes it by coincidence.

We've built a way to express belief. What remains is to teach the model what to believe.

Now that probability is on the table, a new possibility opens up. Beliefs can be compared against what actually happens in text. When the model assigns high probability to the wrong context, that mismatch becomes visible. Anything visible can exert pressure.

Two distributions, one question

The probability distribution from our vectors is \(P_{\text{model}}(c \mid v)\). It has all the right mathematical properties, but it expresses the preferences of the model, not the preferences of the data.

Look at the co-occurrence matrix from a new angle. Fix a centre word \(v\). Look across its row. Each entry counts how many times a context word \(c\) appeared near \(v\). Sum the row and divide each entry by that total:

\(P_{\text{data}}(c \mid v) = \frac{\text{count}(v, c)}{\sum_{c'} \text{count}(v, c')}\)

This is the data's own conditional distribution. It answers: given that \(v\) appeared, how likely was it to be accompanied by \(c\)? It's empirical. It reflects what actually happened.

Build a corpus. Watch each row become a distribution.

Each row sums to exactly 1. That's what makes it a probability distribution. Look at the "dog" and "cat" rows in the default corpus — both put all their mass on "my", meaning P(my | dog) = P(my | cat) = 1.00. The row for "love" distributes its mass between "I", "You" and "my". Add sentences and watch rows redistribute. Nothing here is designed. The distribution is whatever the sentences produce.
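The row-normalisation can be sketched in a few lines of Python. The four sentences below are an assumption: the default corpus is not listed in the text, so this is a reconstruction chosen to match the counts the article quotes, with a window of one word on each side. The helper names `cooccurrence_counts` and `p_data` are illustrative.

```python
from collections import Counter, defaultdict

# Assumed toy corpus: a reconstruction consistent with the counts in the text,
# not necessarily the exact default corpus.
SENTENCES = [
    "I love my dog",
    "I love my cat",
    "You love my dog",
    "You love my cat",
]

def cooccurrence_counts(sentences, window=1):
    """Count how often each context word appears within `window` of each centre word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, centre in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[centre][words[j]] += 1
    return counts

def p_data(counts):
    """Divide every row by its own total so that each row sums to 1."""
    return {
        centre: {c: n / sum(row.values()) for c, n in row.items()}
        for centre, row in counts.items()
    }

P_DATA = p_data(cooccurrence_counts(SENTENCES))
for centre, row in P_DATA.items():
    print(centre, {c: round(p, 2) for c, p in row.items()})
```

Under this assumed corpus, the "my" row comes out as love 0.50, dog 0.25, cat 0.25, and the "dog" and "cat" rows put all their mass on "my".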

We have two distributions answering the same question over the same vocabulary, from different sources. \(P_{\text{data}}\) comes from observed language. \(P_{\text{model}}\) comes from vector geometry. Every difference between them is a disagreement. Disagreement is now measurable.

Learning begins when these two distributions are placed side by side. From here on, learning means aligning \(P_{\text{model}}\) with \(P_{\text{data}}\).

One vocabulary, two beliefs

[Interactive chart: choose a centre word and a training progress level (starting at 0%) to compare P_data, what the corpus says, with P_model, what the current vectors say. A readout shows the total disagreement, the sum of |P_data − P_model|.]

Orange bars show P_data — the distribution the corpus has already handed us. Blue bars show P_model — what a freshly initialised set of vectors produces. At 0% training the two distributions disagree all over the place. Drag the progress slider to 100% (or hit auto-train) and watch the blue bars drift toward the orange. That drift is the entire job: reshape the geometry until the model's belief matches what the data says.
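The disagreement readout can be sketched as follows. This assumes P_model is a softmax over dot products between one centre vector and a handful of context vectors. The random vectors, the seed, the embedding size, and the hard-coded P_data row for "my" (using the 4:2:2 ratio the article gives for the toy corpus) are illustrative stand-ins, not values taken from the widget.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

def total_disagreement(p, q):
    """Sum of |p(c) - q(c)| over every context word in either distribution."""
    return sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in set(p) | set(q))

rng = random.Random(0)  # arbitrary seed, only for repeatability
DIM = 4                 # hypothetical embedding size
contexts = ["love", "dog", "cat"]

# Freshly initialised vectors: random, so the resulting beliefs are accidental.
centre_vec = [rng.uniform(-1, 1) for _ in range(DIM)]
context_vecs = {w: [rng.uniform(-1, 1) for _ in range(DIM)] for w in contexts}

scores = {w: sum(a * b for a, b in zip(centre_vec, v)) for w, v in context_vecs.items()}
p_model = softmax(scores)

# Assumed row for the centre word "my", from the 4:2:2 ratio in the toy corpus.
p_data = {"love": 0.5, "dog": 0.25, "cat": 0.25}

print("P_model:", {w: round(p, 3) for w, p in p_model.items()})
print("disagreement:", round(total_disagreement(p_data, p_model), 3))
```

The measure is 0 when the two distributions agree exactly and can never exceed 2, which it reaches only when they share no probability mass.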

Where does \(P_{\text{data}}\) actually come from?

Worth pausing on something that might feel circular. We keep saying we don't want to build a co-occurrence table, yet we keep referring to \(P_{\text{data}}\) as if it exists.

\(P_{\text{data}}\) isn't something we construct first and then use for training. It's already encoded in the way training data appears.

The word "my" appears four times across our four sentences. Every time, "love" is next to it. "Dog" and "cat" appear next to it only twice each. Nothing has been counted or normalised, and yet the data has already expressed a preference. Experiencing ("my", "love") four times and ("my", "dog") twice is a statement of relative likelihood.

\(P_{\text{data}}\) is not constructed. It's revealed through repeated experience.

Instead of computing \(P_{\text{data}}\), we sample from it. Every centre–context pair the corpus emits is one draw from the data distribution. Pairs that occur often are sampled often. Pairs that never occur are never sampled. The corpus doesn't describe the distribution. It instantiates it.
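A minimal sketch of sampling instead of computing, assuming a four-sentence corpus reconstructed to match the counts in the text (the article does not list its default sentences) and a window of one. The generator name `emit_pairs` and the number of draws are hypothetical. No table is built for training; the tally at the end exists only to inspect what the stream looked like.

```python
import random
from collections import Counter

# Assumed toy corpus, reconstructed to match the counts quoted in the text.
SENTENCES = [
    "I love my dog",
    "I love my cat",
    "You love my dog",
    "You love my cat",
]

def emit_pairs(sentences, window=1):
    """Yield (centre, context) pairs in the order the corpus presents them."""
    for sentence in sentences:
        words = sentence.split()
        for i, centre in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    yield centre, words[j]

PAIRS = list(emit_pairs(SENTENCES))

# Drawing pairs at random from the stream is drawing from the data distribution.
rng = random.Random(0)  # arbitrary seed for repeatability
draws = [rng.choice(PAIRS) for _ in range(10_000)]

# Tally only to inspect the stream; training itself never needs this table.
seen_with_my = Counter(context for centre, context in draws if centre == "my")
total = sum(seen_with_my.values())
for context, n in seen_with_my.most_common():
    print(context, round(n / total, 2))
```

Pairs that never occur in the corpus, such as ("dog", "cat"), can never be drawn, while ("my", "love") turns up about twice as often as ("my", "dog").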

What training actually does, one experience at a time

The model doesn't sit in front of the entire corpus. It sees one small experience at a time: a single centre word, a single context word. A score is computed. That score becomes a probability. The vectors are nudged. The adjustment from a single example is tiny.

Training doesn't stop at one example. The same kinds of experiences keep returning. When ("my", "love") appears repeatedly, the model receives the same directional push again and again. Each push is small. They all point in the same direction. Over time they leave a visible mark on the geometry.

("My", "dog") also pushes, but only twice. The cumulative effect is weaker. This difference in repetition is enough. Without ever counting, the model is exposed to the same imbalance that exists in the data.

Training works because repetition carries structure. Frequent events carve deep grooves. Rare events leave faint traces.
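The loop below is a sketch of that process, not the article's implementation. It assumes a softmax over dot products, separate centre and context vectors, and a nudge equal to the gradient of the log-probability of the observed context word. The text has not yet specified the update rule, so that choice, the corpus reconstruction, and every hyperparameter (`DIM`, `LR`, `EPOCHS`, the seed) are assumptions.

```python
import math
import random

# Assumed toy corpus, reconstructed to match the counts quoted in the text.
SENTENCES = ["I love my dog", "I love my cat", "You love my dog", "You love my cat"]
VOCAB = sorted({w for s in SENTENCES for w in s.split()})
DIM, LR, EPOCHS = 8, 0.01, 400  # hypothetical hyperparameters

def emit_pairs(sentences, window=1):
    """Yield (centre, context) pairs, one small experience at a time."""
    for sentence in sentences:
        words = sentence.split()
        for i, centre in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    yield centre, words[j]

rng = random.Random(0)  # arbitrary seed for repeatability
centre_vecs = {w: [rng.uniform(-0.5, 0.5) for _ in range(DIM)] for w in VOCAB}
context_vecs = {w: [rng.uniform(-0.5, 0.5) for _ in range(DIM)] for w in VOCAB}

def p_model(centre):
    """Softmax over dot products between the centre vector and every context vector."""
    v = centre_vecs[centre]
    scores = {c: sum(a * b for a, b in zip(v, u)) for c, u in context_vecs.items()}
    m = max(scores.values())
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def nudge(centre, observed):
    """One small step that raises P_model(observed | centre)."""
    probs = p_model(centre)
    v = centre_vecs[centre][:]  # copy so both updates use the old centre vector
    for c, u in context_vecs.items():
        err = (1.0 if c == observed else 0.0) - probs[c]
        for k in range(DIM):
            centre_vecs[centre][k] += LR * err * u[k]  # uses u[k] before it changes
            u[k] += LR * err * v[k]

before = p_model("my")
pairs = list(emit_pairs(SENTENCES))
for _ in range(EPOCHS):
    rng.shuffle(pairs)  # experiences arrive in no particular order
    for centre, observed in pairs:
        nudge(centre, observed)
after = p_model("my")

for word in ("love", "dog", "cat"):
    print(word, round(before[word], 2), "->", round(after[word], 2))
```

Nothing in the loop counts pairs or normalises a table. The probability of "love" given "my" ends up above "dog" and "cat" only because its pair returns twice as often. Exact values depend on the seed and hyperparameters.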

A tiny numerical example

Start with the centre word "my." The corpus shows ("my", "love") four times and ("my", "dog") twice. At initialisation the model assigns roughly equal probabilities:

\(P_{\text{model}}(\text{love} \mid \text{my}) = 0.34, \quad P_{\text{model}}(\text{dog} \mid \text{my}) = 0.33, \quad P_{\text{model}}(\text{cat} \mid \text{my}) = 0.33\)

Suppose each encounter with ("my", "love") nudges P(love | my) up by about 0.02, with the model renormalising afterwards. Four encounters add up to roughly +0.08:

\(P_{\text{model}}(\text{love} \mid \text{my}) \approx 0.42\)

Because probability mass is conserved, that +0.08 is pulled from "dog" and "cat", which each drop to ~0.29. "Love" rose above the others because the pair returned more often. The corpus applied pressure repeatedly in the same direction.
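Spelled out, and assuming (as the example implicitly does) that only the four ("my", "love") encounters are tracked and that the lost mass is split evenly between the other two words:

\(0.34 + 4 \times 0.02 = 0.42, \qquad 0.33 - \tfrac{0.08}{2} = 0.29, \qquad 0.42 + 0.29 + 0.29 = 1.00\)

The ("my", "dog") and ("my", "cat") encounters push back in the same way, but each arrives only half as often, so "love" still ends up ahead.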

Watch the probability rise, one encounter at a time

[Interactive chart: P_model(c | my) for love, dog and cat, updated one encounter at a time over 8 steps with a "Next encounter" button.]

Each click draws one centre-context pair from the toy corpus, exactly as a training loop would. Under window=1, the word "my" has three possible neighbours: love, dog, cat. The corpus emits them in a 4:2:2 ratio. Watch P(love) climb past 0.4 as it gets pulled more often than the others. Because probabilities sum to 1, every rise in one bar forces a drop in others. Turn on auto-play to see the accumulation in real time.

The model doesn't need an explicit table of \(P_{\text{data}}\). It experiences it as uneven repetition. More repetitions produce stronger cumulative drift. Fewer repetitions produce weaker drift.

Training is not about reacting to one event. It's about letting frequency accumulate into geometry.

Where we go next

We now have two distributions in play. One is implicit in the data, revealed through repeated experience. The other lives inside the model, expressed as probabilities derived from its vectors. Training is about bringing them into alignment.

The model doesn't yet have a way to judge its own beliefs. It has preferences, but no notion of error. Next we'll make this comparison explicit: how different is \(P_{\text{model}}\) from \(P_{\text{data}}\)? Once that question is well-defined, the remaining pieces of learning fall into place.
