Before Transformers: How Embeddings Learn Meaning

What the Dot Product Really Measures

How centre and context words recognise each other in embedding space.

Where we stand

Meaning isn't stored in a table. It's encoded in the geometry of a space. Words with similar histories end up close together. Each word carries two representations: a centre vector for when it's the focus, a context vector for when it's in someone else's neighbourhood. Together these let us reconstruct the relational structure that a V×V table once tried to store explicitly.

We understand what the vectors represent. What we haven't answered is how to find them.

We have a representation but no procedure. We know the vectors need to end up arranged in a way that reflects language, but we haven't said how that arrangement comes about. The goal now is precise: find values for the centre vectors \(V\) and context vectors \(C\) that best reflect the structure of language as observed in text.

What makes one set of vectors better than another

Back to our toy corpus:

I love my dog. · You love my dog. · I love my cat. · You love my cat.

"Dog" and "cat" appear in identical contexts. "I" and "You" behave symmetrically. "Love" always sits between a pronoun and "my." Any reasonable vectors should reflect these regularities. Each word \(w\) gets a centre vector \(\mathbf{v}_w \in \mathbb{R}^d\) and a context vector \(\mathbf{c}_w \in \mathbb{R}^d\). Their compatibility is:

\(s(w_i, w_j) = \mathbf{v}_{w_i}^\top \mathbf{c}_{w_j}\)

Use \(d=2\) and compare two candidate geometries.

Candidate A: vectors that work

Set \(\mathbf{v}_{\text{love}} = \begin{bmatrix} 2 \\ 1 \end{bmatrix}\), with context vectors \(\mathbf{c}_{\text{my}} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\), \(\mathbf{c}_{\text{dog}} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\), \(\mathbf{c}_{\text{cat}} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\), and \(\mathbf{c}_{\text{you}} = \begin{bmatrix} -1 \\ 0 \end{bmatrix}\).

The dot products:

\(\mathbf{v}_{\text{love}}^\top \mathbf{c}_{\text{my}} = 2 \cdot 1 + 1 \cdot 1 = 3\)

\(\mathbf{v}_{\text{love}}^\top \mathbf{c}_{\text{dog}} = 2 \cdot 1 + 1 \cdot 0 = 2\)

\(\mathbf{v}_{\text{love}}^\top \mathbf{c}_{\text{cat}} = 2 \cdot 1 + 1 \cdot 0 = 2\)

\(\mathbf{v}_{\text{love}}^\top \mathbf{c}_{\text{you}} = 2 \cdot (-1) + 1 \cdot 0 = -2\)

Observed pairs (love-my, love-dog, love-cat) get high scores. An unobserved pair (love-you as context) gets a negative score. The vectors distinguish real relationships from unlikely ones.
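The arithmetic for Candidate A is easy to check by hand, but a few lines of Python make it repeatable. This is a minimal sketch in plain Python; the helper name `dot` and the dictionary layout are illustrative choices, not something the text prescribes.

```python
def dot(a, b):
    """Dot product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))

# Candidate A, copied from the text.
v_love = [2, 1]
contexts = {
    "my": [1, 1],
    "dog": [1, 0],
    "cat": [1, 0],
    "you": [-1, 0],
}

# Score "love" (as centre) against every context vector.
scores = {word: dot(v_love, c) for word, c in contexts.items()}

for word, s in scores.items():
    print(f"love . {word} = {s}")
```

The three pairs labelled as observed come out at 3, 2 and 2, and the pair labelled as unobserved comes out at -2, matching the hand calculation.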

This geometry makes observed word relationships numerically stronger than unlikely ones.

Candidate B: vectors that don't

Now set \(\mathbf{v}_{\text{love}} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\) and give all context words the same vector \(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\). Every dot product comes out to 1. The geometry makes no distinction between observed and unobserved pairs. It's numerically valid but structurally useless.

Two geometries. One works. One doesn't.

Candidate A discriminates: observed pairs score high, an unobserved pair scores negative, and the geometry carries information about what's likely and what isn't.

Candidate B collapses: all context vectors point in one direction, so every score comes out the same. Numerically valid, structurally empty.

The same operation, the dot product, is used on both. The difference isn't in the formula. It's in where the vectors sit. Candidate A spreads context vectors across different directions, so the centre vector for "love" can tell them apart. Candidate B collapses everything onto one line, so all contexts look identical to any centre word. A working embedding space is not about having vectors. It's about having them arranged so that their interactions produce meaningful differences.

Not every choice of vectors is equally meaningful, even if dot products can be computed.

So "better" means: a set of vectors whose dot products systematically assign higher values to word pairs supported by the corpus than to those that aren't. The vectors are judged entirely by the behaviour they produce when they interact.
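One way to make this notion of "better" concrete, offered as a sketch rather than a definition: compare the lowest score among observed pairs with the highest score among unobserved pairs. The function name `margin` and the split into `observed` and `unobserved` lists are assumptions made for illustration; they follow the labels the text gives to Candidates A and B.

```python
def dot(a, b):
    """Dot product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))

def margin(centre, observed, unobserved):
    """Lowest observed score minus highest unobserved score.

    A positive value means every observed pair outscores every
    unobserved pair for this centre vector.
    """
    lowest_observed = min(dot(centre, c) for c in observed)
    highest_unobserved = max(dot(centre, c) for c in unobserved)
    return lowest_observed - highest_unobserved

# Candidate A: contexts my, dog, cat are observed; "you" is not.
margin_a = margin([2, 1],
                  observed=[[1, 1], [1, 0], [1, 0]],
                  unobserved=[[-1, 0]])

# Candidate B: every context word shares the vector [1, 0].
margin_b = margin([1, 0],
                  observed=[[1, 0], [1, 0], [1, 0]],
                  unobserved=[[1, 0]])

print(margin_a, margin_b)
```

Candidate A leaves a gap of 4 between the two groups. Candidate B leaves a gap of 0, which is the collapse expressed as a single number.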

Why the dot product, and why two vectors

This might look like an arbitrary design choice. Why the dot product? Why two separate representations?

The asymmetry comes first. In "I love my dog", when "love" is the focus, it's the word being explained by its neighbours. "My", sitting in that neighbourhood, is doing the explaining. These roles aren't interchangeable. That's why we need separate centre and context vectors. They encode different kinds of information about the same word.

Now we need a way to combine them. The combination has to produce a single number reflecting compatibility, increase when vectors align, and decrease when they don't. The dot product does exactly this:

\(\mathbf{v}_{w_i}^\top \mathbf{c}_{w_j} = \|\mathbf{v}_{w_i}\| \|\mathbf{c}_{w_j}\| \cos \theta\)

It responds to both direction and magnitude. Similar directions produce a large value. Opposite directions produce a negative one. Orthogonal directions produce zero. Critically, it's linear in each vector: hold one fixed, and a small change in the other produces a proportionally small change in the score. This smoothness is what makes incremental learning possible. You can adjust vectors one encounter at a time without causing instability.
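Both properties can be checked numerically. The sketch below uses the Candidate A vectors for "love" and "my"; the nudge `delta` is an arbitrary small vector chosen for illustration. It confirms that the component-wise dot product agrees with the length-times-cosine form, and that nudging the centre vector changes the score by exactly the dot product of the nudge with the context vector.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

v_love = [2.0, 1.0]
c_my = [1.0, 1.0]

# Angle between the two vectors, computed from their individual headings.
theta = math.atan2(v_love[1], v_love[0]) - math.atan2(c_my[1], c_my[0])

algebraic = dot(v_love, c_my)                              # sum of products
geometric = norm(v_love) * norm(c_my) * math.cos(theta)    # |v| |c| cos(theta)

# Linearity in the centre vector: the score changes by delta . c.
delta = [0.01, -0.02]  # arbitrary small nudge (illustrative)
nudged = [x + d for x, d in zip(v_love, delta)]
change = dot(nudged, c_my) - algebraic
predicted = dot(delta, c_my)

print(algebraic, geometric, change, predicted)
```

Up to floating-point rounding, the two forms of the score agree, and the observed change matches the predicted one. The same check works with the roles swapped, nudging the context vector and holding the centre fixed.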

The dot product turns geometric alignment into a numerical signal of compatibility. It's the simplest operation that makes learning geometry possible.

Why alignment happens between centre and context, not between similar words

This is the point people most often get wrong. Similar words don't get pulled directly toward each other during learning. That's not what happens.

Vectors don't align because words are similar. Words become similar because of how their vectors align with contexts.

Alignment doesn't happen between words. It happens between a centre word and its context.

In "I love my dog", when "love" is the centre, the learning signal compares the centre representation of "love" with the context representations of "I", "my", and "dog". It never compares "love" and "dog" as equals.

"Dog" and "cat" end up close not because they're ever pushed toward each other. They end up close because they're both pushed toward many of the same context vectors: "my", "love", "you". Similar pressures produce similar positions. Clustering is a side effect of repeated centre-context interactions. It's what we observe after learning, not something imposed during it.

[Interactive figure: Watch alignment happen centre-to-context, not word-to-word. Legend: centre vector being trained; context vectors it's pulling toward; uninvolved words.]

The figure steps through the corpus one sentence at a time. The centre word rotates through "love", "dog", "cat". For each centre, pulling arrows fire from the centre toward its context words. Crucially, "dog" and "cat" are never each other's centre or context in the same step. They never directly interact. Yet they end up close together, because every time "my" and "love" appear, both "dog" and "cat" are being tugged toward the same fixed points in space. Similar pressure, similar landing spot. Clustering is an emergent side effect, not a rule.
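The claim that shared pressure produces shared position can be simulated in a few lines. The sketch below is deliberately crude and rests on assumptions that are not part of the text: the context vectors are held fixed, the starting positions are arbitrary, and the `pull` rule (move a fraction of the way toward the context vector) is a stand-in for whatever update a real training procedure would use. Its only purpose is to show two centre vectors converging without ever touching each other.

```python
import math

# Hypothetical fixed context vectors (illustrative values, not learned).
context = {"my": [1.0, 1.0], "love": [2.0, 1.0]}

# Centre vectors for "dog" and "cat" start far apart (arbitrary values).
centre = {"dog": [-3.0, 2.0], "cat": [4.0, -5.0]}

def pull(v, c, rate):
    """Move v a fraction `rate` of the way toward c (stand-in update rule)."""
    return [vi + rate * (ci - vi) for vi, ci in zip(v, c)]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

start_gap = distance(centre["dog"], centre["cat"])

for step in range(50):
    for word in ("dog", "cat"):        # one centre word at a time
        for ctx in ("my", "love"):     # the contexts both words share
            centre[word] = pull(centre[word], context[ctx], rate=0.1)

end_gap = distance(centre["dog"], centre["cat"])

print(f"gap before: {start_gap:.3f}, gap after: {end_gap:.6f}")
```

No line of the loop reads "dog" while updating "cat", or the other way round. The gap still shrinks to almost nothing, because both words receive the same sequence of pulls.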

The dot product as a score

From here on, we'll call the dot product between a centre vector and a context vector a score:

\(\text{score}(w_i, w_j) = \mathbf{v}_{w_i}^\top \mathbf{c}_{w_j}\)

This score is the model's internal assessment of how compatible a centre word is with a context word. It's not a probability. Not a count. Just a raw signal reflecting how strongly the current geometry believes these two words belong together.

Higher score means the model thinks this pairing is plausible. Lower means unlikely. Near zero means the geometry expresses no preference either way.
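In code, the score is a lookup into two separate tables followed by a dot product. The sketch below assumes a dictionary per table and a helper named `score`; the vector values are made up for illustration, apart from the centre vector for "love" and the context vector for "my", which reuse Candidate A.

```python
# Two tables: one vector per word for each role (values are illustrative).
centre = {"love": [2.0, 1.0], "my": [0.5, -1.0]}
context = {"love": [0.0, 1.0], "my": [1.0, 1.0]}

def score(centre_word, context_word):
    """Dot product of the centre vector of one word with the
    context vector of another."""
    v = centre[centre_word]
    c = context[context_word]
    return sum(x * y for x, y in zip(v, c))

love_my = score("love", "my")   # "love" as centre, "my" as context
my_love = score("my", "love")   # roles swapped: different vectors are used

print(love_my, my_love)
```

Because each word has two vectors, swapping the roles generally changes the score. With these values the first call gives 3.0 and the second gives -1.0.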

The objective restated: choose vectors so that scores for frequently observed pairs are high, and scores for rare or absent pairs are low. Learning vectors is equivalent to tuning scores so they reflect what's actually true in text.


When is a score actually good or bad?

And now the awkward question. We compute a score of 2.3 for a word pair. Is that good? What about 0.1? What about negative?

A dot product is just a number. Its scale depends on vector lengths and orientation. Without additional structure, there's no clear boundary telling us when a score is "high enough" or "too low". A score of 1.5 might be high for one centre word and low for another.

[Interactive figure: Same score, different meaning. Sliders scale the centre vectors for "love", "eat", and "sleep"; bars show each word's dot product with the context vector "my".]

Picture three centre words, all pointed at the same context vector "my", each with an adjustable length. Notice how a score of 1.5 can be "very strong" for one centre word and "barely above average" for another, depending entirely on vector length. The ranking within one word is meaningful. Comparing raw scores across different words is not. A dot product is a signal, not yet a verdict.

A score tells us how strong a relationship is, but not whether that strength is good enough.

The dot product gives us a way to rank word pairs for a given centre word, but it doesn't give us a way to judge correctness. We have a signal, but not yet a metric. We can compute scores, but we can't say whether a particular score represents success or failure.
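The difference between ranking and judging shows up as soon as a centre vector changes length. The sketch below reuses three of the Candidate A context vectors and halves the centre vector for "love"; the helper names `scores_for` and `ranking` are illustrative.

```python
def dot(a, b):
    """Dot product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))

contexts = {"my": [1.0, 1.0], "dog": [1.0, 0.0], "you": [-1.0, 0.0]}
v_love = [2.0, 1.0]

def scores_for(v):
    """Score one centre vector against every context vector."""
    return {word: dot(v, c) for word, c in contexts.items()}

def ranking(scores):
    """Context words ordered from highest score to lowest."""
    return sorted(scores, key=scores.get, reverse=True)

original = scores_for(v_love)
halved = scores_for([0.5 * x for x in v_love])  # same direction, half the length

print(original, ranking(original))
print(halved, ranking(halved))
```

The ordering of contexts is identical in both cases, but every raw score is cut in half. A score of 1.5 is the top score for the shorter vector, while for the original vector it would sit below both of its top two scores, 3.0 and 2.0.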

This is the gap we need to close. We need to transform raw scores into something we can evaluate, something where we can say "this score is wrong by this much". Once we can measure wrongness, we can use it to guide how vectors should move.

That conversion, from raw scores to evaluable predictions, is what turns scoring into learning. And it's where we're headed next.

One last thing

All the vectors we've used so far are arbitrary. They were chosen to illustrate ideas, not learned from data. The geometry is still untrained. The fact that we can compute scores doesn't mean those scores are meaningful yet.

This is exactly why learning is required. The goal isn't to accept scores from random vectors. It's to adjust the vectors until the scores they produce align with what we observe in real text. Everything that follows is about closing that gap.
