What the Dot Product Really Measures
How centre and context words recognise each other in embedding space.
Where we stand
Meaning isn't stored in a table. It's encoded in the geometry of a space. Words with similar histories end up close together. Each word carries two representations: a centre vector for when it's the focus, a context vector for when it's in someone else's neighbourhood. Together these let us reconstruct the relational structure that a V×V table once tried to store explicitly.
We have a representation but no procedure. We know the vectors need to end up arranged in a way that reflects language, but we haven't said how that arrangement comes about. The goal now is precise: find values for the centre vectors, collected as \(V\), and the context vectors, collected as \(C\), that best reflect the structure of language as observed in text. (Here \(V\) denotes the set of centre vectors, not the vocabulary size from the V×V table.)
What makes one set of vectors better than another
Back to our toy corpus:
I love my dog. · You love my dog. · I love my cat. · You love my cat.
"Dog" and "cat" appear in identical contexts. "I" and "You" behave symmetrically. "Love" always sits between a pronoun and "my." Any reasonable vectors should reflect these regularities. Each word \(w\) gets a centre vector \(\mathbf{v}_w \in \mathbb{R}^d\) and a context vector \(\mathbf{c}_w \in \mathbb{R}^d\). Their compatibility is:
\(s(w_i, w_j) = \mathbf{v}_{w_i}^\top \mathbf{c}_{w_j}\)
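The regularities claimed for the toy corpus can be checked mechanically. The sketch below counts, for every centre word, which words fall inside a small window around it. The window size of 2, the lower-casing, and the helper name `context_counts` are assumptions made for illustration; the text does not fix any of them.

```python
from collections import Counter, defaultdict

corpus = [
    "I love my dog.",
    "You love my dog.",
    "I love my cat.",
    "You love my cat.",
]
WINDOW = 2  # assumed window size; the text does not specify one


def context_counts(sentences, window):
    """Count, for each centre word, the words seen within `window` positions."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().rstrip(".").split()
        for i, centre in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[centre][tokens[j]] += 1
    return counts


counts = context_counts(corpus, WINDOW)
print(dict(counts["dog"]))              # context counts for "dog"
print(dict(counts["cat"]))              # context counts for "cat"
print(counts["dog"] == counts["cat"])   # identical contexts?
print(counts["i"] == counts["you"])     # symmetric pronouns?
```

With this window, "dog" and "cat" have exactly the same context counts, and so do "I" and "You". That is the structure a good set of vectors should reproduce.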
Use \(d=2\) and compare two candidate geometries.
Candidate A: vectors that work
Set \(\mathbf{v}_{\text{love}} = \begin{bmatrix} 2 \\ 1 \end{bmatrix}\), with context vectors \(\mathbf{c}_{\text{my}} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\), \(\mathbf{c}_{\text{dog}} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\), \(\mathbf{c}_{\text{cat}} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\), and \(\mathbf{c}_{\text{you}} = \begin{bmatrix} -1 \\ 0 \end{bmatrix}\).
The dot products:
\(\mathbf{v}_{\text{love}}^\top \mathbf{c}_{\text{my}} = 2 \cdot 1 + 1 \cdot 1 = 3\)
\(\mathbf{v}_{\text{love}}^\top \mathbf{c}_{\text{dog}} = 2 \cdot 1 + 1 \cdot 0 = 2\)
\(\mathbf{v}_{\text{love}}^\top \mathbf{c}_{\text{cat}} = 2 \cdot 1 + 1 \cdot 0 = 2\)
\(\mathbf{v}_{\text{love}}^\top \mathbf{c}_{\text{you}} = 2 \cdot (-1) + 1 \cdot 0 = -2\)
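The pairs love-my, love-dog, and love-cat get high scores. The pair love-you gets a negative score, so this geometry treats it as unlikely. (Love-you stands in for an unsupported pair purely for illustration; in the toy corpus itself, "You" does appear next to "love".) The point is that the vectors can separate pairs we want scored high from pairs we want scored low.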
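The arithmetic for Candidate A is easy to reproduce. This is a minimal sketch in plain Python; the helper `dot` and the dictionary names are introduced here for illustration, and the vectors are the ones given above.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Candidate A: one centre vector and four context vectors, as in the text.
v_love = [2, 1]
context_a = {
    "my": [1, 1],
    "dog": [1, 0],
    "cat": [1, 0],
    "you": [-1, 0],
}

scores_a = {word: dot(v_love, c) for word, c in context_a.items()}
print(scores_a)  # {'my': 3, 'dog': 2, 'cat': 2, 'you': -2}
```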
Candidate B: vectors that don't
Now set \(\mathbf{v}_{\text{love}} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\) and give all context words the same vector \(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\). Every dot product comes out to 1. The geometry makes no distinction between observed and unobserved pairs. It's numerically valid but structurally useless.
Both candidates use the same operation, the dot product. The difference isn't in the formula. It's in where the vectors sit. Candidate A spreads context vectors across different directions, so the centre vector for "love" can tell them apart. Candidate B collapses every context vector onto a single point, so all contexts look identical to any centre word. A working embedding space is not about having vectors; it's about having them arranged so that their interactions produce meaningful differences.
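One rough way to see the contrast is to compare how far apart the scores are in each candidate. The helper `score_spread` below is a hypothetical diagnostic invented for this sketch, not a quantity the text defines: it is simply the largest score minus the smallest.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score_spread(centre, contexts):
    """Largest minus smallest score of `centre` against the given context vectors."""
    scores = [dot(centre, c) for c in contexts.values()]
    return max(scores) - min(scores)


# Candidate A: context vectors point in different directions.
candidate_a = {"my": [1, 1], "dog": [1, 0], "cat": [1, 0], "you": [-1, 0]}
# Candidate B: every context word shares the same vector.
candidate_b = {word: [1, 0] for word in candidate_a}

spread_a = score_spread([2, 1], candidate_a)  # 3 - (-2) = 5
spread_b = score_spread([1, 0], candidate_b)  # 1 - 1 = 0
print(spread_a, spread_b)
```

A spread of zero means the centre word cannot tell any context from any other, which is exactly the failure of Candidate B.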
So "better" means: a set of vectors whose dot products systematically assign higher values to word pairs supported by the corpus than to those that aren't. The vectors are judged entirely by the behaviour they produce when they interact.
Why the dot product, and why two vectors
This might look like an arbitrary design choice. Why the dot product? Why two separate representations?
The asymmetry comes first. In "I love my dog", when "love" is the focus, it's the word being explained by its neighbours. "My", as a neighbour, is helping to explain a word other than itself. These roles aren't interchangeable, which is why we need separate centre and context vectors: they encode different kinds of information about the same word.
Now we need a way to combine them. The combination has to produce a single number reflecting compatibility, increase when vectors align, and decrease when they don't. The dot product does exactly this:
\(\mathbf{v}_{w_i}^\top \mathbf{c}_{w_j} = \|\mathbf{v}_{w_i}\| \|\mathbf{c}_{w_j}\| \cos \theta\)
It responds to both direction and magnitude. Similar directions produce a large positive value. Opposite directions produce a negative one. Orthogonal directions produce zero. Critically, it's linear in each vector: a small change in either vector produces a proportionally small change in the score. This smoothness is what makes incremental learning possible. You can adjust vectors one encounter at a time, and the scores shift gradually rather than jumping.
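These properties can be checked numerically. The sketch below reuses \(\mathbf{v}_{\text{love}}\) and \(\mathbf{c}_{\text{my}}\) from Candidate A; the other vectors and the nudge `dv` are arbitrary values chosen for illustration. The angle is computed independently with `atan2`, so the cosine identity is not assumed in the check.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))


v = [2.0, 1.0]  # centre vector for "love" from Candidate A
c = [1.0, 1.0]  # context vector for "my" from Candidate A

# Angle between the vectors, measured from their individual headings.
theta = math.atan2(c[1], c[0]) - math.atan2(v[1], v[0])
via_cosine = norm(v) * norm(c) * math.cos(theta)
print(dot(v, c), via_cosine)  # both are 3, up to rounding

# Direction determines the sign of the score.
same_direction = dot(v, [4.0, 2.0])     # positive
opposite = dot(v, [-2.0, -1.0])         # negative
orthogonal = dot(v, [-1.0, 2.0])        # zero

# Linearity in the centre vector: nudging v by dv changes the score
# by exactly dot(dv, c).
dv = [0.01, -0.02]
nudged = [a + b for a, b in zip(v, dv)]
change = dot(nudged, c) - dot(v, c)
print(change, dot(dv, c))
```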
Why alignment happens between centre and context, not between similar words
This is the point people most often get wrong. Similar words don't get pulled directly toward each other during learning. That's not what happens.
Vectors don't align because words are similar. Words become similar because of how their vectors align with contexts.
In "I love my dog", when "love" is the centre, the learning signal compares the centre representation of "love" with the context representations of "I", "my", and "dog". It never compares "love" and "dog" as equals.
"Dog" and "cat" end up close not because they're ever pushed toward each other. They end up close because they're both pushed toward many of the same context vectors: "my", "love", "you". Similar pressures produce similar positions. Clustering is a side effect of repeated centre-context interactions. It's what we observe after learning, not something imposed during it.
Figure (animation): the corpus is processed one sentence at a time while the centre word rotates through "love", "dog", and "cat". For each centre, arrows pull it toward its context words. "Dog" and "cat" are never each other's centre or context in the same step, so they never interact directly. Yet they end up close together, because every time "my" and "love" appear as context, both are tugged toward the same points in space. Similar pressure, similar landing spot. Clustering is an emergent side effect, not a rule.
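The effect can be imitated with a deliberately crude stand-in for learning. The actual update rule has not been introduced yet, so everything below is an assumption: the `pull` rule (move a centre vector a fixed fraction of the way toward a context vector), the step size, the starting positions, and the choice to hold the context vectors fixed. The neighbour lists assume a window of 2 over the toy corpus.

```python
import math

# Context vectors, held fixed for this illustration (an assumption; in real
# training they would move as well).
context = {"love": [2.0, 1.0], "my": [1.0, 1.0]}

# Deliberately distant starting points for the two centre vectors.
centre = {"dog": [-3.0, 4.0], "cat": [5.0, -2.0]}

# Context words seen around each centre word (assumed window of 2).
neighbours = {"dog": ["love", "my"], "cat": ["love", "my"]}

STEP = 0.1  # hypothetical step size


def pull(vec, target, step):
    """Move `vec` a fraction `step` of the way toward `target`."""
    return [x + step * (t - x) for x, t in zip(vec, target)]


start = math.dist(centre["dog"], centre["cat"])
for _ in range(50):
    for word in centre:  # "dog" and "cat" are updated independently
        for ctx in neighbours[word]:
            centre[word] = pull(centre[word], context[ctx], STEP)
end = math.dist(centre["dog"], centre["cat"])

print(start, end)  # the gap shrinks although dog and cat never interact
```

No line of the loop reads "dog" while updating "cat" or the reverse. The two vectors converge only because they are pulled toward the same context vectors.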
The dot product as a score
From here on, we'll call the dot product between a centre vector and a context vector a score:
\(\text{score}(w_i, w_j) = \mathbf{v}_{w_i}^\top \mathbf{c}_{w_j}\)
This score is the model's internal assessment of how compatible a centre word is with a context word. It's not a probability. Not a count. Just a raw signal reflecting how strongly the current geometry believes these two words belong together.
A higher score means the model considers the pairing plausible. A lower score means it considers the pairing unlikely. A score near zero means the current geometry expresses no strong preference either way.
The objective restated: choose vectors so that scores for frequently observed pairs are high, and scores for rare or absent pairs are low. Learning vectors is equivalent to tuning scores so they reflect what's actually true in text.
When is a score actually good or bad?
And now the awkward question. We compute a score of 2.3 for a word pair. Is that good? What about 0.1? What about negative?
A dot product is just a number. Its scale depends on vector lengths and orientation. Without additional structure, there's no clear boundary telling us when a score is "high enough" or "too low". A score of 1.5 might be high for one centre word and low for another.
Figure (interactive): three centre words, all scored against the same context vector "my", with each word's dot product drawn as a bar and a slider controlling each vector's length. A score of 1.5 can be "very strong" for one centre word and "barely above average" for another, depending entirely on vector length. The ranking within one word is meaningful. Comparing raw scores across different words is not. A dot product is a signal, not yet a verdict.
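The same point can be made with a few lines of arithmetic. The three centre vectors below are hypothetical: they share a direction and differ only in length, and their labels ("short", "medium", "long") are placeholders rather than words from the corpus. The context vectors are borrowed from Candidate A.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


contexts = {"my": [1.0, 1.0], "dog": [1.0, 0.0], "you": [-1.0, 0.0]}

# Hypothetical centre vectors: same direction, different lengths.
centres = {
    "short": [0.5, 0.25],
    "medium": [1.0, 0.5],
    "long": [4.0, 2.0],
}

scores_with_my = {}
rankings = {}
for name, v in centres.items():
    scores = {word: dot(v, c) for word, c in contexts.items()}
    scores_with_my[name] = scores["my"]
    # Context words ordered from highest to lowest score for this centre.
    rankings[name] = sorted(scores, key=scores.get, reverse=True)

print(scores_with_my)  # {'short': 0.75, 'medium': 1.5, 'long': 6.0}
print(rankings)        # the same order for every centre vector
```

A score of 1.5 is the top score for "medium" but would sit well below the top for "long". The ordering of contexts, by contrast, is identical for all three, which is why ranking within one centre word is meaningful while comparing raw scores across centre words is not.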
The dot product gives us a way to rank word pairs for a given centre word, but it doesn't give us a way to judge correctness. We have a signal, but not yet a metric. We can compute scores, but we can't say whether a particular score represents success or failure.
This is the gap we need to close. We need to transform raw scores into something we can evaluate, something where we can say "this score is wrong by this much". Once we can measure wrongness, we can use it to guide how vectors should move.
That conversion, from raw scores to evaluable predictions, is what turns scoring into learning. And it's where we're headed next.
One last thing
All the vectors we've used so far are arbitrary. They were chosen to illustrate ideas, not learned from data. The geometry is still untrained. The fact that we can compute scores doesn't mean those scores are meaningful yet.
This is exactly why learning is required. The goal isn't to accept scores from random vectors. It's to adjust the vectors until the scores they produce align with what we observe in real text. Everything that follows is about closing that gap.