Before Transformers: How Embeddings Learn Meaning

Why Raw Dot Products Aren't Enough

How scores become probabilities that let contexts compete.

Picking up the thread

Last time we got the dot product working as a compatibility signal between centre and context vectors. Larger values suggest stronger association. As sentences accumulate, vectors drift and meaning appears as structure. It feels like learning.

Yet the raw dot product is a surprisingly awkward object to build learning around. Three problems surface the moment you look closely.

First, the score is unbounded. No ceiling, no floor. If vector magnitudes grow, the score grows automatically. The model can inflate its numbers just by stretching the geometry, even if the underlying relationships stay poorly structured.
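To make the inflation concrete, here is a minimal sketch in plain Python. The vectors and the scale factor of 10 are made up for illustration; nothing here comes from a trained model. Stretching both vectors changes no direction and no relationship, yet the score grows a hundredfold.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def scale(u, k):
    """Stretch a vector by a constant factor without changing its direction."""
    return [k * a for a in u]

# Hypothetical 3-dimensional centre and context vectors (values are invented).
centre = [0.5, -1.0, 2.0]
context = [1.0, 0.5, 1.5]

base_score = dot(centre, context)
inflated_score = dot(scale(centre, 10), scale(context, 10))

# Same directions, same angle between the vectors, very different score.
print(base_score, inflated_score)
```

Scaling each vector by 10 multiplies the score by 100, which is exactly the kind of growth that says nothing about how well the geometry is organised.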

Second, the sign is ambiguous. The score freely crosses zero. What does a negative value actually mean? Active opposition? Weak association? How much worse is −3 than −1? The number itself doesn't answer.

Third, and deepest: scores are independent. When we compute the score between a centre word and one context word, that calculation has no awareness of any other possible context. Nothing forces different contexts to compete. Nothing says that increasing the plausibility of one should decrease the plausibility of another.

The score tells us whether two vectors align, but it doesn't tell us which context should be preferred among all possibilities. There's no distribution. No conservation of plausibility. Each score floats on its own.

A dot product is unbounded, sign-ambiguous, and isolated. It can react to data, but on its own it can't express a preference among alternatives, and preference is what learning needs.

Fixing the sign: why scores must become positive

If we want scores to act as plausibility signals, they need to live on a single, interpretable scale. They need to be positive. Not because negativity is mathematically invalid, but because learning needs a notion of strength, not opposition.

The fix is the exponential function. We take the raw dot product and pass it through \(\exp(\cdot)\). Two properties make this almost unavoidable. First, it maps every real number to a positive value. No matter how negative the score, the output is still greater than zero. Second, it turns differences into ratios. If one score exceeds another by a fixed amount, their exponentials differ by a fixed multiplicative factor, because \(\exp(a) / \exp(b) = \exp(a - b)\). A gap of 1 unit always means a factor of \(e\), regardless of where on the number line it occurs.

Exponentials convert additive differences into multiplicative preference.
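A quick way to check both properties is to exponentiate a few scores and look at the outputs. The starting scores below are arbitrary picks, chosen only to cover negative, zero, and positive values.

```python
import math

# Arbitrary starting scores: one negative, one zero, one positive.
starts = (-5.0, 0.0, 3.0)

# Property 1: every exponentiated score is positive, however negative the input.
exps = [math.exp(s) for s in starts]

# Property 2: a gap of exactly 1 unit always becomes the same ratio, e.
ratios = [math.exp(s + 1.0) / math.exp(s) for s in starts]

for s, e, r in zip(starts, exps, ratios):
    print(f"score {s:+.1f}: exp = {e:.4f}, ratio for a +1 gap = {r:.4f}")
```

Each line reports a ratio of about 2.7183, the value of \(e\), whether the gap sits at −5 or at 3.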

After this transformation every score is positive. Negative scores become small plausibility; positive scores become large plausibility. The sign ambiguity is gone. Notice what hasn't happened though: these values still don't compete. Each exponentiated score exists on its own. Fixing that requires looking at all contexts together.

When contexts begin to compete

Consider the sentence "I love my dog". When "love" appears, many context words are theoretically possible. "My" is plausible. "The" is plausible. In the actual sentence, only "my" appears. Its presence is also an exclusion. Every observed word implicitly says: this happened, and everything else did not.

Language creates meaning by choosing one alternative and rejecting many others.

As long as our scores are independent, this tension can't exist. The model can assign high plausibility to "my", "the", "a" and "your" simultaneously. Nothing forces preference. So instead of asking whether a single context fits, we ask: given a fixed centre word, how should plausibility be distributed across all possible context words?

Take all exponentiated scores and sum them:

\(Z(\mathbf{v}) = \sum_{j=1}^{V} \exp(\mathbf{v} \cdot \mathbf{c}_j)\)

Here \(V\) is the vocabulary size and \(\mathbf{c}_j\) is the context vector of word \(j\). This total forces the model to look at the entire vocabulary at once. Now each context must be read as a fraction of the whole:

\(P(i \mid \mathbf{v}) = \frac{\exp(\mathbf{v} \cdot \mathbf{c}_i)}{Z(\mathbf{v})}\)
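The two formulas translate almost line for line into code. This is a minimal sketch, not a production implementation: the function name `softmax` and the four example scores are assumptions made for illustration, and the scores stand in for dot products \(\mathbf{v} \cdot \mathbf{c}_j\) that would normally come from learned vectors.

```python
import math

def softmax(scores):
    """Turn raw scores into shares of a fixed budget of 1."""
    exps = [math.exp(s) for s in scores]   # numerators: exp(v . c_j)
    z = sum(exps)                          # Z(v): total over all contexts
    return [e / z for e in exps]           # P(i | v) = exp(v . c_i) / Z(v)

# Made-up dot products for a tiny four-word vocabulary.
scores = [1.0, 0.3, -0.5, -2.0]
probs = softmax(scores)

print([round(p, 3) for p in probs])
print(sum(probs))
```

Real implementations usually subtract the largest score before exponentiating to avoid overflow; that step leaves the result unchanged and is omitted here to keep the code close to the formula.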

[Interactive figure: "Drag the scores. Watch them become probabilities." Three panels: raw scores v · c_i; after exp, exp(v · c_i) with total Z = Σ exp(…); after ÷ Z, exp(…) / Z with Σ P = 1.]

Drag the sliders. The first panel shows raw dot products — can be negative, unbounded. The second panel shows them after exp — all positive, but they don't add up to anything meaningful yet. The third panel divides by Z, and suddenly the bars fit inside a fixed budget of 1. Push one slider up and watch the other probabilities shrink. That shrinking is the whole point: plausibility is no longer free.

This single division introduces dependency. If "love"–"my" compatibility increases, the numerator grows, but the denominator grows too, and the share available to "the", "a" and "your" shrinks. To favour one context, the model must implicitly disfavour others. Plausibility is no longer free. It's conserved.
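The same effect can be reproduced in a few lines. The sketch below assumes four context words and invented scores; only the score for "my" is changed between the two runs.

```python
import math

def softmax(scores):
    """Exponentiate each score and divide by the total."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

words = ["my", "the", "a", "your"]
scores_before = [1.0, 0.8, 0.5, 0.2]   # invented compatibility scores
scores_after = [2.0, 0.8, 0.5, 0.2]    # only the score for "my" is raised

p_before = softmax(scores_before)
p_after = softmax(scores_after)

for w, b, a in zip(words, p_before, p_after):
    print(f"{w:>5}: {b:.3f} -> {a:.3f} ({a - b:+.3f})")
```

The scores for "the", "a" and "your" never change, yet their probabilities all fall by the same factor, because they now share a larger denominator.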

Preference emerges only when plausibility is shared.

[Interactive figure: "Push "my" up. Watch everyone else shrink." A single slider sets the raw score for "my"; Σ P stays at 1.00 throughout.]

Only one score moves. All five probabilities update. As "my" climbs, its probability rises — but so does the denominator, so every other probability shrinks proportionally. The green deltas show each word gaining, the red deltas show each losing. The sum stays at exactly 1. Plausibility moved from the losers into the winner. Nothing was created.

As score gaps widen, the context with the largest dot product claims a larger share. In the extreme, if one score dominates, its probability approaches 1 while the rest approach 0. They never reach exactly zero though. Every alternative retains some nonzero probability, so uncertainty is preserved. This is why the transformation is called softmax, a soft maximum: it prefers without committing absolutely.

[Interactive figure: "Soft or sharp? The same scores, dialled up and down." A slider sets the score multiplier β; readouts show the entropy H(P) in bits and the largest probability P_max.]

Same five scores. All we do is multiply them by β before softmax. At β near 0 the distribution is nearly flat — every context looks equally plausible. At large β it collapses onto whichever score was highest. Entropy — a measure of how much uncertainty remains — drops toward zero as β grows. The softmax approaches an argmax but never quite commits. Non-winners stay at tiny but nonzero probabilities. That "softness" is what keeps learning smooth.
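The β experiment is easy to replay offline. In this sketch the five scores and the β values are invented, and `entropy_bits` is a helper name introduced here; it computes the Shannon entropy in bits, \(H(P) = -\sum_i P_i \log_2 P_i\).

```python
import math

def softmax(scores, beta=1.0):
    """Softmax of the scores after multiplying each by beta."""
    exps = [math.exp(beta * s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Five made-up scores, kept fixed while beta changes.
scores = [2.0, 1.0, 0.5, 0.0, -1.0]
betas = [0.0, 1.0, 5.0, 20.0]

results = []
for beta in betas:
    probs = softmax(scores, beta)
    results.append((beta, entropy_bits(probs), max(probs), min(probs)))
    print(f"beta={beta:>4}: H={results[-1][1]:.4f} bits, P_max={results[-1][2]:.6f}")
```

At β = 0 the five probabilities are equal and the entropy is \(\log_2 5 \approx 2.32\) bits. As β grows the entropy falls and the largest probability approaches 1, while the smallest stays above zero.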

Each \(P(i \mid \mathbf{v})\) is non-negative, and they sum to 1:

\(\sum_{i=1}^{V} P(i \mid \mathbf{v}) = 1\)

Without setting out to do so, we've crossed into probability. For a fixed centre word, these values form a probability mass function over the vocabulary. Probability mass is conserved. It can only move from one context word to another as the geometry changes. Meaning now lives not in individual scores, but in how belief is distributed across alternatives.

Walking through a small example

Three possible context words. Dot products with the centre vector:

\(\mathbf{v} \cdot \mathbf{c}_1 = 2, \quad \mathbf{v} \cdot \mathbf{c}_2 = 1, \quad \mathbf{v} \cdot \mathbf{c}_3 = 0\)

After exponentiation:

\(\exp(2) \approx 7.39, \quad \exp(1) \approx 2.72, \quad \exp(0) = 1\)

The total:

\(Z \approx 7.39 + 2.72 + 1 = 11.11\)

Divide each by \(Z\):

\(P(1 \mid \mathbf{v}) \approx 0.665, \quad P(2 \mid \mathbf{v}) \approx 0.245, \quad P(3 \mid \mathbf{v}) \approx 0.090\)

The ordering hasn't changed. The strongest match is still strongest. What's changed is the interpretation. These numbers express belief. They describe how plausibility is distributed, not just how large a score happens to be. And they sum to 1, which means increasing one must decrease the others.
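The arithmetic is short enough to check by machine. This sketch uses the three scores from the example and nothing else; the variable names are just labels chosen here.

```python
import math

scores = [2.0, 1.0, 0.0]               # the three dot products from the example

exps = [math.exp(s) for s in scores]   # exponentiate each score
z = sum(exps)                          # the total Z
probs = [e / z for e in exps]          # each score's share of Z

print([round(e, 2) for e in exps])     # exponentiated scores, two decimals
print(round(z, 2))                     # Z, two decimals
print([round(p, 3) for p in probs])    # probabilities, three decimals
print(sum(probs))                      # should be 1 up to floating-point error
```

Printing the probabilities to three decimals shows the shares adding up to the full budget of 1, with the ordering of the original scores intact.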

Probability appears because plausibility is forced to be shared.

The model is no longer producing raw scores. It's expressing a conditional probability distribution over the vocabulary. Distributions can be compared to reality. That comparison is where learning truly begins.

Where we go next

We started with a raw dot product. Each transformation was motivated by a problem we ran into. Removing sign ambiguity. Forcing comparison. Expressing belief in a shared form. We arrived at a probability distribution over context words.

The next post asks how these probabilities should behave when they meet real data, and how disagreement between belief and observation creates the pressure that drives learning.
