Before Transformers: How Embeddings Learn Meaning

From Words to Spaces

How language becomes geometry — the mathematical foundations that make embeddings, similarity, and attention possible.

Meaning is behaviour, not definition

Ask a philosopher where the meaning of a word lives and they might say it lives inside the word itself. "Dog" means dog. That sounds comforting, as if meaning were neatly sealed inside each word, like a label on a jar. But the moment you try to build a machine that works with language, that idea becomes useless. A computer has no childhood memory of being chased by a dog. No sense of fur. No smell. All it sees is "d", "o", "g" and some numbers behind them.

If meaning isn't hiding inside the word, where does it come from? The answer is less mystical and more practical: we understand words because of the situations they show up in and the company they keep. You didn't learn what "dog" means by staring at the letters. You learned it because "dog" kept appearing near "walk", "bark", "leash", "pet". "Cat" builds its own orbit: "purr", "climb", "couch ownership" (which, let's be honest, is the most accurate description of cats). And some phrases, like "quantum field theory", have a very strong habit of never appearing anywhere near "cute Instagram story".

We understand words because of the situations they appear in and the neighbours they keep.

Linguists noticed this long before machine learning existed. Zellig Harris hinted at it. J. R. Firth gave it its most famous form: "You shall know a word by the company it keeps." A huge chunk of modern NLP is basically an industrial-scale attempt to turn that one sentence into maths.

Distributional hypothesis: words that occur in similar contexts tend to have similar meanings.

A tiny universe of six words

To see the idea working rather than argue about it, imagine our entire corpus is just four sentences:

I love my dog
I love my cat
You love my dog
You love my cat

Six words total. Even before we do any maths, your brain is already spotting structure. "Dog" and "cat" feel similar. "I" and "You" play the same role. "Love my" feels welded together. The challenge is making a machine notice the same things without giving it any outside knowledge. No dictionary. No labels. No "dog and cat are both pets". The only thing the machine observes is who tends to sit next to whom.

The only thing we let the machine see is neighbours. Nothing else.

Defining neighbours

"Context" can mean lots of things: same sentence, within five words, same paragraph. We'll go strict: two words are neighbours if they're adjacent. Window size of 1.

In "I love my dog", the adjacent pairs are (I, love), (love, my), (my, dog). For every pair (a, b), we count both directions: (a, b) and (b, a). This gives us a symmetric matrix, where the cell for (a, b) holds the number of times a and b sit next to each other. We don't care about order yet. Just who bumps into whom.

Do this for all four sentences and add up the counts. Now we have something a machine can work with.
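The counting rule above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the demo: the function name `cooccurrence` and the column order in `vocab` are assumptions made for this example.

```python
corpus = [
    "I love my dog",
    "I love my cat",
    "You love my dog",
    "You love my cat",
]

# Assumed column order for this sketch; any fixed order works.
vocab = ["I", "You", "love", "my", "dog", "cat"]
index = {word: i for i, word in enumerate(vocab)}


def cooccurrence(sentences, index):
    """Count adjacent word pairs in both directions (window size 1)."""
    size = len(index)
    matrix = [[0] * size for _ in range(size)]
    for sentence in sentences:
        words = sentence.split()
        # zip(words, words[1:]) walks over every adjacent pair once.
        for left, right in zip(words, words[1:]):
            matrix[index[left]][index[right]] += 1  # (a, b)
            matrix[index[right]][index[left]] += 1  # (b, a)
    return matrix


matrix = cooccurrence(corpus, index)
for word, row in zip(vocab, matrix):
    print(f"{word:>4} {row}")
```

Because every pair is counted in both directions, the matrix is symmetric by construction: row "dog" at column "my" always equals row "my" at column "dog".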

Interactive demo: build a tiny corpus and watch the matrix appear. Inputs: the sentences in the corpus and the window size. Outputs: the co-occurrence matrix and the neighbour graph.

Start with the original four sentences and notice that "dog" and "cat" get highlighted as twins — their rows are identical. Now try adding both "I love my pizza" and "You love my pizza". Pizza joins the twin group. The machine wasn't told these words are similar. Their behaviour in the corpus forced them into the same place. Switch the window size and watch the structure shift: with a wider window, more distant words start to matter.
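The same experiment can be approximated offline. The sketch below generalises the counting to any window size and groups words whose rows are exactly identical. The helper names `cooccurrence_rows` and `twin_groups` are made up for this example, and exact row equality is only one possible reading of "twins".

```python
from collections import defaultdict


def cooccurrence_rows(sentences, window=1):
    """Map each word to its neighbour counts within `window` positions."""
    rows = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own neighbour
                    rows[word][words[j]] += 1
    return rows


def twin_groups(rows):
    """Group words whose neighbour counts are exactly identical."""
    groups = defaultdict(list)
    for word, counts in rows.items():
        groups[frozenset(counts.items())].append(word)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


corpus = [
    "I love my dog", "I love my cat",
    "You love my dog", "You love my cat",
    "I love my pizza", "You love my pizza",
]

narrow = cooccurrence_rows(corpus, window=1)
wide = cooccurrence_rows(corpus, window=2)

print(twin_groups(narrow))
print(dict(narrow["dog"]))  # neighbours of "dog" with window 1
print(dict(wide["dog"]))    # neighbours of "dog" with window 2
```

With the wider window, "dog" starts picking up counts from "love" as well as "my", which is the "more distant words start to matter" effect in miniature.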

What the matrix actually says

With the original four sentences, and taking the columns in the order I, You, love, my, dog, cat, the rows for dog and cat are identical: [0, 0, 0, 2, 0, 0]. Each of them touches my twice and nothing else. The rows for I and You are identical too: [0, 0, 2, 0, 0, 0]. Meanwhile love and my sit in the middle as bridges that connect subjects to pets.

Nobody told the machine that "dog" and "cat" are similar. It was shown only neighbour behaviour, and the similarity appeared as a structural fact. Just from adjacency.

Without any dictionary, the matrix quietly learned that "dog" and "cat" behave the same way.

Meaning as geometry

Something subtle happened in that matrix. Each row is a vector. That means each word now has a location in 6-dimensional space. Not because we designed a geometry, but because behaviour became geometry once we represented it as counts.

dog is the vector [0, 0, 0, 2, 0, 0]. cat is the same vector. They sit on top of each other. I and You sit together somewhere else. love and my are the busy intersections because they touch many neighbours.

Similar words have similar vectors. Similar vectors live near each other in space.
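One way to put a number on "near each other" is cosine similarity, which compares the direction of two vectors. It is only one common choice, picked here for illustration. The rows below come from the four-sentence corpus with an assumed column order of I, You, love, my, dog, cat. The rows for love and my are worked out from the same counting rule rather than quoted from the text.

```python
import math

# Rows of the co-occurrence matrix; assumed column order:
# [I, You, love, my, dog, cat].
vectors = {
    "I":    [0, 0, 2, 0, 0, 0],
    "You":  [0, 0, 2, 0, 0, 0],
    "love": [2, 2, 0, 4, 0, 0],
    "my":   [0, 0, 4, 0, 2, 2],
    "dog":  [0, 0, 0, 2, 0, 0],
    "cat":  [0, 0, 0, 2, 0, 0],
}


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


for other in ("cat", "I", "love"):
    print(f"dog vs {other}: {cosine(vectors['dog'], vectors[other]):.3f}")
```

Identical rows score 1, and rows with no shared neighbours score 0. In this sketch "dog" and "I" never share a neighbour under a window of 1, so they come out as orthogonal.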

This is the first glimpse of what we'll eventually call an embedding space. It's tiny and clumsy here, six dimensions from four sentences. But the conceptual shift is already visible: meaning looks less like a definition and more like a location. Where a word lives, relative to other words, based on how it's used.

Why this doesn't scale

In real NLP, your vocabulary isn't 6. It might be 50,000 or 200,000. The co-occurrence matrix grows as \(V \times V\). For 50,000 words that's 2.5 billion cells. The Excel sheet that makes your laptop cry.

The problems pile up fast. The matrix is mostly zeros because most word pairs never sit next to each other. Rare words produce tiny, noisy counts. Each row is a 50,000-dimensional vector, which makes similarity calculations expensive. Co-occurrence captures something real about meaning, but it captures it in a form that's computationally awful. It's like keeping your entire life's knowledge as raw CCTV footage. Everything you care about is in there somewhere. It's just not a format you want to work with.
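The arithmetic is easy to check. The sketch below assumes the matrix is stored densely with 4 bytes per cell. That storage choice is an assumption for illustration, and `dense_matrix_cost` is a made-up helper.

```python
def dense_matrix_cost(vocab_size, bytes_per_cell=4):
    """Return (cell count, bytes) for a dense V x V matrix."""
    cells = vocab_size * vocab_size
    return cells, cells * bytes_per_cell


for v in (6, 50_000, 200_000):
    cells, size = dense_matrix_cost(v)
    print(f"V={v:>7,}: {cells:>14,} cells, {size / 1e9:,.1f} GB")
```

Doubling the vocabulary quadruples the matrix, which is why the naive approach gets painful so quickly.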

The same meaning, two very different shapes (vocabulary size V = 50,000, embedding dim d = 300):

Sparse co-occurrence (one row): V-dimensional vector, shape (50000,), 200 KB per word, mostly zeros.

Dense embedding: d-dimensional vector, shape (300,), 1.2 KB per word, every value meaningful.

Both vectors claim to represent the same word. The sparse version carries most of its weight in empty space; the dense one packs every coordinate with information earned through training. At V=50,000 and d=300, assuming 4 bytes per value, the dense vector is roughly 0.6% of the size per word but arguably richer. That ratio is why every modern language model lives on dense vectors.
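The per-word figures follow from one multiplication each. This sketch assumes 4 bytes per value and 1 KB = 1,000 bytes, which is what the 200 KB and 1.2 KB numbers imply; `bytes_per_word` is a hypothetical helper.

```python
def bytes_per_word(dims, bytes_per_value=4):
    """Storage for one dense vector with `dims` coordinates."""
    return dims * bytes_per_value


V, d = 50_000, 300
sparse_kb = bytes_per_word(V) / 1000
dense_kb = bytes_per_word(d) / 1000
ratio = dense_kb / sparse_kb

print(f"sparse row: {sparse_kb:.1f} KB")
print(f"dense row:  {dense_kb:.1f} KB")
print(f"dense / sparse: {ratio:.1%}")
```

Since both sizes scale with the same bytes-per-value factor, the ratio is simply d / V, whatever number format you pick.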

From sparse counts to dense embeddings

Co-occurrence captures real meaning but is computationally unusable. So the next question is whether we actually need all 50,000 numbers in each word's vector.

We don't. We don't need the full high-dimensional fingerprint. We need a compressed version that still preserves the important relationships: which contexts this word lives in, and how that compares to other words. So instead of a giant sparse vector like

\[\text{dog}_{\text{sparse}} = [0, 0, 0, 2, 0, 0, \dots, 0] \in \mathbb{R}^{V},\]

we want a much smaller dense vector:

\[\text{dog}_{\text{embed}} = [0.72,\; -0.11,\; 0.83,\; 0.21,\; \dots] \in \mathbb{R}^{d},\]

where \(d\) might be 50, 100, or 300. Dramatically smaller than \(V\), expressive enough to keep the similarity structure.

These smaller, dense vectors are what we call word embeddings.

They still encode "the company a word keeps", but in a form a computer can actually work with. From a matrix perspective, this is approximating the huge matrix \(M\) with a low-rank factorisation:

\[M \approx W C^{\top},\]

where \(W \in \mathbb{R}^{V \times d}\) holds one embedding per word and \(C \in \mathbb{R}^{V \times d}\) holds one context vector per word, with \(d \ll V\).
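To make the factorisation concrete, here is a sketch on the six-word matrix using a truncated SVD from NumPy. SVD is swapped in purely for illustration because it gives a low-rank approximation in one call. It is not how Word2Vec arrives at its vectors. The column order and the helper name `factorise` are assumptions for this example.

```python
import numpy as np

# Co-occurrence matrix for the four-sentence corpus; rows and columns
# assumed to be ordered [I, You, love, my, dog, cat].
M = np.array([
    [0, 0, 2, 0, 0, 0],
    [0, 0, 2, 0, 0, 0],
    [2, 2, 0, 4, 0, 0],
    [0, 0, 4, 0, 2, 2],
    [0, 0, 0, 2, 0, 0],
    [0, 0, 0, 2, 0, 0],
], dtype=float)


def factorise(M, d):
    """Rank-d factorisation M ~ W @ C.T via truncated SVD."""
    U, S, Vt = np.linalg.svd(M)
    W = U[:, :d] * S[:d]  # one d-dimensional embedding per word
    C = Vt[:d, :].T       # one d-dimensional context vector per word
    return W, C


# This toy matrix has only four distinct, independent rows,
# so d = 4 already reproduces it.
W, C = factorise(M, d=4)
print(np.allclose(M, W @ C.T))

# With d = 2 the reconstruction is only approximate,
# but twins still share an embedding.
W2, C2 = factorise(M, d=2)
print(np.allclose(W2[4], W2[5]))  # dog vs cat
```

Even after squeezing six dimensions down to two, "dog" and "cat" land on the same point. The compression throws away detail, but it keeps the similarity structure we care about.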

In practice, prediction-based methods never build \(M\) and factor it. That would defeat the point. Methods like Word2Vec learn \(W\) and \(C\) directly by turning neighbour structure into a prediction game: given a word, predict its neighbours. The gradients do the compression, and the geometry emerges as a side effect of learning to predict context.
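The raw material for that prediction game is a list of (word, neighbour) examples. The sketch below only generates those examples from the toy corpus. The model, the loss and the gradient updates are deliberately left out, and `training_pairs` is a hypothetical helper rather than part of any Word2Vec implementation.

```python
def training_pairs(sentences, window=1):
    """Yield (centre, context) pairs: given the centre, predict the context."""
    for sentence in sentences:
        words = sentence.split()
        for i, centre in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield centre, words[j]


corpus = [
    "I love my dog",
    "I love my cat",
    "You love my dog",
    "You love my cat",
]

pairs = list(training_pairs(corpus))
print(len(pairs))
print(pairs[:6])
```

These pairs carry the same adjacency information as the co-occurrence matrix, just streamed one observation at a time instead of summed into a table.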

Co-occurrence gave us "meaning as a vector." Embeddings make that idea computationally tractable.

Where this is going

Early in the series, we watched text become numbers: ASCII, Unicode, tokenisation. The triumph there was representability, not meaning. In Part 2 we borrowed geometry from images to show what a learning system actually needs. Here we made a different move. We treated meaning as behaviour, let behaviour solidify into counts, and watched similarity appear as a structural fact in a tiny universe. Then we hit the real-world constraint: the naive approach works conceptually but collapses at scale.

Meaning = patterns of usage, compressed into vectors.

What we haven't done yet is show how those dense vectors are actually learned from real text. How "neighbourhood" becomes a prediction task. How a simple model turns repeated exposure into geometry. How gradients slowly sculpt an embedding space where similarity becomes measurable. That's next. We'll turn "who are my neighbours?" into a game the machine can play, and watch the space of meaning emerge as structure the model earns through prediction.
