Before Transformers: How Embeddings Learn Meaning

From Counting to Learning

The geometric leap that makes language models possible.

What co-occurrence was actually telling us

Before talking about learning or models, it's worth rewinding. Last time we built a tiny world. Six words. Four sentences. Everything small enough that structure couldn't hide behind scale.

We didn't teach the system what any word meant. No labels, no definitions, no categories. All we did was count which words appeared next to which other words. Structure emerged anyway. "Dog" and "cat" started resembling each other. "I" and "You" played the same role. "My" hung around nouns. "Love" acted as a bridge. None of this was programmed. It fell out of the data because language itself is structured, and repeated use leaves patterns behind.

Language carries structure within itself. Repeated usage makes that structure visible.

A V×V co-occurrence matrix is an explicit attempt to store those patterns. Each row is a word's behavioural fingerprint: how it relates to every other word in the vocabulary. If two rows look similar, the words tend to appear in similar contexts. The idea is correct.

The co-occurrence matrix isn't naive. It's conceptually right. The question is whether storing it explicitly is the right way forward.
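To make the counting concrete, here is a minimal sketch that builds a V×V table and compares rows. The four sentences are an assumption chosen to fit the six-word, four-sentence setup (the original toy corpus isn't reproduced here), and `cooccurrence` and `cosine` are illustrative helper names, not part of any library.

```python
from math import sqrt

# Assumed toy corpus: six distinct words, four sentences.
corpus = [
    "I love my dog",
    "I love my cat",
    "You love my dog",
    "You love my cat",
]

def cooccurrence(sentences, window=1):
    """Count how often each pair of words sits within `window` positions."""
    vocab = sorted({w for s in sentences for w in s.split()})
    index = {w: i for i, w in enumerate(vocab)}
    table = [[0] * len(vocab) for _ in vocab]
    for s in sentences:
        words = s.split()
        for i, w in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    table[index[w]][index[words[j]]] += 1
    return vocab, index, table

def cosine(a, b):
    """Cosine similarity between two rows of counts."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

vocab, index, table = cooccurrence(corpus)
print(cosine(table[index["dog"]], table[index["cat"]]))  # 1.0: identical fingerprints
print(cosine(table[index["dog"]], table[index["I"]]))    # 0.0: no shared neighbours
```

With this assumed corpus, "dog" and "cat" only ever sit next to "my", so their rows match exactly, while "dog" and "I" share no neighbours at all.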

When V×V meets the real world

In a small controlled setting, the co-occurrence matrix feels elegant. Direct, interpretable, grounded in real usage. The problem appears when you let language be as large and messy as it actually is.

Real vocabularies aren't 6 words. They're 50,000 or 200,000. The matrix grows quadratically: V×V. A vocabulary of 100,000 words means 10 billion possible entries. Most will be zero because most word pairs never appear together. The table has to account for them anyway because they're theoretically possible.
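A quick back-of-the-envelope check of that growth. The 4-byte counter size is an assumption made only to turn entries into a rough memory figure, and `dense_entries` is an illustrative helper.

```python
def dense_entries(vocab_size):
    """Number of cells in a dense V x V table."""
    return vocab_size ** 2

for v in (6, 50_000, 100_000, 200_000):
    entries = dense_entries(v)
    gigabytes = entries * 4 / 1e9  # assumes one 4-byte counter per cell
    print(f"V={v:>7,}: {entries:>17,} entries, about {gigabytes:,.1f} GB")
```

Doubling the vocabulary quadruples the table, which is why the numbers get out of hand so quickly.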

Size isn't even the real problem. The deeper issue is rigidity. Once built, a co-occurrence matrix is frozen. It reflects the past exactly as counted, with no natural way to revise itself.

Think about the word "apple". In an older corpus, it appears near "tree", "fruit", "orchard", "pie". Then new text arrives where "apple" sits next to "phone", "software", "device". The dominant meaning has shifted. The matrix doesn't know about shifts. It just piles new counts on top of old ones. Both senses blend together in the same row with no way to decide which matters more now. The matrix preserves history, not understanding.

[Interactive figure: Watch "apple" drift from fruit to tech. A slider moves the corpus era from 1990 to 2025.]

As the corpus moves from 1990 to 2025, the same word's neighbours completely change, but the matrix has no way to know which meaning matters now. It just sums everything together.

The matrix stores total counts, nothing else. It doesn't know that "tree" was important in 1995 and irrelevant in 2025. Both facts sit in the same row, diluting each other. Understanding would require the ability to revise, to reweight, to learn. A static table can't do any of those.
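The blending can be sketched with two counters. The counts below are invented purely for illustration; they are not measurements from any real corpus.

```python
from collections import Counter

# Hypothetical neighbour counts for "apple" in two eras (invented numbers).
older = Counter({"tree": 40, "fruit": 35, "orchard": 15, "pie": 10})
newer = Counter({"phone": 45, "software": 30, "device": 25})

# The only update a count table supports: add the new counts to the old ones.
row = older + newer

total = sum(row.values())
share = {word: count / total for word, count in row.items()}
for word, fraction in sorted(share.items(), key=lambda item: -item[1]):
    print(f"{word:<9} {fraction:.3f}")
```

Nothing in `row` records when a count was earned. "tree" keeps its full weight from the older text, and there is no operation on the table that would let recent usage count for more.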

The V×V table captures relationships but cannot adapt as understanding evolves.

And yet, abandoning V×V entirely would be a mistake. Word relationships are exactly the right abstraction. The problem is insisting that every relationship be written down explicitly and preserved forever.

What the table is really made of

Language doesn't arrive as a table. It arrives as sentences. Sentences arrive as sequences of words. Words bump into each other in small local neighbourhoods. Each encounter is minor, but together they accumulate.

Every cell in the co-occurrence matrix just answers one historical question: how many times did these two words sit next to each other? The matrix doesn't reason or infer. It remembers.

The co-occurrence matrix is accumulated experience, not learned understanding.

By building the table, we made an implicit choice: forget individual encounters, keep only totals. Collapse time. Erase order. Replace experience with a snapshot. That snapshot is useful, but once the events are gone the table has nothing left to learn from.

This opens a different possibility. If the table is just the sum of events, maybe the table itself isn't what we should be storing. Maybe learning belongs earlier, at the level of individual word encounters, not as an afterthought slapped on top of finished counts.
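A small sketch of what "erase order" means in practice. The second sentence is an assumed example, and `pair_events` is an illustrative helper that treats each adjacent pair of words as one encounter.

```python
from collections import Counter

def pair_events(sentence):
    """List the adjacent word pairs in a sentence, in reading order."""
    words = sentence.split()
    return list(zip(words, words[1:]))

sentences = ["I love my dog", "You love my cat"]  # second sentence is assumed
events = [pair for s in sentences for pair in pair_events(s)]

# Summing the events in either order gives exactly the same table.
table_forward = Counter(events)
table_backward = Counter(reversed(events))
print(table_forward == table_backward)  # True
print(table_forward[("love", "my")])    # 2
```

Once the events are collapsed into totals, there is no way to recover which encounter came first or to replay them one at a time.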

Keeping the V×V idea without storing the V×V table

The goal hasn't changed. We still want to capture how words relate to each other. What changes is the form.

Instead of storing a value for every possible word pair, we ask: can we represent each word in a way that lets these relationships be reconstructed when needed? Replace explicit storage with a system that produces the same information through interaction.

Look at what actually happens in text. In "I love my dog", the word "love" participates in two relationships: it connects to "I" on one side and "my" on the other. "I" participates in only one. Even within a single sentence, words don't contribute equally to context. Some sit at the centre. Others are at the boundary.
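The same observation as a sketch, assuming a context window of one word on each side. `centre_context_pairs` is an illustrative name, not a library function.

```python
from collections import Counter

def centre_context_pairs(sentence, window=1):
    """Pair every word (as centre) with each neighbour inside the window."""
    words = sentence.split()
    pairs = []
    for i, centre in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((centre, words[j]))
    return pairs

pairs = centre_context_pairs("I love my dog")
print(pairs)

# How many relationships each word takes part in as the centre word.
roles = Counter(centre for centre, _ in pairs)
print(roles["love"], roles["I"])  # 2 1
```

Each pair has a direction: the first word is being described, the second is doing the describing.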

This asymmetry matters. A word doesn't play the same role when it's surrounded by context as when it provides context to another word. Once you notice this, a natural mathematical distinction follows: each word needs two representations.

The centre vector \(\mathbf{v}_w \in \mathbb{R}^d\) describes how the word behaves when it's the word we're trying to model. The context vector \(\mathbf{c}_w \in \mathbb{R}^d\) describes how it behaves when it appears in another word's neighbourhood.

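Collecting these across the vocabulary gives us two matrices. \(V \in \mathbb{R}^{V \times d}\) holds all centre vectors, one per row. \(C \in \mathbb{R}^{d \times V}\) holds all context vectors, one per column, so the product \(VC\) is a V×V table of scores. (The symbol V does double duty here: in the dimensions it is the vocabulary size, on its own it names the centre matrix.)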

Instead of storing V×V relationships directly, we store two V×d structures that can generate them.

Reconstructing a co-occurrence entry

Take "love" as the centre and "my" as context. We don't look up a stored count. We take the centre vector of "love" and the context vector of "my" and compute their inner product:

\(s(\text{love}, \text{my}) = \mathbf{v}_{\text{love}}^\top \mathbf{c}_{\text{my}}\)

This scalar plays the same role as an entry in the co-occurrence matrix. Large value means strong association, small value means weak. As sentences accumulate, the learning process adjusts both vectors so this inner product grows for common pairings and shrinks for rare ones.

Extend this across the entire vocabulary and every potential entry of the V×V table can be approximated implicitly:

\(\text{implicit co-occurrence}(w_i, w_j) \approx \mathbf{v}_{w_i}^\top \mathbf{c}_{w_j}\)

Instead of V² stored numbers, we have two collections of V vectors: 2Vd numbers in total, which is far fewer whenever d is much smaller than V. The full table still exists conceptually. It's just never written down. It lives in the geometry of the vector space.
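Here is a minimal sketch of that reconstruction. The vectors are random and untrained, standing in for learned ones, and d = 2 and d = 300 are assumed sizes picked for illustration.

```python
import random

random.seed(0)
vocab = ["I", "You", "love", "my", "dog", "cat"]
d = 2  # assumed embedding size for the toy example

# Random, untrained vectors: a stand-in for the learned ones.
centre = {w: [random.uniform(-1, 1) for _ in range(d)] for w in vocab}
context = {w: [random.uniform(-1, 1) for _ in range(d)] for w in vocab}

def score(w_i, w_j):
    """Inner product of w_i's centre vector with w_j's context vector."""
    return sum(a * b for a, b in zip(centre[w_i], context[w_j]))

# Any cell of the V x V table can be produced on demand, never stored.
implicit_table = [[score(wi, wj) for wj in vocab] for wi in vocab]

stored = 2 * len(vocab) * d   # numbers we actually keep
explicit = len(vocab) ** 2    # numbers a dense table would keep
print(stored, explicit)               # 24 36
print(2 * 100_000 * 300, 100_000 ** 2)  # 60000000 10000000000
```

At six words the saving is small. At a vocabulary of 100,000 with an assumed d of 300, it's 60 million numbers instead of 10 billion.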

Two vectors, one implicit cell

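\(\mathbf{v}_{\text{centre}} = (1.8, 1.2)\)
\(\mathbf{c}_{\text{context}} = (1.6, 1.5)\)
\(\mathbf{v}_{\text{centre}}^\top \mathbf{c}_{\text{context}} = 1.8 \times 1.6 + 1.2 \times 1.5 = 4.68\)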

This is what the matrix would have stored as the co-occurrence of these two words. Except we never stored it. The vectors produced it on demand.

The dot product rises when the vectors point the same way and falls when they diverge. That single number is the implicit matrix cell.

This is the conceptual payoff of the entire post. The co-occurrence value is no longer retrieved from storage. It's generated by interaction between a centre vector and a context vector. The same cell can now be adjusted by moving either vector, which is what makes learning possible. A static table can't do that. A pair of vectors can.
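A sketch of what "adjusting a cell by moving a vector" can look like, using the two vectors from the example above. The update rule and the step size are assumptions made for illustration; they are not the training procedure, which comes later.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

v_centre = [1.8, 1.2]
c_context = [1.6, 1.5]
before = dot(v_centre, c_context)  # about 4.68

step = 0.1  # illustrative step size, chosen arbitrarily

# Move the centre vector a little toward the context vector...
v_pulled = [v + step * c for v, c in zip(v_centre, c_context)]
# ...or a little away from it.
v_pushed = [v - step * c for v, c in zip(v_centre, c_context)]

print(round(before, 3))                    # 4.68
print(round(dot(v_pulled, c_context), 3))  # 5.161
print(round(dot(v_pushed, c_context), 3))  # 4.199
```

The cell changes by exactly `step` times the squared length of the context vector, so a small move produces a small, predictable change in the implied count.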


Geometry as memory

Each word now exists as a point in a shared space. Its position isn't arbitrary. It's shaped by every context the word has appeared in and every role it has played relative to other words. The history of a word is no longer a row of counts. It's encoded in where that word settles.

Meaning is no longer stored in cells. It's stored in relative positions.

Why similar words end up close

Words that appear in similar contexts get exposed to similar patterns of adjustment. Their vectors get nudged in similar directions, sentence after sentence. Over time, those shared pressures cause them to drift together. This isn't something we program. Closeness emerges as a consequence of shared experience.

Words with different histories get pulled in different directions. Distance becomes a record of difference in usage. Similar histories produce proximity. Different histories produce separation.

Learning as accumulation

No single sentence defines what a word means. A word doesn't acquire meaning the first time it appears. Meaning emerges slowly through repeated exposure. Each sentence contributes a small nudge. Over time these nudges accumulate into a stable position.

This mirrors how humans learn language. A child doesn't hear "dog" once and fully understand it. The word appears in many situations, with different dogs, in different sentences. A coherent concept forms gradually.

[Interactive figure: Watch clusters form from random noise. The simulation runs for 200 steps.]

Every word starts at a random location. A tiny corpus of sentences drives the simulation: at each step, words that appear in the same sentence get pulled slightly toward each other, and unrelated words get pushed apart. No labels. No categories. Just adjacency. Royals, pets, and foods settle into their own neighbourhoods. This is exactly what we'll see Word2Vec do in the next post, only simpler.
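A rough sketch of a simulation in this spirit. The corpus, the pull and push strengths, and the exact update rule are all assumptions made for illustration; this is not the code behind the figure and not Word2Vec.

```python
import random
from itertools import combinations
from math import dist

random.seed(0)

# Hypothetical corpus: each sentence stays within one theme.
sentences = [
    ["king", "queen", "crown"], ["queen", "prince", "crown"],
    ["dog", "cat", "puppy"], ["cat", "puppy", "kitten"],
    ["bread", "cheese", "soup"], ["cheese", "soup", "apple"],
]
words = sorted({w for s in sentences for w in s})
# Pairs that share at least one sentence, stored in sorted order.
together = {tuple(sorted(p)) for s in sentences for p in combinations(s, 2)}
apart = [p for p in combinations(words, 2) if p not in together]

# Every word starts at a random point in the plane.
pos = {w: [random.uniform(-1, 1), random.uniform(-1, 1)] for w in words}

def mean_gap(pairs):
    """Average distance between the two words of each pair."""
    return sum(dist(pos[a], pos[b]) for a, b in pairs) / len(pairs)

def step(pull=0.05, push=0.002):
    for a, b in combinations(words, 2):
        gap = [q - p for p, q in zip(pos[a], pos[b])]
        if (a, b) in together:
            k = pull  # close a fraction of the gap
        else:
            # widen the gap, with a push that fades as words get far apart
            k = -push / (1.0 + dist(pos[a], pos[b]) ** 2)
        for i in range(2):
            pos[a][i] += k * gap[i]
            pos[b][i] -= k * gap[i]

gap_before = mean_gap(sorted(together))
for _ in range(200):
    step()
gap_after = mean_gap(sorted(together))

print(f"same-sentence pairs: {gap_before:.2f} -> {gap_after:.2f}")
print(f"other pairs now:     {mean_gap(apart):.2f}")
```

No step knows anything about royals, pets, or foods. The only input is which words shared a sentence, and each update is tiny.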

Learning doesn't happen in a single step. It happens through the slow accumulation of experience.

Where we go next

We've seen why explicit tables struggle to scale and adapt. We've seen how word relationships can be preserved without ever writing down a V×V matrix. We've watched meaning migrate from cells into geometry, where memory takes the form of relative position instead of fixed entries.

Words are no longer defined by lists of counts. They're defined by where they settle in a space shaped by experience. Similar histories produce proximity. Different histories produce separation. Learning emerges as accumulation, not revelation.

The idea is now in place. What remains is learning how to actually find these vectors.

Next we'll move from intuition to procedure. We'll ask what forces shape these vectors, how interactions turn into learning, and how simple mathematical operations give rise to rich structure.
