Before Transformers: How Embeddings Learn Meaning

From Characters to Concepts

How raw text representation evolved into structured meaning, and why that matters for anything that learns.

Where we left off

Last time we got text into machines. Actually got it in there. Characters got stable numerical identities through Unicode, and those numbers could be stored, copied, sent across the planet, and recovered intact on the other side. After decades of broken encodings and documents that fell apart whenever they moved between systems, text finally had a reliable home inside machines.

That mattered more than it sounds. Text stopped breaking. Files survived migration. Different languages could sit in the same system without colliding. A huge amount of modern software quietly depends on this working.

But what we actually built was a way for machines to hold text. Not a way for machines to do anything with it. The text is stable. It isn't meaningful. Those are very different things.

Text became representable. Nothing became meaningful.

What's actually inside those numbers

Unicode numbers are identifiers. That's all they are. Entries in a giant lookup table that everyone on the planet agreed to use. "A" is 65 everywhere. "你" is 20320 everywhere. That agreement is the whole point, and it's why Unicode works.

What those numbers don't contain matters just as much. The number for one character has no idea what's next to it. A sequence of Unicode values doesn't know it spells a word. A word doesn't know it's in a sentence. There's no concept of similarity or distance baked in. To the machine they're just isolated integers sitting in memory with nothing connecting them.
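A quick way to see this is to look at the code points directly. The sketch below uses only Python's built-in `ord`; the three example words are arbitrary choices for illustration, not taken from any dataset.

```python
# Code points are identifiers: stable, but silent about meaning.
words = ["cat", "car", "dog"]  # arbitrary example words (an assumption)

def code_points(word):
    """Return the Unicode code point of each character in the word."""
    return [ord(ch) for ch in word]

for w in words:
    print(w, code_points(w))

# The identities are the same on every machine.
assert ord("A") == 65
assert ord("你") == 20320

# Numeric closeness is an accident of table layout, not of meaning.
# "cat" and "car" differ by 2 in their last code point, yet nothing in
# that gap says one is an animal and the other a vehicle.
gap = abs(ord("t") - ord("r"))
print("gap between 't' and 'r':", gap)
```

The integers can be compared for equality, and that is about all. Any notion of "cat is closer to dog than to car" has to come from somewhere else.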

This isn't a bug. Unicode was never meant to capture relationships between characters. It was meant to stop people arguing about what "A" means on different machines. Once you see it that way, its limitations aren't failures. They're just the edge of what it was built to do.

Why learning changes the rules

When all we needed machines to do was store text, display it and send it places, Unicode was perfect. Learning systems ask a completely different question. They're not preserving symbols. They're hunting for patterns. What shows up together? What behaves similarly in different contexts? What predicts what?

All of those are questions about relationships. And relationships need some kind of space to exist in. You need a way to say "these two things are close" or "these two things are far apart." Without that, a learning system is stuck comparing numbers for exact equality: same or not same. That's it. That's all it's got to work with.

Learning doesn't operate on labels. It operates on relationships.

If your learner only ever sees a flat sequence of unrelated integers, it has to figure out every relationship from scratch through pure statistical exposure. It's possible, but it's like learning geography without ever seeing a map. You'd get there, but you'd waste a ridiculous amount of effort rediscovering structure that could have been given to you upfront.

The question we can't avoid anymore

Getting text into numbers was necessary. It's nowhere near enough. If a machine is going to learn from text, those numbers need to live in a space where relationships are possible. Where patterns can form naturally. Where "similar" is something the system can actually measure, not something it has to invent from nothing.

Meaning doesn't arrive as a definition. It arrives as structure.

Before we try to build that kind of space for language, it helps to look at a domain where this idea is already obvious. Where numbers don't just label things but live inside a geometry. Where similarity isn't a metaphor. It's just maths.

Why images are the best place to start

Images give us something text never had: built-in structure. An image isn't a symbol. It's a grid of numbers laid out in space. Before any learning even happens, there's already geometry to work with. Pixels have neighbours. Regions have patterns. Two photos of the same thing tend to light up similar areas.

The dataset that makes this most tangible is MNIST: 70,000 handwritten digits, each a tiny 28×28 grayscale image.

[Figure: samples from the MNIST dataset. Different people writing the same digits. Same class, never the same pixels.]

Nothing semantic has happened yet. The machine has no concept of "three" or "eight." All it has are pixel intensities. But those intensities already encode something useful: two handwritten threes tend to light up similar regions, while a three and a one overlap far less. The similarity is already there, baked into the data, waiting to be measured. That's the crucial difference from text.

An image is a location, not a picture

Each MNIST image is 28×28 pixels. Read those values row by row and you get a list of 784 numbers. Nothing clever. Just rearranging a grid into a line.
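The rearrangement can be sketched in a few lines of plain Python. The image here is made up (all dark except one bright pixel), and the helper names `flatten` and `flat_index` are hypothetical, chosen only for this illustration.

```python
SIDE = 28  # MNIST images are 28 x 28

def flatten(grid):
    """Read a 2D grid row by row into one flat list."""
    return [value for row in grid for value in row]

def flat_index(row, col, side=SIDE):
    """Position of pixel (row, col) in the flattened vector."""
    return row * side + col

# A made-up image: all dark except one bright pixel at row 3, column 5.
image = [[0.0] * SIDE for _ in range(SIDE)]
image[3][5] = 1.0

vector = flatten(image)
print(len(vector))               # 784
print(flat_index(3, 5))          # 89
print(vector[flat_index(3, 5)])  # 1.0
```

Every pixel keeps its value and simply gets a fixed position in the list: pixel (row, col) becomes coordinate `row * 28 + col`.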

[Interactive figure: flatten a digit and watch pixels become coordinates. A 28 × 28 grid is unrolled into a flattened vector of 784 coordinates, one per pixel intensity.]

The image is no longer a picture. It's a location. A point in a 784-dimensional space. Once things live in space, geometry applies automatically. Two images that look similar to us will tend to land near each other. Images that look different will land far apart. No labels needed for this to be true.

What similarity actually means, mathematically

Once an image is flattened, it's just a vector: a point \(\mathbf{x} \in \mathbb{R}^{784}\) where each coordinate is a pixel intensity. "Do these two digits look alike?" becomes a geometry question: how close are two points, and how aligned are their directions?

Two measures show up everywhere because they capture different intuitions.

Euclidean distance

The straightforward one. For two vectors \(\mathbf{x}\) and \(\mathbf{y}\):

\[d(\mathbf{x}, \mathbf{y}) = \lVert \mathbf{x} - \mathbf{y} \rVert_2 = \sqrt{\sum_{i=1}^{784} (x_i - y_i)^2}\]

Small distance means the two images agree on most pixels. But it's fragile. Shift the stroke slightly, use a thicker pen, change the brightness, and lots of coordinates move at once. The distance balloons even when you'd confidently say "same digit."
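The fragility is easy to reproduce. Here is a minimal sketch in plain Python, assuming a made-up image that contains a single full-height vertical stroke. The helper `vertical_stroke` and the column numbers are hypothetical, chosen only for illustration.

```python
import math

SIDE = 28

def euclidean(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def vertical_stroke(col, intensity=1.0):
    """Flattened 28 x 28 image with one full-height stroke in column `col`."""
    return [intensity if c == col else 0.0
            for r in range(SIDE) for c in range(SIDE)]

stroke = vertical_stroke(13)
shifted = vertical_stroke(14)    # same stroke, one pixel to the right
blank = [0.0] * (SIDE * SIDE)    # an empty image

print(euclidean(stroke, stroke))   # 0.0
print(euclidean(stroke, shifted))  # sqrt(56), about 7.48
print(euclidean(stroke, blank))    # sqrt(28), about 5.29
```

Moving the stroke one pixel sideways leaves no lit pixels in common, so by this measure the shifted stroke ends up farther from the original than an empty image does.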

Cosine similarity

Takes a different angle. Literally. Instead of comparing raw pixel values, it compares directions:

\[\mathrm{cosine}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}^{\top}\mathbf{y}}{\lVert \mathbf{x} \rVert_2 \, \lVert \mathbf{y} \rVert_2}\]

The numerator \(\mathbf{x}^{\top}\mathbf{y} = \sum_i x_i y_i\) measures raw overlap: do bright pixels in one image tend to be bright in the other? Dividing by the norms removes scale, so a uniformly darker or lighter version of the same digit doesn't tank the score.

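Geometrically this is just \(\cos(\theta)\), the cosine of the angle between two vectors. Small angle means high similarity. Perpendicular means no overlap at all. Because pixel intensities are never negative, two images can only be perpendicular when they share no lit pixels, and the score stays between 0 and 1.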

One quick check. Suppose \(\mathbf{y} = \alpha \mathbf{x}\) for some \(\alpha > 0\). Same pattern, just scaled: the same strokes in uniformly fainter or stronger ink. Then:

\[\mathrm{cosine}(\mathbf{x}, \alpha\mathbf{x}) = \frac{\alpha \lVert\mathbf{x}\rVert_2^2}{\lVert\mathbf{x}\rVert_2 \cdot \alpha \lVert\mathbf{x}\rVert_2} = 1\]

Perfect similarity. Cosine treats positive rescaling as "no change in identity." That's exactly right for handwritten digits. The shape matters. How dark the ink is doesn't.
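The same check can be run numerically. This is a small sketch in plain Python; the four "pixel intensities" in `x` and the scale factor `alpha` are made-up values, and `cosine` assumes neither vector is all zeros.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    return dot(x, y) / (norm(x) * norm(y))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = [0.0, 0.2, 0.9, 0.4]    # made-up pixel intensities
alpha = 3.0
y = [alpha * v for v in x]  # same pattern, rescaled

print(cosine(x, y))     # 1.0 up to floating-point rounding
print(euclidean(x, y))  # (alpha - 1) * norm(x), clearly non-zero

# Vectors with no lit pixels in common are perpendicular.
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Cosine reports the two vectors as identical while Euclidean distance grows with `alpha`, which is the disagreement the algebra predicts.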

[Interactive figure: drag two vectors and watch the metrics disagree. Readouts show cosine similarity (only the angle matters) and Euclidean distance (length differences count). With the "same direction, different length" preset, cosine stays at 1 while Euclidean reports a big gap.]

Cosine measures angle. Euclidean measures straight-line distance. When two vectors point the same way but differ in length, Euclidean says they're far apart. Cosine says they're identical. For something like handwritten digits, where the same shape can appear at many ink intensities, cosine's indifference to scale is usually what you want. It doesn't help with shifted or thicker strokes, though. Those change which pixels are lit, not just how bright they are.

Real digits, real scores

[Figure: similarity matrix for stylised digits.] Each cell shows the similarity between two stylised digits, each a 28×28 grid flattened into a 784-dimensional vector. Bright green cells on the diagonal are a digit compared to itself. The off-diagonal pattern matters more: 3 and 8 tend to score high on cosine because they share a lot of overlapping curves, while 1 and 0 score low. The machine hasn't been taught what any of these digits are. The structure is already in the pixels.
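A scaled-down version of that comparison can be sketched in plain Python. The 5 × 3 glyphs below are hypothetical, drawn by hand for this example. They are not MNIST data and not the stylised digits from the figure, so the exact scores will differ.

```python
import math

# Hypothetical 5 x 3 glyphs, drawn by hand for this sketch (not MNIST data).
GLYPHS = {
    "0": ["111", "101", "101", "101", "111"],
    "1": ["010", "010", "010", "010", "010"],
    "3": ["111", "001", "111", "001", "111"],
    "8": ["111", "101", "111", "101", "111"],
}

def to_vector(rows):
    """Flatten a glyph row by row into a list of 0.0 / 1.0 intensities."""
    return [float(ch) for row in rows for ch in row]

def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

vectors = {label: to_vector(rows) for label, rows in GLYPHS.items()}
labels = sorted(vectors)

# Print the pairwise cosine similarity matrix.
print("   " + "  ".join(f"{b:>4}" for b in labels))
for a in labels:
    row = "  ".join(f"{cosine(vectors[a], vectors[b]):4.2f}" for b in labels)
    print(f"{a:>2} {row}")
```

With these toy glyphs, every lit pixel of the 3 is also lit in the 8, so that pair scores about 0.92, while the 1 and the 0 share only two pixels and score about 0.26. No labels were used to get that ordering.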

Once data lives in a space, meaning becomes something you can measure.

Why this matters for language

This is why we started with images. They show us what learning systems actually need, and what text doesn't naturally provide.

Unicode gave text stable identities. Identities are isolated points. No distance, no angle, no neighbourhood. A learning system looking at Unicode integers can only check: same number or different number? That's not enough to learn anything interesting.

Learning requires a space where similarity is measurable.

Images already have that space. Text doesn't. If language is going to be learnable the way images are, we need to build one. That's the problem embeddings solve. And that's where we're heading next.
