Word2Vec: The Foundational Root of Modern LLMs
Long before GPT-4 and Claude, a small 2013 paper from Mikolov and colleagues at Google quietly rewrote how machines understand language. This is the story of word2vec — from the linguistics it stole, through every matrix multiplication, all the way to the attention blocks of today's frontier models. With interactive math you can poke at.
In 2013, a small team at Google led by Tomáš Mikolov published two short papers that, with the benefit of hindsight, did something enormous: they taught a computer that the relationship between king and queen is the same kind of relationship as the one between man and woman — using nothing but raw text and a few hundred lines of code. Every embedding layer in every LLM you use today is a direct descendant of that idea.
If you've ever wondered why a transformer's first move is to look up each token in a giant matrix and turn it into a vector — and what that vector actually means — the answer is word2vec. This post is the long-form, beginner-friendly, but math-honest tour. We'll start with the linguistics, walk through every matrix multiplication, derive the loss, fix the bottleneck that almost killed the idea, and end at the embedding + attention stack of a modern LLM. Sliders, animations, and one piece of vector arithmetic you can drag with your mouse included.
Each major section has an interactive visual. The math is presented gently first, then formally. If a formula feels heavy, scroll past it and play with the visual underneath — the intuition almost always survives the equations.
Part 1 — The idea that made it possible
“You shall know a word by the company it keeps.”
That sentence is from the British linguist J.R. Firth in 1957, and it is the single most important sentence in the history of NLP. It's called the distributional hypothesis, and it says: the meaning of a word is captured, statistically, by the other words that tend to appear near it.
Take two sentences: “She poured the wine into the glass.” and “She poured the water into the glass.” The words wine and water never have to be defined for you to feel they belong to a similar neighborhood: both are pourable, both end up in glasses, both follow “she poured the.” If we collected millions of sentences and tallied which words live next to which, two words with similar neighborhoods would, by definition, mean similar things.
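To make the tallying concrete, here is a minimal sketch of windowed co-occurrence counting on the two example sentences. The function name `cooccurrence_counts`, the window size of 2, and the crude tokenisation (lowercase, strip full stops, split on spaces) are illustrative assumptions, not part of word2vec itself.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(sentences, window=2):
    """For each word, count the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Crude tokenisation, good enough for a toy corpus.
        tokens = sentence.lower().replace(".", "").split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

sentences = [
    "She poured the wine into the glass.",
    "She poured the water into the glass.",
]
counts = cooccurrence_counts(sentences)

print(counts["wine"])
# In this toy corpus, wine and water have identical neighbourhoods.
print(counts["wine"] == counts["water"])  # True
```

On a real corpus the tallies would only be similar, not identical, and that similarity is the signal the rest of the post builds on.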
Notice what's not in those pairs: no dictionary, no parser, no part-of-speech tags, no grammar rules. Just raw co-occurrence. The bet word2vec makes is that this is enough — that meaning is a statistical artefact of context, and a model with no linguistic knowledge whatsoever can recover it by getting good at predicting which word shows up next to which.
Part 2 — From words to vectors (the matrix math)
A neural network does not eat strings. It eats numbers. So the first job is to turn each word in our vocabulary into a number — and then into a vector of numbers, because a single number can't express anything as rich as meaning.
Step 1: one-hot vectors (the dumb way)
Suppose our vocabulary has V = 10,000 words. We assign each word an index 0 … V−1, and represent it as a vector of length V that's zero everywhere except a single 1 at its index. So the word king might be [0, 0, …, 1, …, 0] with the 1 at position 4,217.
This is mathematically clean but useless as a representation: every pair of words is exactly the same distance apart. king and queen are as “similar” as king and banana. We need to compress these huge sparse vectors into small dense ones where geometry actually carries meaning.
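A quick check of the "same distance apart" claim, as a sketch with a tiny vocabulary of 5 words (the size is an arbitrary choice for illustration). Every distinct pair of one-hot vectors has Euclidean distance √2 and dot product 0.

```python
import numpy as np

V = 5  # tiny vocabulary, for illustration only
one_hots = np.eye(V)  # row i is the one-hot vector for word i

# Euclidean distance between every distinct pair of words.
distances = [
    float(np.linalg.norm(one_hots[i] - one_hots[j]))
    for i in range(V)
    for j in range(i + 1, V)
]

# Every pair is equally far apart, and every pair is orthogonal.
print(min(distances), max(distances))
print(float(one_hots[0] @ one_hots[1]))
```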
Step 2: an embedding matrix is just a lookup table dressed as math
We create a matrix E of shape V × d, where d is small — say 300. Row i of E is the d-dimensional embedding vector for word i. To get the embedding for word king, we compute x · E, where x is its one-hot vector. Because x is zero everywhere except position 4,217, this matrix multiplication picks out exactly row 4,217 of E. That's it — “looking up an embedding” is a one-hot times matrix product.
| 0.50 | 0.48 | -0.59 | -0.01 | -0.82 | 0.56 |
| 0.21 | 0.11 | 0.49 | 0.44 | -0.64 | -0.02 |
| 0.10 | -0.95 | 0.26 | 0.15 | 0.47 | 0.41 |
| 0.70 | -0.44 | 0.07 | -0.93 | 0.30 | 0.20 |
| 0.00 | 0.38 | 0.68 | -0.49 | 0.05 | -0.90 |
| -0.94 | 0.01 | 0.06 | 0.39 | 0.65 | -0.54 |
A one-hot times a matrix is just a row lookup. That single line of math is the embedding layer of every modern LLM.
    # A V×d embedding matrix is literally a learnable lookup table.
    import numpy as np

    V, d = 10_000, 300
    E = np.random.randn(V, d) * 0.01  # random init, will be learned
    word_to_id = {"king": 4217, "queen": 4218, "man": 901, "woman": 902}

    def embed(word: str) -> np.ndarray:
        one_hot = np.zeros(V)
        one_hot[word_to_id[word]] = 1.0
        return one_hot @ E  # equivalent to E[word_to_id[word]]

    print(embed("king").shape)  # (300,)

Compression forces the model to invent structure. With only 300 dimensions to work with, the network can't memorise; it has to discover that certain directions in vector space correspond to gender, tense, plurality, sentiment, etc. The dimensions aren't labelled, but they emerge.
Part 3 — The two training games: CBOW and Skip-gram
We now have a giant matrix E of random numbers. We need a training signal that nudges similar-meaning words toward similar rows. word2vec frames this as a self-supervised game with the corpus as the only teacher.
- CBOW (Continuous Bag-of-Words) — given the surrounding context words, predict the center word. Fast, smooths over noise, works well on frequent words.
- Skip-gram — given the center word, predict each of the surrounding context words. Slower per step, but much better at rare words and at capturing fine-grained relationships. This is the variant most people mean when they say “word2vec”.
Same embedding matrix, opposite directions. CBOW averages the context to predict the center; Skip-gram predicts each context word from the center.
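The two games differ only in how training examples are cut out of the text. The sketch below builds both kinds of example from one sentence; the helper names `skipgram_pairs` and `cbow_examples` and the fixed symmetric window are assumptions made for illustration.

```python
def _context(tokens, i, window):
    """Words within `window` positions of index i, excluding position i itself."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]

def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center, context) pair per neighbouring word."""
    return [
        (center, ctx)
        for i, center in enumerate(tokens)
        for ctx in _context(tokens, i, window)
    ]

def cbow_examples(tokens, window=2):
    """CBOW: one (context_words, center) example per position."""
    return [(_context(tokens, i, window), center) for i, center in enumerate(tokens)]

tokens = "she poured the wine into the glass".split()

print(skipgram_pairs(tokens, window=1)[:3])
print(cbow_examples(tokens, window=1)[3])
```

Skip-gram produces several small examples per position while CBOW produces one, which is one reason Skip-gram is slower to train.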
The forward pass, in honest math
Skip-gram has two matrices to learn: an input embedding matrix E (shape V×d) for center words, and an output embedding matrix U (shape V×d) for context words. Most people don't realise there are two — but the asymmetry matters, and at the end we usually throw U away and keep only E as “the embeddings”.
Given a center word c, we compute its center embedding v_c = E[c]. To score how likely each candidate word w is to be a context word, we take the dot product u_w · v_c, where u_w = U[w]. A higher dot product means more similar in direction, which we interpret as more likely. To turn V raw scores into a proper probability distribution, we apply softmax:
P(w | c) = exp(u_w · v_c) / Σ_{w' ∈ V} exp(u_{w'} · v_c)

The training loss for a single (center, context) pair is just the negative log-likelihood of the true context word: L = −log P(w_true | c). Sum this over every (center, context) pair in the corpus, run SGD, and the gradients gently rotate the rows of E and U so that words appearing in similar contexts end up with similar directions.
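As a sketch of the forward pass and one SGD step with the full softmax, the code below uses a toy vocabulary (V = 50, d = 8), random initial matrices, and an arbitrary learning rate; all of those numbers and the function name `softmax_loss_and_grads` are assumptions for illustration. The gradient with respect to the scores is the standard softmax result, P(· | c) minus the one-hot target.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50, 8  # toy sizes, not realistic
E = rng.normal(scale=0.1, size=(V, d))  # center-word embeddings
U = rng.normal(scale=0.1, size=(V, d))  # context-word embeddings

def softmax_loss_and_grads(E, U, c, w_true):
    v_c = E[c]
    scores = U @ v_c                    # u_w · v_c for every w, shape (V,)
    scores = scores - scores.max()      # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    loss = -np.log(probs[w_true])       # L = -log P(w_true | c)

    dscores = probs.copy()              # dL/dscores = probs - one_hot(w_true)
    dscores[w_true] -= 1.0
    grad_v_c = U.T @ dscores            # shape (d,)
    grad_U = np.outer(dscores, v_c)     # shape (V, d): every output row gets a gradient
    return float(loss), grad_v_c, grad_U

c, w_true, lr = 3, 7, 0.5
loss_before, g_v, g_U = softmax_loss_and_grads(E, U, c, w_true)
E[c] -= lr * g_v
U -= lr * g_U
loss_after, _, _ = softmax_loss_and_grads(E, U, c, w_true)

print(loss_before, loss_after)  # the loss should drop after the update
```

Note that `grad_U` has a row for every word in the vocabulary, so a single training pair updates all V output vectors.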
Two words are 'similar' if their embedding vectors point in similar directions — measured by cosine similarity, which is just the normalised dot product. Length doesn't matter; angle does.
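Cosine similarity in code, as a minimal sketch (the function name is an illustrative choice). Scaling a vector leaves the result unchanged, which is the "length doesn't matter" part.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b after normalising both to unit length."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])

same_direction = cosine_similarity(a, 10 * a)  # longer vector, same angle
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

print(same_direction, orthogonal)
```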
Part 4 — The softmax problem (and the trick that fixed it)
Look back at that softmax denominator. It's a sum over every word in the vocabulary. For V = 100,000 words, every single training step requires 100,000 dot products and exponentials — and then a backward pass updating every output row u_w'. With billions of (center, context) pairs in a real corpus, this is computationally hopeless. This is why earlier neural language models from Bengio (2003) had been famously slow.
Negative sampling: turn one giant classification into many tiny ones
Mikolov's second paper introduced negative sampling, the trick that made word2vec practical. Instead of asking “which of these 100,000 words is the right context word?”, we ask a much cheaper binary question: “is this pair (center, candidate) a real one from the corpus, or did I make it up?”
For each true (c, w) pair, we sample k ≈ 5–20 random negative words from the vocabulary (weighted by a smoothed unigram distribution, P(w)^0.75, which down-weights very frequent words). We then train the model to push the dot product up for the real pair and down for the negatives, using a sigmoid loss instead of softmax:
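A sketch of the smoothed sampling distribution, using made-up word counts. The counts, the helper name `sample_negatives`, and the choice to re-draw a negative that happens to equal the true context word are assumptions for illustration.

```python
import numpy as np

# Hypothetical corpus frequencies for a 4-word vocabulary.
counts = np.array([1000.0, 100.0, 10.0, 1.0])

unigram = counts / counts.sum()
smoothed = counts ** 0.75
smoothed = smoothed / smoothed.sum()

# Raising to the 0.75 power moves probability mass from frequent to rare words.
print(unigram.round(4))
print(smoothed.round(4))

rng = np.random.default_rng(0)

def sample_negatives(k, exclude):
    """Draw k word ids from the smoothed distribution, skipping `exclude`."""
    negatives = []
    while len(negatives) < k:
        candidate = int(rng.choice(len(smoothed), p=smoothed))
        if candidate != exclude:
            negatives.append(candidate)
    return negatives

negatives = sample_negatives(k=5, exclude=0)
print(negatives)
```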
L = −log σ(u_w · v_c) − Σ_{i=1..k} log σ(−u_{w_i} · v_c)

The first term scores the true context word; the sum runs over the k random negatives.

Each step now touches only k+1 output rows instead of all V: with k = 10 on a 100k-word vocabulary that is 11 rows instead of 100,000, roughly 9,000× fewer output-row updates per step. This single change is what turned a clever idea into a paper everybody could reproduce on a laptop. (The other common trick, hierarchical softmax, uses a Huffman tree to reduce the cost to O(log V); negative sampling won in practice because it's simpler and trains faster on big corpora.)
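One SGD step on the negative-sampling loss can be sketched as follows. The function name `sgns_step`, the toy sizes, the fixed list of negatives, and the learning rate are assumptions for illustration; the gradient uses the identity 1 − σ(x) = σ(−x), so the error term for each row is σ(score) minus its label (1 for the true word, 0 for negatives).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(E, U, c, w, negatives, lr=0.1):
    """One in-place SGD step on L = -log σ(u_w·v_c) - Σ log σ(-u_wi·v_c).

    Returns the loss measured before the update.
    """
    v_c = E[c].copy()
    ids = np.array([w] + list(negatives))  # the true word first, then k negatives
    labels = np.zeros(len(ids))
    labels[0] = 1.0

    u = U[ids]                             # only k+1 output rows are read
    preds = sigmoid(u @ v_c)
    loss = -np.log(preds[0]) - np.log(1.0 - preds[1:]).sum()

    err = preds - labels                   # dL/dscore for each of the k+1 rows
    E[c] -= lr * (err @ u)
    # ufunc.at accumulates correctly even if a negative id is repeated.
    np.subtract.at(U, ids, lr * np.outer(err, v_c))
    return float(loss)

rng = np.random.default_rng(0)
V, d = 1000, 16
E = rng.normal(scale=0.1, size=(V, d))
U = rng.normal(scale=0.1, size=(V, d))
untouched_row = U[999].copy()

losses = [sgns_step(E, U, c=1, w=2, negatives=[10, 20, 30, 40, 50]) for _ in range(200)]
print(losses[0], losses[-1])
```

Repeating the same pair 200 times is only there to show the loss falling; real training draws fresh pairs and fresh negatives at every step. Rows of U outside the k+1 sampled ids are never modified.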
The exact same problem — a softmax over a vocabulary of 50k–250k tokens — shows up in every transformer's output layer. Modern LLMs pay the full cost there because they only do it once per output token, but the lineage of tricks (sampled softmax, sub-word tokenisation, speculative decoding) all trace back to the same scaling pressure Mikolov first hit.
Part 5 — The magic trick: vector arithmetic
Once you've trained the model, something startling falls out for free. The embeddings turn out to encode relationships as approximately constant vector offsets. The classic example:
vec("king") − vec("man") + vec("woman") ≈ vec("queen")

Equivalently: the vector that takes you from man to king is almost the same vector that takes you from woman to queen. That vector means, roughly, “add royalty.” The model was never told what royalty is; it discovered the direction from co-occurrence statistics alone. The same trick works for verb tense (walk → walked parallel to swim → swam), country/capital pairs (Paris − France + Italy ≈ Rome), and dozens of other linguistic regularities.
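The analogy query itself is a nearest-neighbour search around the arithmetic result. The sketch below uses hand-made 3-dimensional vectors (axes loosely meaning royalty, gender, fruitiness), not trained embeddings, so it only illustrates the mechanics; the `analogy` helper and the convention of excluding the three query words from the candidates are assumptions.

```python
import numpy as np

# Hand-made toy vectors: [royalty, gender, fruitiness]. Not trained embeddings.
emb = {
    "king":   np.array([0.9,  0.8, 0.0]),
    "queen":  np.array([0.9, -0.8, 0.0]),
    "man":    np.array([0.1,  0.8, 0.0]),
    "woman":  np.array([0.1, -0.8, 0.0]),
    "prince": np.array([0.7,  0.8, 0.1]),
    "banana": np.array([0.0,  0.0, 1.0]),
}

def analogy(a, b, c, emb):
    """Word whose vector is closest (by cosine) to vec(a) - vec(b) + vec(c)."""
    target = emb[a] - emb[b] + emb[c]
    best_word, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):
            continue  # skip the query words themselves
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

print(analogy("king", "man", "woman", emb))  # queen
```

With trained embeddings the match is approximate rather than exact, and excluding the query words matters because the nearest vector to the result is often one of them.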
Why does this work? A handwave that turns out to be true
Intuitively: if king and queen differ only in gender, and man and woman also differ only in gender, then both differences must point in roughly the same direction in vector space: the gender axis. Subtracting queen from king cancels everything the two have in common (royalty, humanity, age, etc.) and leaves only the gender component, and man − woman leaves that same component. So king − queen ≈ man − woman, and rearranging gives king − man + woman ≈ queen.
Formally, Levy & Goldberg (2014) showed that the skip-gram-with-negative-sampling objective is implicitly factorising a shifted PMI (pointwise mutual information) matrix of word co-occurrences. The arithmetic works because PMI of (king, royal) and PMI of (queen, royal) are similar; PMI of (king, male) and PMI of (queen, female) are also similar; and when you subtract and add, the consistent components survive.
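A sketch of the matrix in question, computed from a made-up table of co-occurrence counts. The counts, the row and column labels, and the function name `shifted_pmi` are assumptions for illustration; the shift by log k, where k is the number of negative samples, is the one from Levy & Goldberg's result.

```python
import numpy as np

def shifted_pmi(counts, k=1):
    """PMI(w, c) - log k from a word-by-context count matrix.

    PMI(w, c) = log( P(w, c) / (P(w) P(c)) ). Zero counts give -inf.
    """
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore"):  # silence the warning for log(0)
        pmi = np.log(p_wc) - np.log(p_w) - np.log(p_c)
    return pmi - np.log(k)

# Hypothetical counts. Rows: king, queen. Columns: royal, he, she.
counts = np.array([
    [8.0, 6.0, 1.0],
    [8.0, 1.0, 6.0],
])

pmi = shifted_pmi(counts, k=1)
print(pmi.round(3))
```

In this toy table king and queen have the same PMI with "royal" and mirror-image PMIs with "he" and "she", which is the pattern the analogy arithmetic relies on.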
Real corpora encode real human biases. The same vector arithmetic that gives you king − man + woman ≈ queen also gives you doctor − man + woman ≈ nurse on many corpora. word2vec is a mirror; whatever is in the text comes out in the geometry. This was one of the earliest, cleanest demonstrations of dataset bias in ML.
Part 6 — From word2vec to modern LLMs
Word2vec's central trick (learn an embedding matrix so that prediction-from-context works) is alive in every frontier model from OpenAI, Anthropic, and Google. What changed?
1. Tokens instead of words. Modern tokenisers (BPE, WordPiece, SentencePiece) split text into sub-words like ▁agent, ▁swarm. The embedding matrix becomes a token-vector matrix, but the shape is identical: vocab_size × d.
2. Context-dependent vectors. Word2vec gives each word a single static vector, so bank (river) and bank (money) collide. Transformers solve this with self-attention: each token's output vector is a weighted sum of every other token's vector in the sequence, so the representation of bank literally depends on whether river or deposit is nearby.
3. Deeper stacks. Word2vec is one shallow layer of embeddings + one projection. A transformer is dozens of attention + feed-forward layers operating on those same starting vectors. The first matrix lookup is pure word2vec; everything after is post-processing.
4. Much bigger d and V. Frontier LLMs use vocabularies of 100k–250k tokens and embedding dimensions of 4k–18k. Same shape, three orders of magnitude bigger.
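The bank example from the list above can be sketched with a stripped-down attention layer. This is a simplification under stated assumptions: a single head, identity query/key/value projections instead of learned ones, no positional information, and hand-made 3-dimensional static vectors (axes loosely meaning water, money, function word). In this simplified form each token also attends to itself.

```python
import numpy as np

def self_attention(X):
    """Single-head scaled dot-product self-attention with identity projections."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                 # token-to-token similarity
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax per row
    return weights @ X                            # weighted sum of token vectors

# Hand-made static vectors: [water, money, function word].
static = {
    "the":     np.array([0.0, 0.0, 1.0]),
    "river":   np.array([1.0, 0.0, 0.0]),
    "deposit": np.array([0.0, 1.0, 0.0]),
    "bank":    np.array([0.5, 0.5, 0.0]),  # ambiguous between the two senses
}

def contextual(sentence):
    X = np.stack([static[w] for w in sentence])
    return self_attention(X)

bank_near_river = contextual(["the", "river", "bank"])[2]
bank_near_deposit = contextual(["the", "deposit", "bank"])[2]

print(bank_near_river.round(3))
print(bank_near_deposit.round(3))
```

The static vector for bank is the same in both sentences, but its output leans toward the water axis next to river and toward the money axis next to deposit.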
Where word2vec lives today
Beyond being the conceptual ancestor of every LLM embedding layer, word2vec — and its close cousins GloVe and FastText — are still the right tool for plenty of jobs:
- Lightweight semantic search when you need millions of vectors on a CPU and don't want to pay for a transformer encoder.
- RAG candidate retrieval where dense retrievers like BGE, E5, and OpenAI's embedding APIs are direct descendants: same text → vector interface, much better quality, courtesy of a transformer encoder trained on the same kind of contrastive objective negative sampling pioneered.
- Recommender systems: item2vec, user2vec, prod2vec. Same algorithm, products instead of words.
- Graph embeddings: node2vec, DeepWalk — random walks on a graph become 'sentences', then it's just skip-gram again.
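The graph case from the list above can be sketched as follows: uniform random walks turn a graph into "sentences" of node ids, which could then be fed to any skip-gram trainer. The toy graph, the function name `random_walks`, and the walk parameters are assumptions for illustration. This is the uniform walk used by DeepWalk; node2vec biases the walk with extra parameters that are not shown here.

```python
import random

def random_walks(graph, walk_length=5, walks_per_node=2, seed=0):
    """Generate uniform random walks; each walk is a list of node ids."""
    rng = random.Random(seed)
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = graph[walk[-1]]
                if not neighbours:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks

# Toy undirected graph as an adjacency list.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

walks = random_walks(graph)
for walk in walks[:3]:
    print(" ".join(walk))
```

Nodes that sit in similar parts of the graph show up in similar walk contexts, which is the distributional hypothesis applied to graphs.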
Every time your agent retrieves a memory by embedding similarity, every time your RAG pipeline finds a relevant chunk, every time a transformer attends to a token — you are using a refined, scaled-up version of the same idea Mikolov shipped in 2013. The matrix has gotten bigger; the trick is the same.
Part 7 — A reading order if you want to go deeper
- Mikolov et al., 2013a: Efficient Estimation of Word Representations in Vector Space. Introduces CBOW and Skip-gram. Short, no fluff.
- Mikolov et al., 2013b: Distributed Representations of Words and Phrases and their Compositionality. Introduces negative sampling, sub-sampling of frequent words, and the phrase-detection trick.
- Levy & Goldberg, 2014: Neural Word Embedding as Implicit Matrix Factorization. The “oh, that's what's really happening” paper.
- Pennington, Socher & Manning, 2014: GloVe. A different objective (global co-occurrence factorisation) reaching embeddings of comparable quality, a strong sanity check on the whole programme.
- Devlin et al., 2018: BERT. The moment context-dependent embeddings replaced static ones as the default.
- Vaswani et al., 2017: Attention Is All You Need. The architecture that made BERT and everything after possible.
Wrapping up
Word2vec is one of those rare ideas where the implementation is short enough to read in an afternoon, the math is honest enough to actually understand, and the consequences are large enough that you're still using it — in disguise — every time you talk to an LLM. If this post was your first encounter, the best follow-up is to open a notebook and train one yourself on a few megabytes of text. You will be surprised how quickly king − man + woman starts pointing at queen in a vector space you built with your own hands.
And the next time someone tells you transformers are a totally new paradigm, you can smile and remember: the very first thing every one of them does is multiply a one-hot vector by a learned matrix. Mikolov already shipped that in 2013.
Further reading & references
- Mikolov et al. (2013) — Efficient Estimation of Word Representations (arXiv:1301.3781)
- Mikolov et al. (2013) — Distributed Representations / Negative Sampling (arXiv:1310.4546)
- Levy & Goldberg (2014) — Word Embedding as Implicit Matrix Factorization
- Pennington, Socher & Manning (2014) — GloVe
- Vaswani et al. (2017) — Attention Is All You Need
- Devlin et al. (2018) — BERT
- Original word2vec C source — Google Code archive
- AgentSwarms Notebooks — embeddings build-alongs