[{"content":" This is a flyover A high-altitude tour of the pre-training pipeline. Each stage gets its own deep dive elsewhere in the series; this post exists to give those posts shared vocabulary to refer back to.\nPre-training is the process that turns a freshly initialized neural network into a base model that can produce plausible continuations of text. Everything else built on top of an LLM (instruction tuning, RLHF, deployment) assumes pre-training has already happened.\nAt altitude, the pipeline looks like this. Text comes in. It gets chopped into tokens. Tokens become vectors. The vectors flow through a stack of transformer blocks that let every position look at every other position. The top-layer vectors produce a probability distribution over the next token. That distribution gets compared to the actual next token in the corpus, producing a loss. The loss flows backward through the network, nudging billions of parameters in the direction that reduces it. Repeat for trillions of tokens.\nThe same picture as a diagram:\nflowchart TB text[\"Raw text\"] tokens[\"Token IDs\"] embeddings[\"Embedding vectors(+ position info)\"] block[\"Transformer block × N(self-attention + MLP)\"] distribution[\"Next-tokendistribution\"] loss[\"Cross-entropy loss\"] backprop[\"Backprop + Optimizer\"] text --\u003e tokens tokens --\u003e embeddings embeddings --\u003e block block --\u003e distribution distribution --\u003e loss loss --\u003e backprop backprop -.-\u003e|\"update parameters\"| block Forward pass goes top to bottom. The dashed arrow back from the optimizer to the transformer stack is the training loop: every batch, the gradients computed from the loss are used to nudge the parameters, and the whole forward pass runs again. 
Each box maps to a deep-dive post (or set of posts) further down in the series; the section headings below each carry a forward-link.\nSee it in 3D Brendan Bycroft\u0026rsquo;s interactive LLM visualization is the most beautiful thing you can do with this material: a 3D walkthrough of a small GPT model\u0026rsquo;s full forward pass, every matrix and activation visible and explorable. Worth opening in another tab as you read this post and the deep dives that follow.\nEach stage below has, or will have, its own deep-dive post.\nText becomes tokens The model cannot read characters; it can only read integers. Tokenization is the function that turns a string into a sequence of integer IDs drawn from a fixed vocabulary, set once at training time.\nDeep dive: Tokens and Tokenization.\nTokens become embeddings Each token ID indexes into the embedding matrix, a $V \\times d$ lookup table where $V$ is the vocabulary size and $d$ is the model\u0026rsquo;s hidden dimension. The row at index $i$ is the embedding vector for token $i$: a $d$-dimensional point in a learned semantic space. The integer sequence becomes a sequence of vectors.\nDeep dive: coming soon.\nPositions Attention treats its input as a set, not a sequence. Without explicit position information, \u0026ldquo;dog bites man\u0026rdquo; and \u0026ldquo;man bites dog\u0026rdquo; look identical. Positional encodings inject information about where each token sits in the sequence, either added to the embeddings or baked into attention\u0026rsquo;s geometry.\nDeep dive: coming soon.\nThe transformer stack The vectors flow through a stack of identical transformer blocks. Each block runs self-attention (every position looks at every other position) followed by an MLP (per-position mixing). Layer norms and residual connections hold the math stable across many layers. 
By the time the sequence reaches the top, each vector is a context-aware representation of its position given everything around it.\nDeep dive: coming soon.\nFrom vector to next-token distribution The final layer takes each vector through the output matrix (also $V \\times d$, often weight-tied to the embedding matrix) and produces a $V$-dimensional vector of logits, raw scores over the vocabulary. A softmax turns those scores into a probability distribution. That is the model\u0026rsquo;s prediction for the next token at each position.\nDeep dive: coming soon.\nThe loss Pre-training optimizes one objective: cross-entropy between the model\u0026rsquo;s predicted distribution and the actual next token in the corpus. For each position in training data, the loss is $-\\log p(\\text{actual next token})$. Summed over the corpus, this is what the optimizer minimizes.\nDeep dive: coming soon.\nBackprop and the optimizer The loss flows backward through every layer of the network, computing gradients with respect to every parameter via backpropagation. An optimizer (almost always AdamW for modern LLMs) uses those gradients to update parameters. One step of this loop processes a batch; pre-training runs millions of steps.\nDeep dive: coming soon.\nScale Pre-training a modern LLM is a feat of scale: trillions of training tokens, thousands of GPUs, months of wall-clock time. The math does not change with scale; the engineering does. Data pipelines that keep GPUs fed, parallelism schemes that split a model across thousands of devices, mixed-precision math, checkpointing.\nDeep dive: coming soon.\nThe rest of this phase of the series unpacks each stage in depth. 
Future deep-dive posts can refer back here when they need to gesture at \u0026ldquo;the model\u0026rdquo; or \u0026ldquo;the loss\u0026rdquo; or \u0026ldquo;the embedding matrix\u0026rdquo; without re-explaining them every time.\nA note on series terminology \u0026ldquo;Pre-training\u0026rdquo; in this series labels Phase 1: the architectural foundations of model building (tokens, embeddings, attention, the transformer). The training loop itself, where gradient descent runs over trillions of tokens, gets its own phase (\u0026ldquo;Training\u0026rdquo;). In ML literature these are usually folded together as \u0026ldquo;pre-training\u0026rdquo;; I\u0026rsquo;m splitting them here for pedagogical clarity.\n","permalink":"https://bearisland.dev/posts/pretraining-overview/","summary":"\u003caside class=\"callout callout-info\"\u003e\n\u003cdiv class=\"callout-title\"\u003e\n\u003cspan class=\"callout-icon\" aria-hidden=\"true\"\u003e\u003csvg width=\"14\" height=\"14\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2.4\" stroke-linecap=\"round\" stroke-linejoin=\"round\"\u003e\u003ccircle cx=\"12\" cy=\"12\" r=\"10\"/\u003e\u003cline x1=\"12\" y1=\"16\" x2=\"12\" y2=\"12\"/\u003e\u003cline x1=\"12\" y1=\"8\" x2=\"12.01\" y2=\"8\"/\u003e\u003c/svg\u003e\u003c/span\u003e\n\u003cspan\u003eThis is a flyover\u003c/span\u003e\n\u003c/div\u003e\n\u003cdiv class=\"callout-body\"\u003e\n\u003cp\u003eA high-altitude tour of the pre-training pipeline. Each stage gets its own deep dive elsewhere in the series; this post exists to give those posts shared vocabulary to refer back to.\u003c/p\u003e\n\n\u003c/div\u003e\n\u003c/aside\u003e\n\n\u003cp\u003ePre-training is the process that turns a freshly initialized neural network into a base model that can produce plausible continuations of text. 
Everything else built on top of an LLM (instruction tuning, RLHF, deployment) assumes pre-training has already happened.\u003c/p\u003e","title":"Pre-training: An Overview"},{"content":"Ask GPT-4 how many r\u0026rsquo;s are in \u0026ldquo;strawberry\u0026rdquo; and it will confidently say two. The right answer is three. This isn\u0026rsquo;t because the model can\u0026rsquo;t count. It\u0026rsquo;s because it never sees the letters at all.\nEvery Large Language Model (LLM) starts with the same operation: text comes in, gets chopped into chunks called tokens, and those chunks become integer IDs that index into an embedding matrix. The chunks aren\u0026rsquo;t characters and they aren\u0026rsquo;t words. They\u0026rsquo;re something more specific, and the specificity matters more than most people realize.\nWhat a \u0026ldquo;token\u0026rdquo; really is Most people first meet the word \u0026ldquo;token\u0026rdquo; through prices and limits: \u0026ldquo;1,500 tokens used\u0026rdquo;, \u0026ldquo;the context window is 128K tokens\u0026rdquo;. Those numbers are real, but they hide what a token actually is.\nA token is the smallest unit of input a specific model can perceive. Each model has its own fixed list of tokens, called its vocabulary, decided once at training time. GPT-4\u0026rsquo;s vocabulary isn\u0026rsquo;t Claude\u0026rsquo;s. Claude\u0026rsquo;s isn\u0026rsquo;t Llama\u0026rsquo;s.\nWhen you send text to a model, the text gets chopped into pieces from that model\u0026rsquo;s vocabulary, and each piece is swapped for an integer ID. Only those IDs ever reach the model. The model never sees text. It sees a sequence of integer indices into its own private alphabet.\nSo tokens aren\u0026rsquo;t \u0026ldquo;roughly like words\u0026rdquo; or \u0026ldquo;kind of like characters\u0026rdquo;. They\u0026rsquo;re the atoms of perception for one specific model, and they\u0026rsquo;re the only language that model speaks. 
Two models fed the same English sentence will produce two different integer sequences, often of different lengths:\n\"I love strawberry milkshakes!\"\nGPT-4: I ·love ·str aw berry ·milk sh akes ! (9 tokens)\nLlama 3: I ·love ·straw berry ·milk shakes ! (7 tokens)\nEach chip is one token. · marks a leading space (so ·love is the token for love preceded by a space, distinct from the token for love with no space). Splits are approximate; the interactive playground at the end of the post shows exact tokenization. The same sentence is nine tokens to GPT-4 and seven tokens to Llama 3. Not because Llama is smarter or the sentence changed, but because the two models have different vocabularies. To GPT-4, the token ·straw doesn\u0026rsquo;t exist as a single chunk, so \u0026ldquo;strawberry\u0026rdquo; splits across three pieces. Llama 3\u0026rsquo;s vocabulary happens to include ·straw, so it gets through in two.\nHere\u0026rsquo;s GPT-4\u0026rsquo;s actual tokenizer running in your browser. Type anything: your name, a strange word, a sentence in another language. Each chip below is one token.\n[Interactive widget: GPT-4 tokenizer (cl100k_base)]\nHow does a model end up with one specific vocabulary instead of another? The dominant algorithm is Byte Pair Encoding, or BPE.\nBPE, the algorithm BPE is an algorithm for deciding which subword chunks deserve to be tokens, given a corpus and a target vocabulary size. (A corpus is the dataset of text used to train the tokenizer and the model: typically a giant mix of web pages, books, code, and other text. For modern models it\u0026rsquo;s measured in trillions of tokens.) It starts small and grows the vocabulary one merge at a time, always merging the most frequent adjacent pair in the corpus.\nThe whole algorithm fits on a sticky note.\nThe setup. You have:\nA corpus to tokenize. A target vocabulary size $V$ (a number you choose; typical values run from about 30,000 to 256,000). 
You want to end up with a list of $V$ tokens such that common substrings (the, ing, to) get their own token, so common text compresses into short sequences. Rare substrings decompose into smaller pieces, down to single characters in the worst case, so nothing seen in the corpus is ever out-of-vocabulary.\nThe algorithm.\n1. Initialize the vocabulary as every distinct character in the corpus.\n2. Scan the corpus and count every adjacent pair of tokens.\n3. Take the most frequent pair, merge it into a new token, and add it to the vocabulary.\n4. Repeat steps 2 and 3 until the vocabulary has $V$ entries.\nThat\u0026rsquo;s it. No clever scoring, no neural network, no second pass. (A neural network is a computational model made of layers of trainable mathematical functions whose parameters are tuned to fit data. Modern LLMs are massive neural networks. BPE, by contrast, is plain bookkeeping with no learned parameters.) The \u0026ldquo;merge\u0026rdquo; in step 3 doesn\u0026rsquo;t do anything sophisticated. It just declares: from now on, whenever you see t followed by h in this corpus, treat them as one symbol called th.\nTwo details matter:\nThe originals don\u0026rsquo;t disappear: when t and h get merged into th, all three are now in the vocabulary. If a word later happens to use t followed by some other character, the tokenizer can still represent it. The vocabulary grows monotonically.\nPairs get re-counted after each merge: once th is a token, the next iteration might find that th + e is the new top pair → merge → the. Then · + the → ·the, the word with its leading space attached. Multi-character common words emerge from running the same 4-step loop with no extra cleverness. The vocabulary builds combinatorially.\nA worked example Real corpora are noisier than this: the corpus below is a stylized word-frequency table. Real corpora are raw text. Production BPE runs a pre-tokenization step that turns text into a table like this before any merging happens. The pipeline runs raw text → pre-tokenizer (regex) → word frequencies → BPE. The three stages:\nPre-tokenize. 
Split raw text into \u0026ldquo;words\u0026rdquo; with a regex. GPT-2\u0026rsquo;s famous pattern matches letter sequences, number sequences, contractions, and runs of punctuation. Each match becomes a \u0026ldquo;word\u0026rdquo;. Count. Tally unique words and their frequencies. The result is a table that looks like the one below. BPE. Run the 4-step merge algorithm on that table. Pre-tokenization does two things. It prevents merges from crossing word boundaries (so you don\u0026rsquo;t get nonsense tokens like the cat), and it makes the algorithm tractable: counting pairs on a billion-row word table is dramatically cheaper than on a trillion-character stream.\nSentencePiece (used by Llama, T5, Gemma) is the notable exception. It skips pre-tokenization, treats the raw stream as input, and replaces spaces with a special character ▁. That works better for languages without explicit word boundaries like Chinese and Japanese. We\u0026rsquo;ll get into the variants later.\nLet me run it on a tiny corpus: just two words, cat appearing 3 times and mat appearing 2 times.\ncat × 3 mat × 2 The initial vocabulary is the 4 distinct characters that appear: c, a, t, m. Every word starts as a sequence of single-character tokens.\nInitial state cat ×3 c a t mat ×2 m a t Vocabulary (4 tokens) c a t m Iteration 1. Count every adjacent pair, weighted by word frequency:\npair count (c, a) 3 (a, t) 3+2 = 5 (m, a) 2 Winner: (a, t) → at. The suffix at appears in both words, which is why it scores highest. Merge it:\nAfter merge (a, t) → at cat ×3 c at mat ×2 m at Vocabulary (5 tokens) c a t m at Iteration 2. Re-count:\npair count (c, at) 3 (m, at) 2 (c, at) → cat wins because cat is the more frequent word. Merge:\nAfter merge (c, at) → cat cat ×3 cat mat ×2 m at Vocabulary (6 tokens) c a t m at cat After two merges the vocabulary holds 6 tokens: c, a, t, m, at, cat. Notice what just happened. The word cat now tokenizes to a single token. 
The word mat still takes two tokens (m + at), because BPE judged cat worth its own ID but not yet mat. In a larger corpus where mat was more common, it would eventually merge too. This is exactly what real tokenizers look like: common words collapse to one token, rarer words decompose into shared subword pieces like the at suffix.\nA two-word corpus only takes the algorithm so far. Let\u0026rsquo;s step through a richer four-word corpus to watch meaningful subwords emerge.\nVocabulary So the whole algorithm is bookkeeping. No machine learning, no scoring functions. The structure that emerges (suffixes like est, common words like low, eventually multi-character tokens for frequent words like the, ing, tion) is a direct snapshot of the corpus\u0026rsquo;s frequency statistics.\nByte-level BPE Look back at one line from the algorithm: \u0026ldquo;the initial vocabulary is every distinct character in the corpus\u0026rdquo;. That works fine if the corpus is plain English with no surprises. The moment you feed BPE the actual internet (Chinese, emoji, code, accented letters, rare Unicode codepointsUnicode\u0026rsquo;s numeric IDs for characters, written as U+XXXX in hex. E.g. U+0041 for A, U+1F353 for 🍓. About 150,000 codepoints in total, covering every script, symbol, and emoji.), the \u0026ldquo;distinct characters\u0026rdquo; set explodes, and worse: any rare codepoint the corpus didn\u0026rsquo;t include is still out-of-vocabulary at the character level.\nGPT-2 introduced a fix that\u0026rsquo;s now near-universal: don\u0026rsquo;t start with characters. Start with bytesA byte is just 8 bits, a number from 0 to 255. Everything stored on a computer (text, images, programs) ultimately lives as a sequence of bytes; text is just a particular interpretation of byte sequences via an encoding like UTF-8..\nThere are exactly 256 possible byte values, so:\nThe initial vocabulary is fixed at 256, regardless of corpus. Every byte is in the vocabulary, by definition. 
Any text representable on a computer is, by definition, a byte sequence. Out-of-vocabulary is eliminated by construction. The worst case for any input is \u0026ldquo;fall back to bytes\u0026rdquo;. The UTF-8 wrinkle. Most modern text is encoded as UTF-8A variable-length encoding that maps each Unicode character to 1 to 4 bytes. ASCII takes 1 byte, most European scripts 2, most Asian scripts 3, emoji 4., where each Unicode character becomes a sequence of 1 to 4 bytes:\ncharacter bytes (hex) bytes A 41 1 é C3 A9 2 中 E4 B8 AD 3 🍓 F0 9F 8D 93 4 ASCII is just \u0026ldquo;UTF-8 where every character is one byte\u0026rdquo;, so plain English text is unchanged. But 中 enters the tokenizer as the 3-byte sequence E4 B8 AD, not as a single character.\nAfter BPE training on a multilingual corpus, the merges could end up producing a single token for the sequence E4 B8 AD. Those three bytes always appear together in any valid UTF-8 encoding of 中. The byte triple gets compressed into a \u0026ldquo;character-shaped\u0026rdquo; token via merging, the same way est and low did in the English example. The algorithm doesn\u0026rsquo;t change. We just swapped the starting alphabet.\nInput: \"Hello 🍓!\" Character-level H e l l o · ⚠ UNK ! 8 tokens 🍓 isn't in the vocabulary. Replaced with \u0026lt;UNK\u0026gt;. The character is lost — the model can never recover it. Byte-level H e l l o · F0 9F 8D 93 ! 11 tokens 🍓 decomposes into 4 bytes F0 9F 8D 93. Every byte is in the vocabulary by construction. Nothing is lost. Same input, two tokenizers. The character-level one fails on any character it wasn\u0026rsquo;t trained to know. The byte-level one cannot fail. Byte-level BPE pays in tokens to win in coverage:\nThe cost: non-ASCII text uses more tokens when the training corpus underrepresents the script. A Chinese sentence run through an English-heavy model decomposes into byte-level chunks rather than character-shaped tokens. Same string, more tokens. 
This is why API pricing tends to hit Chinese, Arabic, and Hindi harder than English. The guarantee: nothing is ever out-of-vocabulary. The starting vocabulary is fixed at 256 entries, every byte sequence is representable by construction, and there\u0026rsquo;s no \u0026lt;UNK\u0026gt; token to lose information to. Once you internalize that the model literally never sees characters (only integer IDs corresponding to byte sequences that may or may not align with human characters), a bunch of LLM weirdness stops being mysterious. The strawberry problem is one of them. We\u0026rsquo;ll get there.\nVocabulary size as a design knob Vocabulary size $V$ (the number of distinct tokens in the model\u0026rsquo;s vocabulary) is a hyperparameter, meaning it is set by hand before training rather than learned from data. The obvious instinct is that bigger should be better, since common substrings collapse into single tokens and text compresses into shorter sequences. So why do real models stop at 32K to 256K? Why not a vocabulary of a million tokens, or ten million?\nThe short answer: $V$ controls three different costs at once and only one benefit, and the cost quickly becomes severe.\nAlongside $V$ sits one other number that shows up in nearly every formula below: $d$, the model\u0026rsquo;s hidden dimension. It\u0026rsquo;s the width of every vector the model passes around internally. For a 7B-class model $d$ is around 4,096; for 70B-class models it grows to 8,192. Bigger $d$ gives vectors more room to encode meaning, but compute grows with $d^2$. Most of the formulas below are some flavor of $V \\cdot d$.\nA quick clarification on those size labels: \u0026ldquo;7B\u0026rdquo; means the model has 7 billion learned parameters in total, \u0026ldquo;70B\u0026rdquo; means 70 billion. That total is a fixed budget the whole model has to share. 
Even the vocab tables we\u0026rsquo;re about to discuss come out of it: every parameter the designer spends on one part of the model is a parameter that cannot go to another part.\nVectors, dimensions, and matrix shapes What dimension $d$ actually means, and what the product $V \\cdot d$ is counting. Show details▾ A vector is an ordered list of numbers. When we say a vector has dimension $d = 4{,}096$, we mean it holds 4,096 numbers in a row. The order matters: each position carries its own meaning. Every token in an LLM becomes a vector, and the model does its math on those vectors instead of on the text.\nA matrix is a stack of vectors. A matrix of shape $V \\times d$ has $V$ rows, each one a $d$-dimensional vector. Its total size, the product $V \\cdot d$, is just the count of numbers stored in it.\nFor example, LLaMA 2 has $V = 32{,}000$ and $d = 4{,}096$, so its embedding matrix holds $V \\cdot d = 32{,}000 \\times 4{,}096 \\approx 131{,}000{,}000$ numbers. Every one is a parameter the model has to learn during training.\nThe benefit: compression. Bigger $V$ means more common substrings get their own token, which means a given document encodes into fewer tokens. Shorter sequences are worth a lot:\nLess work per document: the model processes fewer tokens to read the same text. More content per budget: a fixed input window holds more real text. Lower compute cost: both training and inference scale with token count, so each gets cheaper. Cost 1: embedding matrices. Every token needs its own row in the embedding matrix, which has shapeA matrix\u0026rsquo;s shape names its row and column counts. Shape $V \\times d$ means $V$ rows and $d$ columns, holding $V \\cdot d$ numbers in total. $V \\times d$. There\u0026rsquo;s also a matching output matrix at the top of the model that projects each final vector back to a $V$-dimensional distribution over the vocabulary. That matrix is also $V \\times d$. 
So just the vocab tables cost:\n$$\\text{vocab parameters} = 2 \\cdot V \\cdot d$$\n(Some models weight-tie embedding and output, cutting this to $V \\cdot d$. Weight tying is a common trick where the same matrix is used both as the embedding lookup at the input and as the output projection at the top of the model, cutting the parameter cost in half. The two ends of the model share the same vocabulary, so reusing the matrix mostly works. The principle is the same.)\nWith $d = 4{,}096$:\n$V$ | model | vocab parameters\n32,000 | LLaMA 2 | 262 M\n128,000 | LLaMA 3 | 1.05 B\n256,000 | Gemini | 2.10 B\n1,000,000 | hypothetical | 8.19 B\nAt $V = 1\\text{M}$, you\u0026rsquo;ve spent the parameter count of an entire 8B-class model on lookup tables alone. None of that capacity goes to the rest of the model, where the actual processing happens. Every parameter spent on $V$ is a parameter you cannot spend on reasoning capacity.\nThe benefit shrinks with each new token. Cost grows linearly with $V$: every new token in the vocabulary costs the same. Benefit does not. Real text is dominated by a small number of very common tokens, so once those have their own vocabulary entries, each additional token covers vanishing additional content.\nEmpirically, English text tokenizes to roughly:\n$V$ | tokens per word | comment\n1,000 | $\\approx$ 5 | essentially character-level\n30,000 | $\\approx$ 1.3 | common words are one token\n100,000 | $\\approx$ 1.15 | most words and common phrases consolidate\n1,000,000 | $\\approx$ 1.05 | tiny extra gain, huge extra cost\nCompression gain scales roughly with $\\log V$.\nSo the central tradeoff is on the table: parameter cost rising linearly with $V$, compression gain rising only with $\\log V$. Before stacking on the other costs, the two curves are worth a long look side by side.\n[Interactive figure: a vocabulary-size dial for $V$, with parameter cost and tokens per word each plotted from 1K to 1M.]\nThe asymmetry is plain. 
Past 100K to 256K, you pay linear cost for vanishing additional gain. But two more cost considerations haven\u0026rsquo;t entered the picture yet, and both pull the optimum further toward smaller $V$.\nCost 2: rare tokens barely get trained. A token\u0026rsquo;s row in the embedding matrix only gets trained on the times that token appears in the data. A token that shows up millions of times gets a well-trained embedding. A token that shows up a handful of times barely gets trained at all.\nReal text is brutally skewed. Zipf's lawAn empirical regularity in natural language: the kth most frequent word (or token) appears about 1/k as often as the most frequent one. Holds, roughly, across languages, corpora, and subword tokenization. says the $k$-th most common token appears proportional to $1/k$ as often as the most common. Practically:\nThe top 1,000 tokens cover roughly 80% of all text. The top 10,000 cover something like 95%. Everything beyond is the long tail. On a 1 trillion-token training corpus (a typical pre-training scale):\n$V = 32\\text{K}$: even the rarest tokens see tens of thousands of updates. Embeddings converge. $V = 1\\text{M}$: hundreds of thousands of long-tail tokens see only 10 to a few hundred updates each. Those embeddings stay close to their random initialization. The parameters are allocated but never learn anything useful. Cost 3: each prediction gets expensive. Every time the model picks the next token, it first produces a probability distribution over all $V$ tokens in the vocabulary: its prediction for what comes next. 
To produce that distribution, it computes a score for every token (a $V \\times d$ matrix multiplication, called the unembedding), then normalizes those $V$ scores into probabilities through a softmaxA function that converts a list of raw scores into probabilities: bigger scores get bigger probabilities, and the results all sum to 1..\nThe $V \\times d$ matrix has $V \\cdot d$ entries, and producing each prediction means touching every one of them, costing $V \\cdot d$ basic arithmetic operations. Each layer in the rest of the model costs something like $12 \\cdot d^2$ operations per token, give or take, depending on the architecture. The two are comparable when $V$ is around $12 \\cdot d$. For $d = 4{,}096$ that crossover lands somewhere near $V = 50{,}000$. Beyond it, the prediction is one of the most expensive single operations the model does per token. Training also gets harder: the model has to learn to pick the right token from more options.\nPutting it together. Two competing curves:\n$$ \\text{parameter cost} \\propto V, \\qquad \\text{compression gain} \\propto \\log V $$\nThat gives a clear Pareto frontierA term from optimization: the curve of best-possible tradeoffs between two competing objectives. At any point on it, you cannot improve one objective without hurting the other.. At small $V$ (below 30K), spending a small extra parameter budget yields big compression gains: spend more. At large $V$ (above 256K), spending huge extra parameters yields almost nothing: stop. The sweet spot is wherever marginal cost matches marginal gain.\nWhere real models land. For modern hardware and modern $d$ values, the empirical answer sits in the 30K to 256K range:\nmodel $V$ comment LLaMA 1 / LLaMA 2 32,000 English-focused, parameter-efficient GPT-2 50,257 GPT-4 (cl100k_base) $\\approx$ 100,000 LLaMA 3 128,256 jumped specifically for multilingual coverage Gemini 256,000 heavy multilingual The dominant pressure pushing $V$ up is multilingual coverage. 
Each new script (Cyrillic, Arabic, Devanagari, Chinese-Japanese-Korean or CJK) wants its own token budget. The alternative, decomposing those scripts into bytes via the byte-level BPE we just saw, bloats sequence length unacceptably for users writing in those languages.\nVariants: BPE, WordPiece, and SentencePiece So far we\u0026rsquo;ve focused on BPE, since it\u0026rsquo;s the dominant algorithm and the one used directly by most modern models. But it\u0026rsquo;s not the only one. Two variants share the rest of the landscape, and both are common enough that any survey of LLM tokenization has to cover them.\nAll three solve the same problem: split text into subword chunks drawn from a fixed vocabulary. They differ on two axes: what they merge (the scoring function for picking which pair to combine) and how they treat the raw text (whether they pre-tokenize before merging).\nBPE. Already covered above. Frequency-based: at each iteration, merge the most common adjacent pair. Runs on pre-tokenized words (regex-split on whitespace and punctuation first, GPT-style). The most common standalone choice, used directly by the GPT family.\nWordPiece. Google\u0026rsquo;s variant, originally introduced for speech recognition and later adopted by BERTBidirectional Encoder Representations from Transformers: a 2018 Google language model focused on understanding text rather than generating it. Predates the modern LLM wave but is still widely used for classification, search, and similar tasks where you want to analyze text rather than produce it.. Same overall loop as BPE: start with characters, count pairs, merge the top pair, repeat. What changes is the scoring function. Where BPE picks the pair with highest joint count, WordPiece picks the pair whose merging most increases the corpus\u0026rsquo;s likelihood under a unigram model. 
Concretely:\n$$\\text{score}(a, b) = \\frac{\\text{count}(a, b)}{\\text{count}(a) \\cdot \\text{count}(b)}$$\nThe numerator (joint count) rewards pairs that appear together often. The denominator (product of individual counts) penalizes pairs whose pieces are already frequent on their own. Intuition: BPE picks the most frequent pair; WordPiece picks the \u0026ldquo;stickiest\u0026rdquo; pair, the one whose pieces co-occur more often than chance would predict. In practice the two produce vocabularies that look very similar at the same target size, with subtle differences in how low-frequency content gets handled. Used by BERT, DistilBERT, ELECTRA, and most of the BERT family. RoBERTa is the notable exception: it uses GPT-2-style byte-level BPE.\nSentencePiece. Google\u0026rsquo;s other tokenizer, and a bigger philosophical departure. The shift is that SentencePiece skips pre-tokenization entirely. It reads raw text directly, treating whitespace as just another character (rendered visibly as ▁, the lower-one-eighth block).\nUnderneath, SentencePiece can run BPE or a different algorithm called the unigram language model (a probabilistic alternative we won\u0026rsquo;t go into here). Either way, the no-pre-tokenization choice is the defining feature.\nWhy does that matter? Because pre-tokenization assumes whitespace marks word boundaries, and that assumption breaks for scripts that do not separate words with spaces. Chinese, Japanese, Thai, and Khmer do not put spaces between words. A BPE tokenizer that pre-splits on whitespace will treat an entire Chinese paragraph as a single \u0026ldquo;word\u0026rdquo;, which breaks the merge logic. SentencePiece sidesteps the problem by not assuming whitespace means anything special.\nThe trade-off: SentencePiece adds the ▁ character to tokens that begin a new \u0026ldquo;word\u0026rdquo; (tokens preceded by a space in the original text), so the original whitespace can be recovered from the token sequence. Spaces are not lost; they are encoded into the tokens themselves. 
Used by LLaMA (with BPE underneath), T5 and mT5 (with the unigram model underneath), Gemma, and most multilingual models.\nThe practical landscape. Which family uses what:\nModel family | Tokenizer\nGPT (GPT-2, GPT-3, GPT-4) | BPE (with pre-tokenization)\nBERT family (BERT, DistilBERT, ELECTRA) | WordPiece\nRoBERTa | byte-level BPE (GPT-2 style)\nLLaMA, Gemma | SentencePiece (BPE underneath)\nT5, mT5 | SentencePiece (unigram underneath)\nOne useful note for reading papers and model cards: when something says \u0026ldquo;our tokenizer is SentencePiece\u0026rdquo;, that\u0026rsquo;s a toolkit claim. The underlying algorithm is usually still BPE, with the unigram language model as the main alternative. The SentencePiece-vs-BPE choice is mostly about pre-tokenization handling and multilingual support, not the merge algorithm itself.\nThe strawberry investigation The most famous demonstration of tokenization\u0026rsquo;s hidden weirdness is the strawberry problem. Ask a language model how many r\u0026rsquo;s are in \u0026ldquo;strawberry\u0026rdquo; and watch it confidently miss.\nUser: How many r\u0026rsquo;s are in \u0026ldquo;strawberry\u0026rdquo;?\nModel: There are 2 r\u0026rsquo;s in \u0026ldquo;strawberry\u0026rdquo;.\nThe right answer is three: s-t-r-a-w-b-e-r-r-y. This is not the model being stupid. It\u0026rsquo;s the model running into the limits of its own perception.\nWhy it happens. The model never sees the letters. By the time the word reaches the transformer (the neural-network architecture at the heart of every modern LLM: stacked layers that pass token vectors through attention, which mixes information across positions, and feed-forward networks, which mix information within each position; covered in depth later in the series), it has already been chopped into tokens. GPT-4\u0026rsquo;s tokenizer splits \u0026ldquo;strawberry\u0026rdquo; into three pieces:\n\"strawberry\" through GPT-4's tokenizer: str aw berry (3 tokens). The three r's are distributed across the tokens: one in str, two in berry. The model never sees them as separate letters at all. Each token is an integer ID. 
The model never sees s, t, r, a, w, b, e, r, r, y individually. From its perspective, the input is three opaque IDs.\nTo answer \u0026ldquo;how many r\u0026rsquo;s are in strawberry\u0026rdquo;, the model would have to:\n1. Spell out each token internally: str → s, t, r. aw → a, w. berry → b, e, r, r, y.\n2. Count r\u0026rsquo;s per token: 1, 0, 2.\n3. Add: 3.\nNone of those steps is a native operation for a token-level predictor. The model has to know from training what letters live inside each token, then perform multi-step counting reasoning that it has no built-in primitive for.\nA useful analogy. Imagine you only know a word by hearing it spoken, never by reading it. Someone asks how many c\u0026rsquo;s are in \u0026ldquo;macchiato\u0026rdquo;. You can probably get there, but only by mentally rehearsing the spelling first and then counting. The LLM is in that situation for every word it processes.\nModern frontier models often pass. Ask GPT-4, Claude, or Gemini today and they will usually answer three. They get there through some combination of:\nTargeted training: specific letter-counting examples included in post-training data.\nChain-of-thought reasoning: the model spells the word out letter by letter in its own output, then counts.\nTool use: calling a spelling utility.\nBut the underlying limitation has not gone away. Smaller open models still fail. Subtler letter-level questions (counting double letters, finding the second-to-last consonant, judging whether two words rhyme exactly) can still trip up frontier models. The mitigations are scaffolding around a token-level core that simply does not have letter-level structure.\nThe broader point. The strawberry failure is the cleanest demonstration of a deeper truth: an LLM\u0026rsquo;s \u0026ldquo;alphabet\u0026rdquo; is its token vocabulary, and that alphabet does not decompose into letters. 
Anything that needs character-level operations (counting letters, finding anagrams, detecting palindromes, judging rhymes, manipulating spelling, swapping letter positions) is uphill for the architecture.\nOnce you internalize that, a whole class of \u0026ldquo;the model is dumb\u0026rdquo; reactions stops being mysterious.\nSummary\nWhat we covered:\nTokens are the model\u0026rsquo;s atoms of perception. Each model has its own vocabulary, decided once at training time. Two models given the same sentence produce different integer sequences.\nBPE produces that vocabulary by repeatedly merging the most frequent adjacent pair, starting from raw characters or bytes. Simple bookkeeping, no machine learning underneath.\nByte-level BPE eliminates out-of-vocabulary failures. Starting from the 256 possible bytes guarantees any input is representable.\nVocabulary size $V$ is a design knob. Parameter cost grows linearly with $V$, compression gain grows roughly with $\\log V$. Real models land at 30K to 256K, with multilingual coverage pushing the upper end.\nWordPiece and SentencePiece are variants on the same merge-based core. WordPiece changes the scoring; SentencePiece changes how raw text is handled.\nThe model is blind to letters. It never sees them, only tokens. Any character-level task (counting, anagrams, rhymes) is uphill for a token-level architecture.\nThe model now reads text as a sequence of integer IDs. But an integer alone tells the model nothing about how one token relates to another. To do anything useful with those IDs (compute similarities, predict the next token, propagate gradients), they have to become vectors. That\u0026rsquo;s where this series goes next.\n","permalink":"https://bearisland.dev/posts/tokens-and-tokenization/","summary":"\u003cp\u003eAsk GPT-4 how many r\u0026rsquo;s are in \u0026ldquo;strawberry\u0026rdquo; and it will confidently say two. The right answer is three. This isn\u0026rsquo;t because the model can\u0026rsquo;t count. 
It\u0026rsquo;s because it never sees the letters at all.\u003c/p\u003e\n\u003cp\u003eEvery Large Language Model (LLM) starts with the same operation: text comes in, gets chopped into chunks called \u003cem\u003etokens\u003c/em\u003e, and those chunks become integer IDs that index into an \u003ca href=\"https://bearisland.dev/posts/pretraining-overview/#tokens-become-embeddings\"\u003eembedding matrix\u003c/a\u003e. The chunks aren\u0026rsquo;t characters and they aren\u0026rsquo;t words. They\u0026rsquo;re something more specific, and the specificity matters more than most people realize.\u003c/p\u003e","title":"Tokens and Tokenization"}]